[2/2] ethdev: fix race condition in fast-path ops setup

Message ID 20230220060839.1267349-2-ashok.k.kaladi@intel.com (mailing list archive)
State New
Delegated to: Thomas Monjalon
Headers
Series [1/2] eventdev: fix race condition in fast-path set function |

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Performance fail Performance Testing issues
ci/iol-intel-Functional success Functional Testing PASS
ci/github-robot: build success github build: passed
ci/iol-intel-Performance success Performance Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/iol-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-abi-testing success Testing PASS

Commit Message

Ashok Kaladi Feb. 20, 2023, 6:08 a.m. UTC
  If ethdev enqueue or dequeue function is called during
eth_dev_fp_ops_setup(), it may get pre-empted after setting
the function pointers, but before setting the pointer to port data.
In this case the newly registered enqueue/dequeue function will use
dummy port data and end up in seg fault.

This patch moves the updation of each data pointers before updating
corresponding function pointers.

Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate structure")
Cc: stable@dpdk.org

Signed-off-by: Ashok Kaladi <ashok.k.kaladi@intel.com>
  

Comments

Chengwen Feng Feb. 20, 2023, 6:57 a.m. UTC | #1
On 2023/2/20 14:08, Ashok Kaladi wrote:
> If ethdev enqueue or dequeue function is called during
> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> the function pointers, but before setting the pointer to port data.
> In this case the newly registered enqueue/dequeue function will use
> dummy port data and end up in seg fault.
> 
> This patch moves the updation of each data pointers before updating
> corresponding function pointers.
> 
> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate structure")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Ashok Kaladi <ashok.k.kaladi@intel.com>
> 
> diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
> index 48090c879a..a0232c669f 100644
> --- a/lib/ethdev/ethdev_private.c
> +++ b/lib/ethdev/ethdev_private.c
> @@ -270,17 +270,17 @@ void
>  eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
>  		const struct rte_eth_dev *dev)
>  {
> +	fpo->rxq.data = dev->data->rx_queues;
>  	fpo->rx_pkt_burst = dev->rx_pkt_burst;
> +	fpo->txq.data = dev->data->tx_queues;
>  	fpo->tx_pkt_burst = dev->tx_pkt_burst;
>  	fpo->tx_pkt_prepare = dev->tx_pkt_prepare;
>  	fpo->rx_queue_count = dev->rx_queue_count;
>  	fpo->rx_descriptor_status = dev->rx_descriptor_status;
>  	fpo->tx_descriptor_status = dev->tx_descriptor_status;
>  
> -	fpo->rxq.data = dev->data->rx_queues;
>  	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
>  
> -	fpo->txq.data = dev->data->tx_queues;
>  	fpo->txq.clbk = (void **)(uintptr_t)dev->pre_tx_burst_cbs;

Hi Ashok,

The modification is OK for the x86 platform (which has strong memory order, and will keep write-after-write order in here, and read-after-read in rte_eth_rx/tx_burst),
but for other weak memory order (like ARM platform) will fail.

For the weak memory order, suggest add write-mb in here, and read-mb in rte_eth_rx/tx_burst.
But the read-mb in rte_eth_rx/tx_burst will affect performance, especially the variable will changes only once when start.

So I suggest use write-mb + delay in here:
   fpo->rxq.data = dev->data->rx_queues;
   fpo->txq.data = dev->data->tx_queues;
   mdelay(5); // delay e.g. 5ms
   fpo->rx_pkt_burst = dev->rx_pkt_burst;
   fpo->tx_pkt_burst = dev->tx_pkt_burst;

And also cc ARMv8 maintainer.

>  }
>  
>
  
Chengwen Feng Feb. 20, 2023, 7:01 a.m. UTC | #2
Sorry resend, because forget one line.

On 2023/2/20 14:08, Ashok Kaladi wrote:
> If ethdev enqueue or dequeue function is called during
> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> the function pointers, but before setting the pointer to port data.
> In this case the newly registered enqueue/dequeue function will use
> dummy port data and end up in seg fault.
> 
> This patch moves the updation of each data pointers before updating
> corresponding function pointers.
> 
> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate structure")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Ashok Kaladi <ashok.k.kaladi@intel.com>
> 
> diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
> index 48090c879a..a0232c669f 100644
> --- a/lib/ethdev/ethdev_private.c
> +++ b/lib/ethdev/ethdev_private.c
> @@ -270,17 +270,17 @@ void
>  eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
>  		const struct rte_eth_dev *dev)
>  {
> +	fpo->rxq.data = dev->data->rx_queues;
>  	fpo->rx_pkt_burst = dev->rx_pkt_burst;
> +	fpo->txq.data = dev->data->tx_queues;
>  	fpo->tx_pkt_burst = dev->tx_pkt_burst;
>  	fpo->tx_pkt_prepare = dev->tx_pkt_prepare;
>  	fpo->rx_queue_count = dev->rx_queue_count;
>  	fpo->rx_descriptor_status = dev->rx_descriptor_status;
>  	fpo->tx_descriptor_status = dev->tx_descriptor_status;
>  
> -	fpo->rxq.data = dev->data->rx_queues;
>  	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
>  
> -	fpo->txq.data = dev->data->tx_queues;
>  	fpo->txq.clbk = (void **)(uintptr_t)dev->pre_tx_burst_cbs;

Hi Ashok,

The modification is OK for the x86 platform (which has strong memory order, and will keep write-after-write order in here, and read-after-read in rte_eth_rx/tx_burst),
but for other weak memory order (like ARM platform) will fail.

For the weak memory order, suggest add write-mb in here, and read-mb in rte_eth_rx/tx_burst.
But the read-mb in rte_eth_rx/tx_burst will affect performance, especially the variable will changes only once when start.

So I suggest use write-mb + delay in here:
   fpo->rxq.data = dev->data->rx_queues;
   fpo->txq.data = dev->data->tx_queues;
   wmb();
   mdelay(5); // delay e.g. 5ms
   fpo->rx_pkt_burst = dev->rx_pkt_burst;
   fpo->tx_pkt_burst = dev->tx_pkt_burst;

And also cc ARMv8 maintainer.

>  }
>  
>
  
Konstantin Ananyev Feb. 20, 2023, 9:44 a.m. UTC | #3
> If ethdev enqueue or dequeue function is called during
> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> the function pointers, but before setting the pointer to port data.
> In this case the newly registered enqueue/dequeue function will use
> dummy port data and end up in seg fault.
> 
> This patch moves the updation of each data pointers before updating
> corresponding function pointers.

First, such re-ordering wouldn't really fix that race condition.
Second, eth_dev_fp_ops_setup() supposed to be called only by dev/queue start/stop functions.
With current DPDK design it is not allowed to simultaneously call dev start/stop
and  data-path RX/TX functions and it is user responsibility to ensure that.
In other words - it is user responsibility to ensure that such race condition would 
not happen.
So, NACK.

> 
> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate structure")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Ashok Kaladi <ashok.k.kaladi@intel.com>
> 
> diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
> index 48090c879a..a0232c669f 100644
> --- a/lib/ethdev/ethdev_private.c
> +++ b/lib/ethdev/ethdev_private.c
> @@ -270,17 +270,17 @@ void
>  eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
>  		const struct rte_eth_dev *dev)
>  {
> +	fpo->rxq.data = dev->data->rx_queues;
>  	fpo->rx_pkt_burst = dev->rx_pkt_burst;
> +	fpo->txq.data = dev->data->tx_queues;
>  	fpo->tx_pkt_burst = dev->tx_pkt_burst;
>  	fpo->tx_pkt_prepare = dev->tx_pkt_prepare;
>  	fpo->rx_queue_count = dev->rx_queue_count;
>  	fpo->rx_descriptor_status = dev->rx_descriptor_status;
>  	fpo->tx_descriptor_status = dev->tx_descriptor_status;
> 
> -	fpo->rxq.data = dev->data->rx_queues;
>  	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
> 
> -	fpo->txq.data = dev->data->tx_queues;
>  	fpo->txq.clbk = (void **)(uintptr_t)dev->pre_tx_burst_cbs;
>  }
> 
> --
> 2.25.1
  
Ruifeng Wang Feb. 21, 2023, 7:24 a.m. UTC | #4
> -----Original Message-----
> From: fengchengwen <fengchengwen@huawei.com>
> Sent: Monday, February 20, 2023 2:58 PM
> To: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com; thomas@monjalon.net
> Cc: dev@dpdk.org; s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
> abhinandan.gujjar@intel.com; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> 
> On 2023/2/20 14:08, Ashok Kaladi wrote:
> > If ethdev enqueue or dequeue function is called during
> > eth_dev_fp_ops_setup(), it may get pre-empted after setting the
> > function pointers, but before setting the pointer to port data.
> > In this case the newly registered enqueue/dequeue function will use
> > dummy port data and end up in seg fault.
> >
> > This patch moves the updation of each data pointers before updating
> > corresponding function pointers.
> >
> > Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
> > structure")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Ashok Kaladi <ashok.k.kaladi@intel.com>
> >
> > diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
> > index 48090c879a..a0232c669f 100644
> > --- a/lib/ethdev/ethdev_private.c
> > +++ b/lib/ethdev/ethdev_private.c
> > @@ -270,17 +270,17 @@ void
> >  eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
> >  		const struct rte_eth_dev *dev)
> >  {
> > +	fpo->rxq.data = dev->data->rx_queues;
> >  	fpo->rx_pkt_burst = dev->rx_pkt_burst;
> > +	fpo->txq.data = dev->data->tx_queues;
> >  	fpo->tx_pkt_burst = dev->tx_pkt_burst;
> >  	fpo->tx_pkt_prepare = dev->tx_pkt_prepare;
> >  	fpo->rx_queue_count = dev->rx_queue_count;
> >  	fpo->rx_descriptor_status = dev->rx_descriptor_status;
> >  	fpo->tx_descriptor_status = dev->tx_descriptor_status;
> >
> > -	fpo->rxq.data = dev->data->rx_queues;
> >  	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
> >
> > -	fpo->txq.data = dev->data->tx_queues;
> >  	fpo->txq.clbk = (void **)(uintptr_t)dev->pre_tx_burst_cbs;
> 
> Hi Ashok,
> 
> The modification is OK for the x86 platform (which has strong memory order, and will keep
> write-after-write order in here, and read-after-read in rte_eth_rx/tx_burst), but for
> other weak memory order (like ARM platform) will fail.
> 
> For the weak memory order, suggest add write-mb in here, and read-mb in
> rte_eth_rx/tx_burst.
> But the read-mb in rte_eth_rx/tx_burst will affect performance, especially the variable
> will changes only once when start.
> 
> So I suggest use write-mb + delay in here:
>    fpo->rxq.data = dev->data->rx_queues;
>    fpo->txq.data = dev->data->tx_queues;
>    mdelay(5); // delay e.g. 5ms
>    fpo->rx_pkt_burst = dev->rx_pkt_burst;
>    fpo->tx_pkt_burst = dev->tx_pkt_burst;
> 
> And also cc ARMv8 maintainer.

Thanks Chengwen for the heads up.
Agree that moving the queue data assignment around won't solve the problem on systems with relaxed memory ordering.
Even with write-mb/read-mb in eth_dev_fp_ops_setup/rte_eth_rx_burst is not perfectly fine. There is a chance that
dummy enqueue/dequeue function will use updated queue data and mess it up.
Adding delay in eth_dev_fp_ops_setup is not a good way. But I haven't found a solution that doesn't hurt fast path performance.

> 
> >  }
> >
> >
  
Stephen Hemminger Feb. 21, 2023, 5 p.m. UTC | #5
On Tue, 21 Feb 2023 07:24:19 +0000
Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:

> > -----Original Message-----
> > From: fengchengwen <fengchengwen@huawei.com>
> > Sent: Monday, February 20, 2023 2:58 PM
> > To: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com; thomas@monjalon.net
> > Cc: dev@dpdk.org; s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
> > abhinandan.gujjar@intel.com; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>
> > Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> > 
> > On 2023/2/20 14:08, Ashok Kaladi wrote:  
> > > If ethdev enqueue or dequeue function is called during
> > > eth_dev_fp_ops_setup(), it may get pre-empted after setting the
> > > function pointers, but before setting the pointer to port data.
> > > In this case the newly registered enqueue/dequeue function will use
> > > dummy port data and end up in seg fault.
> > >
> > > This patch moves the updation of each data pointers before updating
> > > corresponding function pointers.
> > >
> > > Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
> > > structure")
> > > Cc: stable@dpdk.org

Why is something calling enqueue/dequeue when device is not fully started.
A correctly written application would not call rx/tx burst until after
ethdev start had finished.

Would something like this work better?

Note: there is another bug in current code. The check for link state interrupt
and link_ops could return -ENOTSUP and leave device in indeterminate state.
The check should be done before calling PMD.

diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 0266cc82acb6..d6c163ed85e7 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
 		return 0;
 	}
 
+	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
+	    dev->dev_ops->link_update == NULL) {
+		RTE_ETHDEV_LOG(INFO,
+			       "Device with port_id=%"PRIu16" link update not supported\n",
+			       port_id);
+			return -ENOTSUP;
+	}
+
 	ret = rte_eth_dev_info_get(port_id, &dev_info);
 	if (ret != 0)
 		return ret;
@@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
 		eth_dev_mac_restore(dev, &dev_info);
 
 	diag = (*dev->dev_ops->dev_start)(dev);
-	if (diag == 0)
-		dev->data->dev_started = 1;
-	else
+	if (diag != 0)
 		return eth_err(port_id, diag);
 
 	ret = eth_dev_config_restore(dev, &dev_info, port_id);
@@ -1611,16 +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
 		return ret;
 	}
 
-	if (dev->data->dev_conf.intr_conf.lsc == 0) {
-		if (*dev->dev_ops->link_update == NULL)
-			return -ENOTSUP;
-		(*dev->dev_ops->link_update)(dev, 0);
-	}
-
 	/* expose selection of PMD fast-path functions */
 	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
 
+	/* ensure state is set before marking device ready */
+	rte_smp_wmb();
+
 	rte_ethdev_trace_start(port_id);
+
+	/* Update current link state */
+	if (dev->data->dev_conf.intr_conf.lsc == 0)
+		(*dev->dev_ops->link_update)(dev, 0);
+
 	return 0;
 }
  
Chengwen Feng Feb. 22, 2023, 1:07 a.m. UTC | #6
On 2023/2/22 1:00, Stephen Hemminger wrote:
> On Tue, 21 Feb 2023 07:24:19 +0000
> Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:
> 
>>> -----Original Message-----
>>> From: fengchengwen <fengchengwen@huawei.com>
>>> Sent: Monday, February 20, 2023 2:58 PM
>>> To: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com; thomas@monjalon.net
>>> Cc: dev@dpdk.org; s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
>>> abhinandan.gujjar@intel.com; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>
>>> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
>>>
>>> On 2023/2/20 14:08, Ashok Kaladi wrote:  
>>>> If ethdev enqueue or dequeue function is called during
>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
>>>> function pointers, but before setting the pointer to port data.
>>>> In this case the newly registered enqueue/dequeue function will use
>>>> dummy port data and end up in seg fault.
>>>>
>>>> This patch moves the updation of each data pointers before updating
>>>> corresponding function pointers.
>>>>
>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
>>>> structure")
>>>> Cc: stable@dpdk.org
> 
> Why is something calling enqueue/dequeue when device is not fully started.
> A correctly written application would not call rx/tx burst until after
> ethdev start had finished.

Please refer the eb0d471a894 (ethdev: add proactive error handling mode), when driver
recover itself, the application may still invoke enqueue/dequeue API.

> 
> Would something like this work better?
> 
> Note: there is another bug in current code. The check for link state interrupt
> and link_ops could return -ENOTSUP and leave device in indeterminate state.
> The check should be done before calling PMD.
> 
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 0266cc82acb6..d6c163ed85e7 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
>  		return 0;
>  	}
>  
> +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> +	    dev->dev_ops->link_update == NULL) {
> +		RTE_ETHDEV_LOG(INFO,
> +			       "Device with port_id=%"PRIu16" link update not supported\n",
> +			       port_id);
> +			return -ENOTSUP;
> +	}
> +
>  	ret = rte_eth_dev_info_get(port_id, &dev_info);
>  	if (ret != 0)
>  		return ret;
> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
>  		eth_dev_mac_restore(dev, &dev_info);
>  
>  	diag = (*dev->dev_ops->dev_start)(dev);
> -	if (diag == 0)
> -		dev->data->dev_started = 1;
> -	else
> +	if (diag != 0)
>  		return eth_err(port_id, diag);
>  
>  	ret = eth_dev_config_restore(dev, &dev_info, port_id);
> @@ -1611,16 +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
>  		return ret;
>  	}
>  
> -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
> -		if (*dev->dev_ops->link_update == NULL)
> -			return -ENOTSUP;
> -		(*dev->dev_ops->link_update)(dev, 0);
> -	}
> -
>  	/* expose selection of PMD fast-path functions */
>  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
>  
> +	/* ensure state is set before marking device ready */
> +	rte_smp_wmb();
> +
>  	rte_ethdev_trace_start(port_id);
> +
> +	/* Update current link state */
> +	if (dev->data->dev_conf.intr_conf.lsc == 0)
> +		(*dev->dev_ops->link_update)(dev, 0);
> +
>  	return 0;
>  }
>  
> 
> .
>
  
Ruifeng Wang Feb. 22, 2023, 9:41 a.m. UTC | #7
> -----Original Message-----
> From: fengchengwen <fengchengwen@huawei.com>
> Sent: Wednesday, February 22, 2023 9:07 AM
> To: Stephen Hemminger <stephen@networkplumber.org>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com; thomas@monjalon.net;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com; abhinandan.gujjar@intel.com;
> stable@dpdk.org; nd <nd@arm.com>
> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> 
> On 2023/2/22 1:00, Stephen Hemminger wrote:
> > On Tue, 21 Feb 2023 07:24:19 +0000
> > Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:
> >
> >>> -----Original Message-----
> >>> From: fengchengwen <fengchengwen@huawei.com>
> >>> Sent: Monday, February 20, 2023 2:58 PM
> >>> To: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> >>> thomas@monjalon.net
> >>> Cc: dev@dpdk.org; s.v.naga.harish.k@intel.com;
> >>> erik.g.carrillo@intel.com; abhinandan.gujjar@intel.com;
> >>> stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>
> >>> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops
> >>> setup
> >>>
> >>> On 2023/2/20 14:08, Ashok Kaladi wrote:
> >>>> If ethdev enqueue or dequeue function is called during
> >>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
> >>>> function pointers, but before setting the pointer to port data.
> >>>> In this case the newly registered enqueue/dequeue function will use
> >>>> dummy port data and end up in seg fault.
> >>>>
> >>>> This patch moves the updation of each data pointers before updating
> >>>> corresponding function pointers.
> >>>>
> >>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
> >>>> structure")
> >>>> Cc: stable@dpdk.org
> >
> > Why is something calling enqueue/dequeue when device is not fully started.
> > A correctly written application would not call rx/tx burst until after
> > ethdev start had finished.
> 
> Please refer the eb0d471a894 (ethdev: add proactive error handling mode), when driver
> recover itself, the application may still invoke enqueue/dequeue API.
> 
> >
> > Would something like this work better?
> >
> > Note: there is another bug in current code. The check for link state
> > interrupt and link_ops could return -ENOTSUP and leave device in indeterminate state.
> > The check should be done before calling PMD.
> >
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 0266cc82acb6..d6c163ed85e7 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> >  		return 0;
> >  	}
> >
> > +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> > +	    dev->dev_ops->link_update == NULL) {
> > +		RTE_ETHDEV_LOG(INFO,
> > +			       "Device with port_id=%"PRIu16" link update not supported\n",
> > +			       port_id);
> > +			return -ENOTSUP;
> > +	}
> > +
> >  	ret = rte_eth_dev_info_get(port_id, &dev_info);
> >  	if (ret != 0)
> >  		return ret;
> > @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> >  		eth_dev_mac_restore(dev, &dev_info);
> >
> >  	diag = (*dev->dev_ops->dev_start)(dev);
> > -	if (diag == 0)
> > -		dev->data->dev_started = 1;
> > -	else
> > +	if (diag != 0)
> >  		return eth_err(port_id, diag);
> >
> >  	ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ -1611,16
> > +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> >  		return ret;
> >  	}
> >
> > -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
> > -		if (*dev->dev_ops->link_update == NULL)
> > -			return -ENOTSUP;
> > -		(*dev->dev_ops->link_update)(dev, 0);
> > -	}
> > -
> >  	/* expose selection of PMD fast-path functions */
> >  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> >
> > +	/* ensure state is set before marking device ready */
> > +	rte_smp_wmb();

Without a read barrier at the reader side (rte_eth_rx_burst), the wmb here may not fulfill the required data sync.

One solution is to change eth_dev_fp_ops_reset. Replacing dummy_eth_rx_burst with rte_eth_pkt_burst_dummy and
not touch rxq.data/txq.data. By doing this, the sync requirement between pkt_burst handle and qdata can be removed.
Because rte_eth_pkt_burst_dummy doesn't work on any data.
The downside is loss of error log and stack dump when dummy handle is called.

> > +
> >  	rte_ethdev_trace_start(port_id);
> > +
> > +	/* Update current link state */
> > +	if (dev->data->dev_conf.intr_conf.lsc == 0)
> > +		(*dev->dev_ops->link_update)(dev, 0);
> > +
> >  	return 0;
> >  }
> >
> >
> > .
> >
  
Konstantin Ananyev Feb. 22, 2023, 10:41 a.m. UTC | #8
> -----Original Message-----
> From: fengchengwen <fengchengwen@huawei.com>
> Sent: Wednesday, February 22, 2023 1:07 AM
> To: Stephen Hemminger <stephen@networkplumber.org>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com; thomas@monjalon.net; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org; s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
> abhinandan.gujjar@intel.com; stable@dpdk.org; nd <nd@arm.com>
> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> 
> On 2023/2/22 1:00, Stephen Hemminger wrote:
> > On Tue, 21 Feb 2023 07:24:19 +0000
> > Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:
> >
> >>> -----Original Message-----
> >>> From: fengchengwen <fengchengwen@huawei.com>
> >>> Sent: Monday, February 20, 2023 2:58 PM
> >>> To: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com; thomas@monjalon.net
> >>> Cc: dev@dpdk.org; s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
> >>> abhinandan.gujjar@intel.com; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>
> >>> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> >>>
> >>> On 2023/2/20 14:08, Ashok Kaladi wrote:
> >>>> If ethdev enqueue or dequeue function is called during
> >>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
> >>>> function pointers, but before setting the pointer to port data.
> >>>> In this case the newly registered enqueue/dequeue function will use
> >>>> dummy port data and end up in seg fault.
> >>>>
> >>>> This patch moves the updation of each data pointers before updating
> >>>> corresponding function pointers.
> >>>>
> >>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
> >>>> structure")
> >>>> Cc: stable@dpdk.org
> >
> > Why is something calling enqueue/dequeue when device is not fully started.
> > A correctly written application would not call rx/tx burst until after
> > ethdev start had finished.
> 
> Please refer the eb0d471a894 (ethdev: add proactive error handling mode), when driver
> recover itself, the application may still invoke enqueue/dequeue API.

Right now DPDK ethdev layer *does not* provide synchronization mechanisms
between data-path and control-path functions.
That was a deliberate deisgn choice. If we want to change that rule, then I suppose
we need a community consensus for it. 
I think that if the driver wants to provide some sort of error recovery procedure,
then it has to provide some synchronization mechanism inside it between data-path
and control-path functions.
Actually looking at eb0d471a894 (ethdev: add proactive error handling mode),
and following patches I wonder how it creeped in?
It seems we just introduced a loophole for race condition with this approach...
It probably needs to be either deprecated or reworked.

> 
> >
> > Would something like this work better?
> >
> > Note: there is another bug in current code. The check for link state interrupt
> > and link_ops could return -ENOTSUP and leave device in indeterminate state.
> > The check should be done before calling PMD.
> >
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > index 0266cc82acb6..d6c163ed85e7 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> >  		return 0;
> >  	}
> >
> > +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> > +	    dev->dev_ops->link_update == NULL) {
> > +		RTE_ETHDEV_LOG(INFO,
> > +			       "Device with port_id=%"PRIu16" link update not supported\n",
> > +			       port_id);
> > +			return -ENOTSUP;
> > +	}
> > +
> >  	ret = rte_eth_dev_info_get(port_id, &dev_info);
> >  	if (ret != 0)
> >  		return ret;
> > @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> >  		eth_dev_mac_restore(dev, &dev_info);
> >
> >  	diag = (*dev->dev_ops->dev_start)(dev);
> > -	if (diag == 0)
> > -		dev->data->dev_started = 1;
> > -	else
> > +	if (diag != 0)
> >  		return eth_err(port_id, diag);
> >
> >  	ret = eth_dev_config_restore(dev, &dev_info, port_id);
> > @@ -1611,16 +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> >  		return ret;
> >  	}
> >
> > -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
> > -		if (*dev->dev_ops->link_update == NULL)
> > -			return -ENOTSUP;
> > -		(*dev->dev_ops->link_update)(dev, 0);
> > -	}
> > -
> >  	/* expose selection of PMD fast-path functions */
> >  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> >
> > +	/* ensure state is set before marking device ready */
> > +	rte_smp_wmb();
> > +
> >  	rte_ethdev_trace_start(port_id);
> > +
> > +	/* Update current link state */
> > +	if (dev->data->dev_conf.intr_conf.lsc == 0)
> > +		(*dev->dev_ops->link_update)(dev, 0);
> > +
> >  	return 0;
> >  }
> >
> >
> > .
> >
  
Honnappa Nagarahalli Feb. 22, 2023, 10:48 p.m. UTC | #9
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Wednesday, February 22, 2023 4:41 AM
> To: Fengchengwen <fengchengwen@huawei.com>; Stephen Hemminger
> <stephen@networkplumber.org>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> thomas@monjalon.net; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
> abhinandan.gujjar@intel.com; stable@dpdk.org; nd <nd@arm.com>
> Subject: RE: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> 
> 
> 
> > -----Original Message-----
> > From: fengchengwen <fengchengwen@huawei.com>
> > Sent: Wednesday, February 22, 2023 1:07 AM
> > To: Stephen Hemminger <stephen@networkplumber.org>; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>
> > Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> > thomas@monjalon.net; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> > s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
> > abhinandan.gujjar@intel.com; stable@dpdk.org; nd <nd@arm.com>
> > Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops
> > setup
> >
> > On 2023/2/22 1:00, Stephen Hemminger wrote:
> > > On Tue, 21 Feb 2023 07:24:19 +0000
> > > Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:
> > >
> > >>> -----Original Message-----
> > >>> From: fengchengwen <fengchengwen@huawei.com>
> > >>> Sent: Monday, February 20, 2023 2:58 PM
> > >>> To: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> > >>> thomas@monjalon.net
> > >>> Cc: dev@dpdk.org; s.v.naga.harish.k@intel.com;
> > >>> erik.g.carrillo@intel.com; abhinandan.gujjar@intel.com;
> > >>> stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>
> > >>> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path
> > >>> ops setup
> > >>>
> > >>> On 2023/2/20 14:08, Ashok Kaladi wrote:
> > >>>> If ethdev enqueue or dequeue function is called during
> > >>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
> > >>>> function pointers, but before setting the pointer to port data.
> > >>>> In this case the newly registered enqueue/dequeue function will
> > >>>> use dummy port data and end up in seg fault.
> > >>>>
> > >>>> This patch moves the updation of each data pointers before
> > >>>> updating corresponding function pointers.
> > >>>>
> > >>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
> > >>>> structure")
> > >>>> Cc: stable@dpdk.org
> > >
> > > Why is something calling enqueue/dequeue when device is not fully
> started.
> > > A correctly written application would not call rx/tx burst until
> > > after ethdev start had finished.
> >
> > Please refer the eb0d471a894 (ethdev: add proactive error handling
> > mode), when driver recover itself, the application may still invoke
> enqueue/dequeue API.
> 
> Right now DPDK ethdev layer *does not* provide synchronization
> mechanisms between data-path and control-path functions.
> That was a deliberate deisgn choice. If we want to change that rule, then I
> suppose we need a community consensus for it.
+1
Any such synchronization typically requires using load-acquire on data plane, which brings down the performance. But, init time synchronization would not affect the performance (stating the obvious).

> I think that if the driver wants to provide some sort of error recovery
> procedure, then it has to provide some synchronization mechanism inside it
> between data-path and control-path functions.
> Actually looking at eb0d471a894 (ethdev: add proactive error handling
> mode), and following patches I wonder how it creeped in?
> It seems we just introduced a loophole for race condition with this
> approach...
> It probably needs to be either deprecated or reworked.
> 
> >
> > >
> > > Would something like this work better?
> > >
> > > Note: there is another bug in current code. The check for link state
> > > interrupt and link_ops could return -ENOTSUP and leave device in
> indeterminate state.
> > > The check should be done before calling PMD.
> > >
> > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > > 0266cc82acb6..d6c163ed85e7 100644
> > > --- a/lib/ethdev/rte_ethdev.c
> > > +++ b/lib/ethdev/rte_ethdev.c
> > > @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> > >  		return 0;
> > >  	}
> > >
> > > +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> > > +	    dev->dev_ops->link_update == NULL) {
> > > +		RTE_ETHDEV_LOG(INFO,
> > > +			       "Device with port_id=%"PRIu16" link update not
> supported\n",
> > > +			       port_id);
> > > +			return -ENOTSUP;
> > > +	}
> > > +
> > >  	ret = rte_eth_dev_info_get(port_id, &dev_info);
> > >  	if (ret != 0)
> > >  		return ret;
> > > @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> > >  		eth_dev_mac_restore(dev, &dev_info);
> > >
> > >  	diag = (*dev->dev_ops->dev_start)(dev);
> > > -	if (diag == 0)
> > > -		dev->data->dev_started = 1;
> > > -	else
> > > +	if (diag != 0)
> > >  		return eth_err(port_id, diag);
> > >
> > >  	ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ -1611,16
> > > +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> > >  		return ret;
> > >  	}
> > >
> > > -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
> > > -		if (*dev->dev_ops->link_update == NULL)
> > > -			return -ENOTSUP;
> > > -		(*dev->dev_ops->link_update)(dev, 0);
> > > -	}
> > > -
> > >  	/* expose selection of PMD fast-path functions */
> > >  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> > >
> > > +	/* ensure state is set before marking device ready */
> > > +	rte_smp_wmb();
> > > +
> > >  	rte_ethdev_trace_start(port_id);
> > > +
> > > +	/* Update current link state */
> > > +	if (dev->data->dev_conf.intr_conf.lsc == 0)
> > > +		(*dev->dev_ops->link_update)(dev, 0);
> > > +
> > >  	return 0;
> > >  }
> > >
> > >
> > > .
> > >
  
Stephen Hemminger Feb. 23, 2023, 1:15 a.m. UTC | #10
On Wed, 22 Feb 2023 22:48:25 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> > -----Original Message-----
> > From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > Sent: Wednesday, February 22, 2023 4:41 AM
> > To: Fengchengwen <fengchengwen@huawei.com>; Stephen Hemminger
> > <stephen@networkplumber.org>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> > Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> > thomas@monjalon.net; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> > s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
> > abhinandan.gujjar@intel.com; stable@dpdk.org; nd <nd@arm.com>
> > Subject: RE: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> > 
> > 
> >   
> > > -----Original Message-----
> > > From: fengchengwen <fengchengwen@huawei.com>
> > > Sent: Wednesday, February 22, 2023 1:07 AM
> > > To: Stephen Hemminger <stephen@networkplumber.org>; Ruifeng Wang
> > > <Ruifeng.Wang@arm.com>
> > > Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> > > thomas@monjalon.net; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> > > s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
> > > abhinandan.gujjar@intel.com; stable@dpdk.org; nd <nd@arm.com>
> > > Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops
> > > setup
> > >
> > > On 2023/2/22 1:00, Stephen Hemminger wrote:  
> > > > On Tue, 21 Feb 2023 07:24:19 +0000
> > > > Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:
> > > >  
> > > >>> -----Original Message-----
> > > >>> From: fengchengwen <fengchengwen@huawei.com>
> > > >>> Sent: Monday, February 20, 2023 2:58 PM
> > > >>> To: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> > > >>> thomas@monjalon.net
> > > >>> Cc: dev@dpdk.org; s.v.naga.harish.k@intel.com;
> > > >>> erik.g.carrillo@intel.com; abhinandan.gujjar@intel.com;
> > > >>> stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>
> > > >>> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path
> > > >>> ops setup
> > > >>>
> > > >>> On 2023/2/20 14:08, Ashok Kaladi wrote:  
> > > >>>> If ethdev enqueue or dequeue function is called during
> > > >>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
> > > >>>> function pointers, but before setting the pointer to port data.
> > > >>>> In this case the newly registered enqueue/dequeue function will
> > > >>>> use dummy port data and end up in seg fault.
> > > >>>>
> > > >>>> This patch moves the updation of each data pointers before
> > > >>>> updating corresponding function pointers.
> > > >>>>
> > > >>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
> > > >>>> structure")
> > > >>>> Cc: stable@dpdk.org  
> > > >
> > > > Why is something calling enqueue/dequeue when device is not fully  
> > started.  
> > > > A correctly written application would not call rx/tx burst until
> > > > after ethdev start had finished.  
> > >
> > > Please refer the eb0d471a894 (ethdev: add proactive error handling
> > > mode), when driver recover itself, the application may still invoke  
> > enqueue/dequeue API.
> > 
> > Right now DPDK ethdev layer *does not* provide synchronization
> > mechanisms between data-path and control-path functions.
> > That was a deliberate deisgn choice. If we want to change that rule, then I
> > suppose we need a community consensus for it.  
> +1
> Any such synchronization typically requires using load-acquire on data plane, which brings down the performance. But, init time synchronization would not affect the performance (stating the obvious).
> 
> > I think that if the driver wants to provide some sort of error recovery
> > procedure, then it has to provide some synchronization mechanism inside it
> > between data-path and control-path functions.
> > Actually looking at eb0d471a894 (ethdev: add proactive error handling
> > mode), and following patches I wonder how it creeped in?
> > It seems we just introduced a loophole for race condition with this
> > approach...
> > It probably needs to be either deprecated or reworked.
> >   
> > >  
> > > >
> > > > Would something like this work better?
> > > >
> > > > Note: there is another bug in current code. The check for link state
> > > > interrupt and link_ops could return -ENOTSUP and leave device in  
> > indeterminate state.  
> > > > The check should be done before calling PMD.
> > > >
> > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > > > 0266cc82acb6..d6c163ed85e7 100644
> > > > --- a/lib/ethdev/rte_ethdev.c
> > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> > > >  		return 0;
> > > >  	}
> > > >
> > > > +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> > > > +	    dev->dev_ops->link_update == NULL) {
> > > > +		RTE_ETHDEV_LOG(INFO,
> > > > +			       "Device with port_id=%"PRIu16" link update not  
> > supported\n",  
> > > > +			       port_id);
> > > > +			return -ENOTSUP;
> > > > +	}
> > > > +
> > > >  	ret = rte_eth_dev_info_get(port_id, &dev_info);
> > > >  	if (ret != 0)
> > > >  		return ret;
> > > > @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> > > >  		eth_dev_mac_restore(dev, &dev_info);
> > > >
> > > >  	diag = (*dev->dev_ops->dev_start)(dev);
> > > > -	if (diag == 0)
> > > > -		dev->data->dev_started = 1;
> > > > -	else
> > > > +	if (diag != 0)
> > > >  		return eth_err(port_id, diag);
> > > >
> > > >  	ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ -1611,16
> > > > +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> > > >  		return ret;
> > > >  	}
> > > >
> > > > -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
> > > > -		if (*dev->dev_ops->link_update == NULL)
> > > > -			return -ENOTSUP;
> > > > -		(*dev->dev_ops->link_update)(dev, 0);
> > > > -	}
> > > > -
> > > >  	/* expose selection of PMD fast-path functions */
> > > >  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> > > >
> > > > +	/* ensure state is set before marking device ready */
> > > > +	rte_smp_wmb();
> > > > +
> > > >  	rte_ethdev_trace_start(port_id);
> > > > +
> > > > +	/* Update current link state */
> > > > +	if (dev->data->dev_conf.intr_conf.lsc == 0)
> > > > +		(*dev->dev_ops->link_update)(dev, 0);
> > > > +
> > > >  	return 0;
> > > >  }
> > > >
> > > >
> > > > .
> > > >  
> 

What about making started a real flag (with weak atomic's) and then
any dataplane threads should wait for started flag before going into
main loop.

It would not solve the error recovery case where the device decides
to take itself offline. But that design is racy to start with and
needs to be redesigned.
  
Honnappa Nagarahalli Feb. 23, 2023, 4:40 a.m. UTC | #11
<snip>

> > >>>
> > >>> On 2023/2/20 14:08, Ashok Kaladi wrote:
> > >>>> If ethdev enqueue or dequeue function is called during
> > >>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
> > >>>> function pointers, but before setting the pointer to port data.
> > >>>> In this case the newly registered enqueue/dequeue function will
> > >>>> use dummy port data and end up in seg fault.
> > >>>>
> > >>>> This patch moves the updation of each data pointers before
> > >>>> updating corresponding function pointers.
> > >>>>
> > >>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
> > >>>> structure")
> > >>>> Cc: stable@dpdk.org
> > >
> > > Why is something calling enqueue/dequeue when device is not fully
> started.
> > > A correctly written application would not call rx/tx burst until
> > > after ethdev start had finished.
> >
> > Please refer the eb0d471a894 (ethdev: add proactive error handling
> > mode), when driver recover itself, the application may still invoke
> enqueue/dequeue API.
> 
> Right now DPDK ethdev layer *does not* provide synchronization
> mechanisms between data-path and control-path functions.
> That was a deliberate deisgn choice. If we want to change that rule, then I
> suppose we need a community consensus for it.
> I think that if the driver wants to provide some sort of error recovery
> procedure, then it has to provide some synchronization mechanism inside it
> between data-path and control-path functions.
> Actually looking at eb0d471a894 (ethdev: add proactive error handling
> mode), and following patches I wonder how it creeped in?
> It seems we just introduced a loophole for race condition with this
> approach...
> It probably needs to be either deprecated or reworked.
Looking at the commit, it does not say anything about the data plane functions which probably means, the error recovery is happening within the data plane thread. What happens to other data plane threads that are polling the same port on which the error recovery is happening?

Also, the commit log says that while the error recovery is under progress, the application should not call any control plane APIs. Does that mean, the application has to check for error condition every time it calls a control plane API?

The commit message also says that "PMD makes sure the control path operations failed with retcode -EBUSY". It does not say how it does this. But, any communication from the PMD thread to control plane thread may introduce race conditions if not done correctly.

> 
> >
> > >
> > > Would something like this work better?
> > >
> > > Note: there is another bug in current code. The check for link state
> > > interrupt and link_ops could return -ENOTSUP and leave device in
> indeterminate state.
> > > The check should be done before calling PMD.
> > >
> > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > > 0266cc82acb6..d6c163ed85e7 100644
> > > --- a/lib/ethdev/rte_ethdev.c
> > > +++ b/lib/ethdev/rte_ethdev.c
> > > @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> > >  		return 0;
> > >  	}
> > >
> > > +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> > > +	    dev->dev_ops->link_update == NULL) {
> > > +		RTE_ETHDEV_LOG(INFO,
> > > +			       "Device with port_id=%"PRIu16" link update not
> supported\n",
> > > +			       port_id);
> > > +			return -ENOTSUP;
> > > +	}
> > > +
> > >  	ret = rte_eth_dev_info_get(port_id, &dev_info);
> > >  	if (ret != 0)
> > >  		return ret;
> > > @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> > >  		eth_dev_mac_restore(dev, &dev_info);
> > >
> > >  	diag = (*dev->dev_ops->dev_start)(dev);
> > > -	if (diag == 0)
> > > -		dev->data->dev_started = 1;
> > > -	else
> > > +	if (diag != 0)
> > >  		return eth_err(port_id, diag);
> > >
> > >  	ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ -1611,16
> > > +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> > >  		return ret;
> > >  	}
> > >
> > > -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
> > > -		if (*dev->dev_ops->link_update == NULL)
> > > -			return -ENOTSUP;
> > > -		(*dev->dev_ops->link_update)(dev, 0);
> > > -	}
> > > -
> > >  	/* expose selection of PMD fast-path functions */
> > >  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> > >
> > > +	/* ensure state is set before marking device ready */
> > > +	rte_smp_wmb();
> > > +
> > >  	rte_ethdev_trace_start(port_id);
> > > +
> > > +	/* Update current link state */
> > > +	if (dev->data->dev_conf.intr_conf.lsc == 0)
> > > +		(*dev->dev_ops->link_update)(dev, 0);
> > > +
> > >  	return 0;
> > >  }
> > >
> > >
> > > .
> > >
  
Honnappa Nagarahalli Feb. 23, 2023, 4:47 a.m. UTC | #12
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Wednesday, February 22, 2023 7:15 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Cc: Konstantin Ananyev <konstantin.ananyev@huawei.com>; Fengchengwen
> <fengchengwen@huawei.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> thomas@monjalon.net; dev@dpdk.org; s.v.naga.harish.k@intel.com;
> erik.g.carrillo@intel.com; abhinandan.gujjar@intel.com; stable@dpdk.org; nd
> <nd@arm.com>
> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> 
> On Wed, 22 Feb 2023 22:48:25 +0000
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> 
> > > -----Original Message-----
> > > From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > Sent: Wednesday, February 22, 2023 4:41 AM
> > > To: Fengchengwen <fengchengwen@huawei.com>; Stephen Hemminger
> > > <stephen@networkplumber.org>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> > > Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> > > thomas@monjalon.net; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> > > s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
> > > abhinandan.gujjar@intel.com; stable@dpdk.org; nd <nd@arm.com>
> > > Subject: RE: [PATCH 2/2] ethdev: fix race condition in fast-path ops
> > > setup
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: fengchengwen <fengchengwen@huawei.com>
> > > > Sent: Wednesday, February 22, 2023 1:07 AM
> > > > To: Stephen Hemminger <stephen@networkplumber.org>; Ruifeng
> Wang
> > > > <Ruifeng.Wang@arm.com>
> > > > Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> > > > thomas@monjalon.net; Honnappa Nagarahalli
> > > > <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> > > > s.v.naga.harish.k@intel.com; erik.g.carrillo@intel.com;
> > > > abhinandan.gujjar@intel.com; stable@dpdk.org; nd <nd@arm.com>
> > > > Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path
> > > > ops setup
> > > >
> > > > On 2023/2/22 1:00, Stephen Hemminger wrote:
> > > > > On Tue, 21 Feb 2023 07:24:19 +0000 Ruifeng Wang
> > > > > <Ruifeng.Wang@arm.com> wrote:
> > > > >
> > > > >>> -----Original Message-----
> > > > >>> From: fengchengwen <fengchengwen@huawei.com>
> > > > >>> Sent: Monday, February 20, 2023 2:58 PM
> > > > >>> To: Ashok Kaladi <ashok.k.kaladi@intel.com>;
> > > > >>> jerinj@marvell.com; thomas@monjalon.net
> > > > >>> Cc: dev@dpdk.org; s.v.naga.harish.k@intel.com;
> > > > >>> erik.g.carrillo@intel.com; abhinandan.gujjar@intel.com;
> > > > >>> stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>
> > > > >>> Subject: Re: [PATCH 2/2] ethdev: fix race condition in
> > > > >>> fast-path ops setup
> > > > >>>
> > > > >>> On 2023/2/20 14:08, Ashok Kaladi wrote:
> > > > >>>> If ethdev enqueue or dequeue function is called during
> > > > >>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> > > > >>>> the function pointers, but before setting the pointer to port data.
> > > > >>>> In this case the newly registered enqueue/dequeue function
> > > > >>>> will use dummy port data and end up in seg fault.
> > > > >>>>
> > > > >>>> This patch moves the updation of each data pointers before
> > > > >>>> updating corresponding function pointers.
> > > > >>>>
> > > > >>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into
> > > > >>>> separate
> > > > >>>> structure")
> > > > >>>> Cc: stable@dpdk.org
> > > > >
> > > > > Why is something calling enqueue/dequeue when device is not
> > > > > fully
> > > started.
> > > > > A correctly written application would not call rx/tx burst until
> > > > > after ethdev start had finished.
> > > >
> > > > Please refer the eb0d471a894 (ethdev: add proactive error handling
> > > > mode), when driver recover itself, the application may still
> > > > invoke
> > > enqueue/dequeue API.
> > >
> > > Right now DPDK ethdev layer *does not* provide synchronization
> > > mechanisms between data-path and control-path functions.
> > > That was a deliberate deisgn choice. If we want to change that rule,
> > > then I suppose we need a community consensus for it.
> > +1
> > Any such synchronization typically requires using load-acquire on data
> plane, which brings down the performance. But, init time synchronization
> would not affect the performance (stating the obvious).
> >
> > > I think that if the driver wants to provide some sort of error
> > > recovery procedure, then it has to provide some synchronization
> > > mechanism inside it between data-path and control-path functions.
> > > Actually looking at eb0d471a894 (ethdev: add proactive error
> > > handling mode), and following patches I wonder how it creeped in?
> > > It seems we just introduced a loophole for race condition with this
> > > approach...
> > > It probably needs to be either deprecated or reworked.
> > >
> > > >
> > > > >
> > > > > Would something like this work better?
> > > > >
> > > > > Note: there is another bug in current code. The check for link
> > > > > state interrupt and link_ops could return -ENOTSUP and leave
> > > > > device in
> > > indeterminate state.
> > > > > The check should be done before calling PMD.
> > > > >
> > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > index
> > > > > 0266cc82acb6..d6c163ed85e7 100644
> > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> > > > >  		return 0;
> > > > >  	}
> > > > >
> > > > > +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> > > > > +	    dev->dev_ops->link_update == NULL) {
> > > > > +		RTE_ETHDEV_LOG(INFO,
> > > > > +			       "Device with port_id=%"PRIu16" link
> update not
> > > supported\n",
> > > > > +			       port_id);
> > > > > +			return -ENOTSUP;
> > > > > +	}
> > > > > +
> > > > >  	ret = rte_eth_dev_info_get(port_id, &dev_info);
> > > > >  	if (ret != 0)
> > > > >  		return ret;
> > > > > @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> > > > >  		eth_dev_mac_restore(dev, &dev_info);
> > > > >
> > > > >  	diag = (*dev->dev_ops->dev_start)(dev);
> > > > > -	if (diag == 0)
> > > > > -		dev->data->dev_started = 1;
> > > > > -	else
> > > > > +	if (diag != 0)
> > > > >  		return eth_err(port_id, diag);
> > > > >
> > > > >  	ret = eth_dev_config_restore(dev, &dev_info, port_id); @@
> > > > > -1611,16
> > > > > +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> > > > >  		return ret;
> > > > >  	}
> > > > >
> > > > > -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
> > > > > -		if (*dev->dev_ops->link_update == NULL)
> > > > > -			return -ENOTSUP;
> > > > > -		(*dev->dev_ops->link_update)(dev, 0);
> > > > > -	}
> > > > > -
> > > > >  	/* expose selection of PMD fast-path functions */
> > > > >  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> > > > >
> > > > > +	/* ensure state is set before marking device ready */
> > > > > +	rte_smp_wmb();
> > > > > +
> > > > >  	rte_ethdev_trace_start(port_id);
> > > > > +
> > > > > +	/* Update current link state */
> > > > > +	if (dev->data->dev_conf.intr_conf.lsc == 0)
> > > > > +		(*dev->dev_ops->link_update)(dev, 0);
> > > > > +
> > > > >  	return 0;
> > > > >  }
> > > > >
> > > > >
> > > > > .
> > > > >
> >
> 
> What about making started a real flag (with weak atomic's) and then any
> dataplane threads should wait for started flag before going into main loop.
This does not solve the later loads getting hoisted before reading the 'flag'.

> 
> It would not solve the error recovery case where the device decides to take
> itself offline. But that design is racy to start with and needs to be redesigned.
  
Chengwen Feng Feb. 23, 2023, 8:23 a.m. UTC | #13
On 2023/2/23 12:40, Honnappa Nagarahalli wrote:
> <snip>
> 
>>>>>>
>>>>>> On 2023/2/20 14:08, Ashok Kaladi wrote:
>>>>>>> If ethdev enqueue or dequeue function is called during
>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
>>>>>>> function pointers, but before setting the pointer to port data.
>>>>>>> In this case the newly registered enqueue/dequeue function will
>>>>>>> use dummy port data and end up in seg fault.
>>>>>>>
>>>>>>> This patch moves the updation of each data pointers before
>>>>>>> updating corresponding function pointers.
>>>>>>>
>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
>>>>>>> structure")
>>>>>>> Cc: stable@dpdk.org
>>>>
>>>> Why is something calling enqueue/dequeue when device is not fully
>> started.
>>>> A correctly written application would not call rx/tx burst until
>>>> after ethdev start had finished.
>>>
>>> Please refer the eb0d471a894 (ethdev: add proactive error handling
>>> mode), when driver recover itself, the application may still invoke
>> enqueue/dequeue API.
>>
>> Right now DPDK ethdev layer *does not* provide synchronization
>> mechanisms between data-path and control-path functions.
>> That was a deliberate deisgn choice. If we want to change that rule, then I
>> suppose we need a community consensus for it.
>> I think that if the driver wants to provide some sort of error recovery
>> procedure, then it has to provide some synchronization mechanism inside it
>> between data-path and control-path functions.
>> Actually looking at eb0d471a894 (ethdev: add proactive error handling
>> mode), and following patches I wonder how it creeped in?
>> It seems we just introduced a loophole for race condition with this
>> approach...

Could you try to describe the specific scenario of loophole ?

>> It probably needs to be either deprecated or reworked.
> Looking at the commit, it does not say anything about the data plane functions which probably means, the error recovery is happening within the data plane thread. What happens to other data plane threads that are polling the same port on which the error recovery is happening?

The commit log says: "the PMD sets the data path pointers to dummy functions".

So the data plane threads will receive non-packet and send zero with port which in error recovery.

> 
> Also, the commit log says that while the error recovery is under progress, the application should not call any control plane APIs. Does that mean, the application has to check for error condition every time it calls a control plane API?

If application has not register event (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control plane API, but it will return failed.
If application has register above callback, it can wait for recovery result, or direct call without wait but this will return failed.

> 
> The commit message also says that "PMD makes sure the control path operations failed with retcode -EBUSY". It does not say how it does this. But, any communication from the PMD thread to control plane thread may introduce race conditions if not done correctly.

First there are no PMD thread, do you mean eal-intr-thread ?

As for this question, you can see PMDs which already implement it, they both provides mutual exclusion protection.

> 
>>
>>>
>>>>
>>>> Would something like this work better?
>>>>
>>>> Note: there is another bug in current code. The check for link state
>>>> interrupt and link_ops could return -ENOTSUP and leave device in
>> indeterminate state.
>>>> The check should be done before calling PMD.
>>>>
>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
>>>> 0266cc82acb6..d6c163ed85e7 100644
>>>> --- a/lib/ethdev/rte_ethdev.c
>>>> +++ b/lib/ethdev/rte_ethdev.c
>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
>>>>  		return 0;
>>>>  	}
>>>>
>>>> +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
>>>> +	    dev->dev_ops->link_update == NULL) {
>>>> +		RTE_ETHDEV_LOG(INFO,
>>>> +			       "Device with port_id=%"PRIu16" link update not
>> supported\n",
>>>> +			       port_id);
>>>> +			return -ENOTSUP;
>>>> +	}
>>>> +
>>>>  	ret = rte_eth_dev_info_get(port_id, &dev_info);
>>>>  	if (ret != 0)
>>>>  		return ret;
>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
>>>>  		eth_dev_mac_restore(dev, &dev_info);
>>>>
>>>>  	diag = (*dev->dev_ops->dev_start)(dev);
>>>> -	if (diag == 0)
>>>> -		dev->data->dev_started = 1;
>>>> -	else
>>>> +	if (diag != 0)
>>>>  		return eth_err(port_id, diag);
>>>>
>>>>  	ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ -1611,16
>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
>>>>  		return ret;
>>>>  	}
>>>>
>>>> -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
>>>> -		if (*dev->dev_ops->link_update == NULL)
>>>> -			return -ENOTSUP;
>>>> -		(*dev->dev_ops->link_update)(dev, 0);
>>>> -	}
>>>> -
>>>>  	/* expose selection of PMD fast-path functions */
>>>>  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
>>>>
>>>> +	/* ensure state is set before marking device ready */
>>>> +	rte_smp_wmb();
>>>> +
>>>>  	rte_ethdev_trace_start(port_id);
>>>> +
>>>> +	/* Update current link state */
>>>> +	if (dev->data->dev_conf.intr_conf.lsc == 0)
>>>> +		(*dev->dev_ops->link_update)(dev, 0);
>>>> +
>>>>  	return 0;
>>>>  }
>>>>
>>>>
>>>> .
>>>>
>
  
Konstantin Ananyev Feb. 23, 2023, 1:31 p.m. UTC | #14
> >>>>>>> If ethdev enqueue or dequeue function is called during
> >>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
> >>>>>>> function pointers, but before setting the pointer to port data.
> >>>>>>> In this case the newly registered enqueue/dequeue function will
> >>>>>>> use dummy port data and end up in seg fault.
> >>>>>>>
> >>>>>>> This patch moves the updation of each data pointers before
> >>>>>>> updating corresponding function pointers.
> >>>>>>>
> >>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
> >>>>>>> structure")
> >>>>>>> Cc: stable@dpdk.org
> >>>>
> >>>> Why is something calling enqueue/dequeue when device is not fully
> >> started.
> >>>> A correctly written application would not call rx/tx burst until
> >>>> after ethdev start had finished.
> >>>
> >>> Please refer the eb0d471a894 (ethdev: add proactive error handling
> >>> mode), when driver recover itself, the application may still invoke
> >> enqueue/dequeue API.
> >>
> >> Right now DPDK ethdev layer *does not* provide synchronization
> >> mechanisms between data-path and control-path functions.
> >> That was a deliberate deisgn choice. If we want to change that rule, then I
> >> suppose we need a community consensus for it.
> >> I think that if the driver wants to provide some sort of error recovery
> >> procedure, then it has to provide some synchronization mechanism inside it
> >> between data-path and control-path functions.
> >> Actually looking at eb0d471a894 (ethdev: add proactive error handling
> >> mode), and following patches I wonder how it creeped in?
> >> It seems we just introduced a loophole for race condition with this
> >> approach...
> 
> Could you try to describe the specific scenario of loophole ?

Ok, as I understand the existing mechanism: 

When PMD wants to start a recovery it has to:
 - invoke  rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
   That supposed to call user provided callback. After callback is finished PMD assumes
   that user is aware that recovery is about to start and should make some precautions.
- when recovery is finished it invokes another callback: 
  RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either can continue to
  use port or have to treat is as faulty.

The idea is ok in principle, but there is a problem.

lib/ethdev/rte_ethdev.h:
 
         /** Port recovering from a hardware or firmware error.
         * If PMD supports proactive error recovery,
         * it should trigger this event to notify application
         * that it detected an error and the recovery is being started.

<<< !!!!!
         * Upon receiving the event, the application should not invoke any control path API
         * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
         * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
         * The PMD will set the data path pointers to dummy functions,
         * and re-set the data path pointers to non-dummy functions
         * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
<<< !!!!!

That part is just wrong I believe.
It should be:
Upon receiving the event, the application should not invoke any *both control and data-path* API
until receiving  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event. 
Resetting data path pointers to dummy functions by PMD *before* invoking
rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING); 
introduces a race-condition with data-path threads, as such thread could already be inside RX/TX function
or can already read RX/TX function/data pointers and be about to use them.
And right now rte_ethdev layer doesn't provide any mechanism to check it or wait when they'll finish, etc.

So, probably the simplest way to fix it with existing DPDK design:
- user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return only after it ensures that *all*
  application threads (and processes) stopped using either control or data-path functions for that port
  (yes it means that application that wants to use this feature has to provide its own synchronization mechanism
  around data-path functions (RX/TX) that it is going to use). 
- after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.

And message to all PMD developers:
*please stop updating rte_eth_fp_ops[] on your own*.
That's a bad practice and it is not supposed to do things that way.
There is a special API provided for these purposes:
eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.

BTW,  I don't see any implementation for RTE_ETH_EVENT_ERR_RECOVERING within
either testpmd or any other example apps. 
Am I missing something?
If not, then probably it could be a good starting point - let's incorporate it inside testpmd 
(new forwarding engine probably) so everyone can test/try it.

         * It means that the application cannot send or receive any packets
         * during this period.
         * @note Before the PMD reports the recovery result,
         * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event again,
         * because a larger error may occur during the recovery.
         */
        RTE_ETH_EVENT_ERR_RECOVERING, 

> >> It probably needs to be either deprecated or reworked.
> > Looking at the commit, it does not say anything about the data plane functions which probably means, the error recovery is
> happening within the data plane thread. What happens to other data plane threads that are polling the same port on which the error
> recovery is happening?
> 
> The commit log says: "the PMD sets the data path pointers to dummy functions".
> 
> So the data plane threads will receive non-packet and send zero with port which in error recovery.
> 
> >
> > Also, the commit log says that while the error recovery is under progress, the application should not call any control plane APIs. Does
> that mean, the application has to check for error condition every time it calls a control plane API?
> 
> If application has not register event (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control plane API, but it will return
> failed.
> If application has register above callback, it can wait for recovery result, or direct call without wait but this will return failed.
> 
> >
> > The commit message also says that "PMD makes sure the control path operations failed with retcode -EBUSY". It does not say how it
> does this. But, any communication from the PMD thread to control plane thread may introduce race conditions if not done correctly.
> 
> First there are no PMD thread, do you mean eal-intr-thread ?
> 
> As for this question, you can see PMDs which already implement it, they both provides mutual exclusion protection.
> 
> >
> >>
> >>>
> >>>>
> >>>> Would something like this work better?
> >>>>
> >>>> Note: there is another bug in current code. The check for link state
> >>>> interrupt and link_ops could return -ENOTSUP and leave device in
> >> indeterminate state.
> >>>> The check should be done before calling PMD.
> >>>>
> >>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> >>>> 0266cc82acb6..d6c163ed85e7 100644
> >>>> --- a/lib/ethdev/rte_ethdev.c
> >>>> +++ b/lib/ethdev/rte_ethdev.c
> >>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> >>>>  		return 0;
> >>>>  	}
> >>>>
> >>>> +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> >>>> +	    dev->dev_ops->link_update == NULL) {
> >>>> +		RTE_ETHDEV_LOG(INFO,
> >>>> +			       "Device with port_id=%"PRIu16" link update not
> >> supported\n",
> >>>> +			       port_id);
> >>>> +			return -ENOTSUP;
> >>>> +	}
> >>>> +
> >>>>  	ret = rte_eth_dev_info_get(port_id, &dev_info);
> >>>>  	if (ret != 0)
> >>>>  		return ret;
> >>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> >>>>  		eth_dev_mac_restore(dev, &dev_info);
> >>>>
> >>>>  	diag = (*dev->dev_ops->dev_start)(dev);
> >>>> -	if (diag == 0)
> >>>> -		dev->data->dev_started = 1;
> >>>> -	else
> >>>> +	if (diag != 0)
> >>>>  		return eth_err(port_id, diag);
> >>>>
> >>>>  	ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ -1611,16
> >>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> >>>>  		return ret;
> >>>>  	}
> >>>>
> >>>> -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
> >>>> -		if (*dev->dev_ops->link_update == NULL)
> >>>> -			return -ENOTSUP;
> >>>> -		(*dev->dev_ops->link_update)(dev, 0);
> >>>> -	}
> >>>> -
> >>>>  	/* expose selection of PMD fast-path functions */
> >>>>  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> >>>>
> >>>> +	/* ensure state is set before marking device ready */
> >>>> +	rte_smp_wmb();
> >>>> +
> >>>>  	rte_ethdev_trace_start(port_id);
> >>>> +
> >>>> +	/* Update current link state */
> >>>> +	if (dev->data->dev_conf.intr_conf.lsc == 0)
> >>>> +		(*dev->dev_ops->link_update)(dev, 0);
> >>>> +
> >>>>  	return 0;
> >>>>  }
> >>>>
> >>>>
> >>>> .
> >>>>
> >
  
Chengwen Feng Feb. 25, 2023, 1:32 a.m. UTC | #15
On 2023/2/23 21:31, Konstantin Ananyev wrote:
> 
> 
>>>>>>>>> If ethdev enqueue or dequeue function is called during
>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
>>>>>>>>> function pointers, but before setting the pointer to port data.
>>>>>>>>> In this case the newly registered enqueue/dequeue function will
>>>>>>>>> use dummy port data and end up in seg fault.
>>>>>>>>>
>>>>>>>>> This patch moves the updation of each data pointers before
>>>>>>>>> updating corresponding function pointers.
>>>>>>>>>
>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
>>>>>>>>> structure")
>>>>>>>>> Cc: stable@dpdk.org
>>>>>>
>>>>>> Why is something calling enqueue/dequeue when device is not fully
>>>> started.
>>>>>> A correctly written application would not call rx/tx burst until
>>>>>> after ethdev start had finished.
>>>>>
>>>>> Please refer the eb0d471a894 (ethdev: add proactive error handling
>>>>> mode), when driver recover itself, the application may still invoke
>>>> enqueue/dequeue API.
>>>>
>>>> Right now DPDK ethdev layer *does not* provide synchronization
>>>> mechanisms between data-path and control-path functions.
>>>> That was a deliberate deisgn choice. If we want to change that rule, then I
>>>> suppose we need a community consensus for it.
>>>> I think that if the driver wants to provide some sort of error recovery
>>>> procedure, then it has to provide some synchronization mechanism inside it
>>>> between data-path and control-path functions.
>>>> Actually looking at eb0d471a894 (ethdev: add proactive error handling
>>>> mode), and following patches I wonder how it creeped in?
>>>> It seems we just introduced a loophole for race condition with this
>>>> approach...
>>
>> Could you try to describe the specific scenario of loophole ?
> 
> Ok, as I understand the existing mechanism: 
> 
> When PMD wants to start a recovery it has to:
>  - invoke  rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>    That supposed to call user provided callback. After callback is finished PMD assumes
>    that user is aware that recovery is about to start and should make some precautions.
> - when recovery is finished it invokes another callback: 
>   RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either can continue to
>   use port or have to treat is as faulty.
> 
> The idea is ok in principle, but there is a problem.
> 
> lib/ethdev/rte_ethdev.h:
>  
>          /** Port recovering from a hardware or firmware error.
>          * If PMD supports proactive error recovery,
>          * it should trigger this event to notify application
>          * that it detected an error and the recovery is being started.
> 
> <<< !!!!!
>          * Upon receiving the event, the application should not invoke any control path API
>          * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
>          * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
>          * The PMD will set the data path pointers to dummy functions,
>          * and re-set the data path pointers to non-dummy functions
>          * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> <<< !!!!!
> 
> That part is just wrong I believe.
> It should be:
> Upon receiving the event, the application should not invoke any *both control and data-path* API
> until receiving  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event. 
> Resetting data path pointers to dummy functions by PMD *before* invoking
> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING); 
> introduces a race-condition with data-path threads, as such thread could already be inside RX/TX function
> or can already read RX/TX function/data pointers and be about to use them.

Current practices: the PMDs already add some delay after set Rx/Tx callback to dummy, and plus the DPDK
worker thread is busypolling, the probability of occurence in reality is zero. But in theoretically exist
the above race-condition.

> And right now rte_ethdev layer doesn't provide any mechanism to check it or wait when they'll finish, etc.

Yes

> 
> So, probably the simplest way to fix it with existing DPDK design:
> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return only after it ensures that *all*
>   application threads (and processes) stopped using either control or data-path functions for that port

Agree

>   (yes it means that application that wants to use this feature has to provide its own synchronization mechanism
>   around data-path functions (RX/TX) that it is going to use). 
> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
> 
> And message to all PMD developers:
> *please stop updating rte_eth_fp_ops[] on your own*.
> That's a bad practice and it is not supposed to do things that way.
> There is a special API provided for these purposes:
> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.

This two function is in private.h, so it should be expose to public header file.

> 
> BTW,  I don't see any implementation for RTE_ETH_EVENT_ERR_RECOVERING within
> either testpmd or any other example apps. 
> Am I missing something?

Currently it just promote the event.

> If not, then probably it could be a good starting point - let's incorporate it inside testpmd 
> (new forwarding engine probably) so everyone can test/try it.
> 
>          * It means that the application cannot send or receive any packets
>          * during this period.
>          * @note Before the PMD reports the recovery result,
>          * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event again,
>          * because a larger error may occur during the recovery.
>          */
>         RTE_ETH_EVENT_ERR_RECOVERING, 
> 
>>>> It probably needs to be either deprecated or reworked.
>>> Looking at the commit, it does not say anything about the data plane functions which probably means, the error recovery is
>> happening within the data plane thread. What happens to other data plane threads that are polling the same port on which the error
>> recovery is happening?
>>
>> The commit log says: "the PMD sets the data path pointers to dummy functions".
>>
>> So the data plane threads will receive non-packet and send zero with port which in error recovery.
>>
>>>
>>> Also, the commit log says that while the error recovery is under progress, the application should not call any control plane APIs. Does
>> that mean, the application has to check for error condition every time it calls a control plane API?
>>
>> If application has not register event (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control plane API, but it will return
>> failed.
>> If application has register above callback, it can wait for recovery result, or direct call without wait but this will return failed.
>>
>>>
>>> The commit message also says that "PMD makes sure the control path operations failed with retcode -EBUSY". It does not say how it
>> does this. But, any communication from the PMD thread to control plane thread may introduce race conditions if not done correctly.
>>
>> First there are no PMD thread, do you mean eal-intr-thread ?
>>
>> As for this question, you can see PMDs which already implement it, they both provides mutual exclusion protection.
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Would something like this work better?
>>>>>>
>>>>>> Note: there is another bug in current code. The check for link state
>>>>>> interrupt and link_ops could return -ENOTSUP and leave device in
>>>> indeterminate state.
>>>>>> The check should be done before calling PMD.
>>>>>>
>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
>>>>>> 0266cc82acb6..d6c163ed85e7 100644
>>>>>> --- a/lib/ethdev/rte_ethdev.c
>>>>>> +++ b/lib/ethdev/rte_ethdev.c
>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>  		return 0;
>>>>>>  	}
>>>>>>
>>>>>> +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
>>>>>> +	    dev->dev_ops->link_update == NULL) {
>>>>>> +		RTE_ETHDEV_LOG(INFO,
>>>>>> +			       "Device with port_id=%"PRIu16" link update not
>>>> supported\n",
>>>>>> +			       port_id);
>>>>>> +			return -ENOTSUP;
>>>>>> +	}
>>>>>> +
>>>>>>  	ret = rte_eth_dev_info_get(port_id, &dev_info);
>>>>>>  	if (ret != 0)
>>>>>>  		return ret;
>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>  		eth_dev_mac_restore(dev, &dev_info);
>>>>>>
>>>>>>  	diag = (*dev->dev_ops->dev_start)(dev);
>>>>>> -	if (diag == 0)
>>>>>> -		dev->data->dev_started = 1;
>>>>>> -	else
>>>>>> +	if (diag != 0)
>>>>>>  		return eth_err(port_id, diag);
>>>>>>
>>>>>>  	ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ -1611,16
>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>  		return ret;
>>>>>>  	}
>>>>>>
>>>>>> -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
>>>>>> -		if (*dev->dev_ops->link_update == NULL)
>>>>>> -			return -ENOTSUP;
>>>>>> -		(*dev->dev_ops->link_update)(dev, 0);
>>>>>> -	}
>>>>>> -
>>>>>>  	/* expose selection of PMD fast-path functions */
>>>>>>  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
>>>>>>
>>>>>> +	/* ensure state is set before marking device ready */
>>>>>> +	rte_smp_wmb();
>>>>>> +
>>>>>>  	rte_ethdev_trace_start(port_id);
>>>>>> +
>>>>>> +	/* Update current link state */
>>>>>> +	if (dev->data->dev_conf.intr_conf.lsc == 0)
>>>>>> +		(*dev->dev_ops->link_update)(dev, 0);
>>>>>> +
>>>>>>  	return 0;
>>>>>>  }
>>>>>>
>>>>>>
>>>>>> .
>>>>>>
>>>
  
Konstantin Ananyev Feb. 26, 2023, 5:22 p.m. UTC | #16
>>>>>>>>>> If ethdev enqueue or dequeue function is called during
>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
>>>>>>>>>> function pointers, but before setting the pointer to port data.
>>>>>>>>>> In this case the newly registered enqueue/dequeue function will
>>>>>>>>>> use dummy port data and end up in seg fault.
>>>>>>>>>>
>>>>>>>>>> This patch moves the updation of each data pointers before
>>>>>>>>>> updating corresponding function pointers.
>>>>>>>>>>
>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
>>>>>>>>>> structure")
>>>>>>>>>> Cc: stable@dpdk.org
>>>>>>>
>>>>>>> Why is something calling enqueue/dequeue when device is not fully
>>>>> started.
>>>>>>> A correctly written application would not call rx/tx burst until
>>>>>>> after ethdev start had finished.
>>>>>>
>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error handling
>>>>>> mode), when driver recover itself, the application may still invoke
>>>>> enqueue/dequeue API.
>>>>>
>>>>> Right now DPDK ethdev layer *does not* provide synchronization
>>>>> mechanisms between data-path and control-path functions.
>>>>> That was a deliberate deisgn choice. If we want to change that rule, then I
>>>>> suppose we need a community consensus for it.
>>>>> I think that if the driver wants to provide some sort of error recovery
>>>>> procedure, then it has to provide some synchronization mechanism inside it
>>>>> between data-path and control-path functions.
>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error handling
>>>>> mode), and following patches I wonder how it creeped in?
>>>>> It seems we just introduced a loophole for race condition with this
>>>>> approach...
>>>
>>> Could you try to describe the specific scenario of loophole ?
>>
>> Ok, as I understand the existing mechanism:
>>
>> When PMD wants to start a recovery it has to:
>>   - invoke  rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>>     That supposed to call user provided callback. After callback is finished PMD assumes
>>     that user is aware that recovery is about to start and should make some precautions.
>> - when recovery is finished it invokes another callback:
>>    RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either can continue to
>>    use port or have to treat is as faulty.
>>
>> The idea is ok in principle, but there is a problem.
>>
>> lib/ethdev/rte_ethdev.h:
>>   
>>           /** Port recovering from a hardware or firmware error.
>>           * If PMD supports proactive error recovery,
>>           * it should trigger this event to notify application
>>           * that it detected an error and the recovery is being started.
>>
>> <<< !!!!!
>>           * Upon receiving the event, the application should not invoke any control path API
>>           * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
>>           * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
>>           * The PMD will set the data path pointers to dummy functions,
>>           * and re-set the data path pointers to non-dummy functions
>>           * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>> <<< !!!!!
>>
>> That part is just wrong I believe.
>> It should be:
>> Upon receiving the event, the application should not invoke any *both control and data-path* API
>> until receiving  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
>> Resetting data path pointers to dummy functions by PMD *before* invoking
>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>> introduces a race-condition with data-path threads, as such thread could already be inside RX/TX function
>> or can already read RX/TX function/data pointers and be about to use them.
> 
> Current practices: the PMDs already add some delay after set Rx/Tx callback to dummy, and plus the DPDK
> worker thread is busypolling, the probability of occurence in reality is zero. But in theoretically exist
> the above race-condition.


Adding delay might make a problem a bit less reproducible,
but it doesn't fix it.
The bug is still there.


> 
>> And right now rte_ethdev layer doesn't provide any mechanism to check it or wait when they'll finish, etc.
> 
> Yes
> 
>>
>> So, probably the simplest way to fix it with existing DPDK design:
>> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return only after it ensures that *all*
>>    application threads (and processes) stopped using either control or data-path functions for that port
> 
> Agree
> 
>>    (yes it means that application that wants to use this feature has to provide its own synchronization mechanism
>>    around data-path functions (RX/TX) that it is going to use).
>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
>>
>> And message to all PMD developers:
>> *please stop updating rte_eth_fp_ops[] on your own*.
>> That's a bad practice and it is not supposed to do things that way.
>> There is a special API provided for these purposes:
>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
> 
> This two function is in private.h, so it should be expose to public header file.

You mean we need to move these functions declarations into ethdev_driver.h?
If so, then yes, I think we probably do.


>>
>> BTW,  I don't see any implementation for RTE_ETH_EVENT_ERR_RECOVERING within
>> either testpmd or any other example apps.
>> Am I missing something?
> 
> Currently it just promote the event.


Ok, can I suggest then to add a proper usage for into in testpmd?
It looks really strange that we add new feature into ethdev (and 2 PMDs),
but didn't provide any way for users to test it.

> 
>> If not, then probably it could be a good starting point - let's incorporate it inside testpmd
>> (new forwarding engine probably) so everyone can test/try it.
>>
>>           * It means that the application cannot send or receive any packets
>>           * during this period.
>>           * @note Before the PMD reports the recovery result,
>>           * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event again,
>>           * because a larger error may occur during the recovery.
>>           */
>>          RTE_ETH_EVENT_ERR_RECOVERING,
>>
>>>>> It probably needs to be either deprecated or reworked.
>>>> Looking at the commit, it does not say anything about the data plane functions which probably means, the error recovery is
>>> happening within the data plane thread. What happens to other data plane threads that are polling the same port on which the error
>>> recovery is happening?
>>>
>>> The commit log says: "the PMD sets the data path pointers to dummy functions".
>>>
>>> So the data plane threads will receive non-packet and send zero with port which in error recovery.
>>>
>>>>
>>>> Also, the commit log says that while the error recovery is under progress, the application should not call any control plane APIs. Does
>>> that mean, the application has to check for error condition every time it calls a control plane API?
>>>
>>> If application has not register event (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control plane API, but it will return
>>> failed.
>>> If application has register above callback, it can wait for recovery result, or direct call without wait but this will return failed.
>>>
>>>>
>>>> The commit message also says that "PMD makes sure the control path operations failed with retcode -EBUSY". It does not say how it
>>> does this. But, any communication from the PMD thread to control plane thread may introduce race conditions if not done correctly.
>>>
>>> First there are no PMD thread, do you mean eal-intr-thread ?
>>>
>>> As for this question, you can see PMDs which already implement it, they both provides mutual exclusion protection.
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Would something like this work better?
>>>>>>>
>>>>>>> Note: there is another bug in current code. The check for link state
>>>>>>> interrupt and link_ops could return -ENOTSUP and leave device in
>>>>> indeterminate state.
>>>>>>> The check should be done before calling PMD.
>>>>>>>
>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
>>>>>>> 0266cc82acb6..d6c163ed85e7 100644
>>>>>>> --- a/lib/ethdev/rte_ethdev.c
>>>>>>> +++ b/lib/ethdev/rte_ethdev.c
>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>   		return 0;
>>>>>>>   	}
>>>>>>>
>>>>>>> +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
>>>>>>> +	    dev->dev_ops->link_update == NULL) {
>>>>>>> +		RTE_ETHDEV_LOG(INFO,
>>>>>>> +			       "Device with port_id=%"PRIu16" link update not
>>>>> supported\n",
>>>>>>> +			       port_id);
>>>>>>> +			return -ENOTSUP;
>>>>>>> +	}
>>>>>>> +
>>>>>>>   	ret = rte_eth_dev_info_get(port_id, &dev_info);
>>>>>>>   	if (ret != 0)
>>>>>>>   		return ret;
>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>   		eth_dev_mac_restore(dev, &dev_info);
>>>>>>>
>>>>>>>   	diag = (*dev->dev_ops->dev_start)(dev);
>>>>>>> -	if (diag == 0)
>>>>>>> -		dev->data->dev_started = 1;
>>>>>>> -	else
>>>>>>> +	if (diag != 0)
>>>>>>>   		return eth_err(port_id, diag);
>>>>>>>
>>>>>>>   	ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ -1611,16
>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>   		return ret;
>>>>>>>   	}
>>>>>>>
>>>>>>> -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
>>>>>>> -		if (*dev->dev_ops->link_update == NULL)
>>>>>>> -			return -ENOTSUP;
>>>>>>> -		(*dev->dev_ops->link_update)(dev, 0);
>>>>>>> -	}
>>>>>>> -
>>>>>>>   	/* expose selection of PMD fast-path functions */
>>>>>>>   	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
>>>>>>>
>>>>>>> +	/* ensure state is set before marking device ready */
>>>>>>> +	rte_smp_wmb();
>>>>>>> +
>>>>>>>   	rte_ethdev_trace_start(port_id);
>>>>>>> +
>>>>>>> +	/* Update current link state */
>>>>>>> +	if (dev->data->dev_conf.intr_conf.lsc == 0)
>>>>>>> +		(*dev->dev_ops->link_update)(dev, 0);
>>>>>>> +
>>>>>>>   	return 0;
>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>
  
Chengwen Feng Feb. 27, 2023, 2:56 a.m. UTC | #17
On 2023/2/27 1:22, Konstantin Ananyev wrote:
> 
>>>>>>>>>>> If ethdev enqueue or dequeue function is called during
>>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
>>>>>>>>>>> function pointers, but before setting the pointer to port data.
>>>>>>>>>>> In this case the newly registered enqueue/dequeue function will
>>>>>>>>>>> use dummy port data and end up in seg fault.
>>>>>>>>>>>
>>>>>>>>>>> This patch moves the updation of each data pointers before
>>>>>>>>>>> updating corresponding function pointers.
>>>>>>>>>>>
>>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
>>>>>>>>>>> structure")
>>>>>>>>>>> Cc: stable@dpdk.org
>>>>>>>>
>>>>>>>> Why is something calling enqueue/dequeue when device is not fully
>>>>>> started.
>>>>>>>> A correctly written application would not call rx/tx burst until
>>>>>>>> after ethdev start had finished.
>>>>>>>
>>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error handling
>>>>>>> mode), when driver recover itself, the application may still invoke
>>>>>> enqueue/dequeue API.
>>>>>>
>>>>>> Right now DPDK ethdev layer *does not* provide synchronization
>>>>>> mechanisms between data-path and control-path functions.
>>>>>> That was a deliberate deisgn choice. If we want to change that rule, then I
>>>>>> suppose we need a community consensus for it.
>>>>>> I think that if the driver wants to provide some sort of error recovery
>>>>>> procedure, then it has to provide some synchronization mechanism inside it
>>>>>> between data-path and control-path functions.
>>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error handling
>>>>>> mode), and following patches I wonder how it creeped in?
>>>>>> It seems we just introduced a loophole for race condition with this
>>>>>> approach...
>>>>
>>>> Could you try to describe the specific scenario of loophole ?
>>>
>>> Ok, as I understand the existing mechanism:
>>>
>>> When PMD wants to start a recovery it has to:
>>>   - invoke  rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>>>     That supposed to call user provided callback. After callback is finished PMD assumes
>>>     that user is aware that recovery is about to start and should make some precautions.
>>> - when recovery is finished it invokes another callback:
>>>    RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either can continue to
>>>    use port or have to treat is as faulty.
>>>
>>> The idea is ok in principle, but there is a problem.
>>>
>>> lib/ethdev/rte_ethdev.h:
>>>             /** Port recovering from a hardware or firmware error.
>>>           * If PMD supports proactive error recovery,
>>>           * it should trigger this event to notify application
>>>           * that it detected an error and the recovery is being started.
>>>
>>> <<< !!!!!
>>>           * Upon receiving the event, the application should not invoke any control path API
>>>           * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
>>>           * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
>>>           * The PMD will set the data path pointers to dummy functions,
>>>           * and re-set the data path pointers to non-dummy functions
>>>           * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>>> <<< !!!!!
>>>
>>> That part is just wrong I believe.
>>> It should be:
>>> Upon receiving the event, the application should not invoke any *both control and data-path* API
>>> until receiving  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
>>> Resetting data path pointers to dummy functions by PMD *before* invoking
>>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>>> introduces a race-condition with data-path threads, as such thread could already be inside RX/TX function
>>> or can already read RX/TX function/data pointers and be about to use them.
>>
>> Current practices: the PMDs already add some delay after set Rx/Tx callback to dummy, and plus the DPDK
>> worker thread is busypolling, the probability of occurence in reality is zero. But in theoretically exist
>> the above race-condition.
> 
> 
> Adding delay might make a problem a bit less reproducible,
> but it doesn't fix it.
> The bug is still there.
> 
> 
>>
>>> And right now rte_ethdev layer doesn't provide any mechanism to check it or wait when they'll finish, etc.
>>
>> Yes
>>
>>>
>>> So, probably the simplest way to fix it with existing DPDK design:
>>> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return only after it ensures that *all*
>>>    application threads (and processes) stopped using either control or data-path functions for that port
>>
>> Agree
>>
>>>    (yes it means that application that wants to use this feature has to provide its own synchronization mechanism
>>>    around data-path functions (RX/TX) that it is going to use).
>>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
>>>
>>> And message to all PMD developers:
>>> *please stop updating rte_eth_fp_ops[] on your own*.
>>> That's a bad practice and it is not supposed to do things that way.
>>> There is a special API provided for these purposes:
>>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
>>
>> This two function is in private.h, so it should be expose to public header file.
> 
> You mean we need to move these functions declarations into ethdev_driver.h?
> If so, then yes, I think we probably do.
> 
> 
>>>
>>> BTW,  I don't see any implementation for RTE_ETH_EVENT_ERR_RECOVERING within
>>> either testpmd or any other example apps.
>>> Am I missing something?
>>
>> Currently it just promote the event.
> 
> 
> Ok, can I suggest then to add a proper usage for into in testpmd?

our team will do that, thanks.

> It looks really strange that we add new feature into ethdev (and 2 PMDs),
> but didn't provide any way for users to test it.
> 
>>
>>> If not, then probably it could be a good starting point - let's incorporate it inside testpmd
>>> (new forwarding engine probably) so everyone can test/try it.
>>>
>>>           * It means that the application cannot send or receive any packets
>>>           * during this period.
>>>           * @note Before the PMD reports the recovery result,
>>>           * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event again,
>>>           * because a larger error may occur during the recovery.
>>>           */
>>>          RTE_ETH_EVENT_ERR_RECOVERING,
>>>
>>>>>> It probably needs to be either deprecated or reworked.
>>>>> Looking at the commit, it does not say anything about the data plane functions which probably means, the error recovery is
>>>> happening within the data plane thread. What happens to other data plane threads that are polling the same port on which the error
>>>> recovery is happening?
>>>>
>>>> The commit log says: "the PMD sets the data path pointers to dummy functions".
>>>>
>>>> So the data plane threads will receive non-packet and send zero with port which in error recovery.
>>>>
>>>>>
>>>>> Also, the commit log says that while the error recovery is under progress, the application should not call any control plane APIs. Does
>>>> that mean, the application has to check for error condition every time it calls a control plane API?
>>>>
>>>> If application has not register event (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control plane API, but it will return
>>>> failed.
>>>> If application has register above callback, it can wait for recovery result, or direct call without wait but this will return failed.
>>>>
>>>>>
>>>>> The commit message also says that "PMD makes sure the control path operations failed with retcode -EBUSY". It does not say how it
>>>> does this. But, any communication from the PMD thread to control plane thread may introduce race conditions if not done correctly.
>>>>
>>>> First there are no PMD thread, do you mean eal-intr-thread ?
>>>>
>>>> As for this question, you can see PMDs which already implement it, they both provides mutual exclusion protection.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Would something like this work better?
>>>>>>>>
>>>>>>>> Note: there is another bug in current code. The check for link state
>>>>>>>> interrupt and link_ops could return -ENOTSUP and leave device in
>>>>>> indeterminate state.
>>>>>>>> The check should be done before calling PMD.
>>>>>>>>
>>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
>>>>>>>> 0266cc82acb6..d6c163ed85e7 100644
>>>>>>>> --- a/lib/ethdev/rte_ethdev.c
>>>>>>>> +++ b/lib/ethdev/rte_ethdev.c
>>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>           return 0;
>>>>>>>>       }
>>>>>>>>
>>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0 &&
>>>>>>>> +        dev->dev_ops->link_update == NULL) {
>>>>>>>> +        RTE_ETHDEV_LOG(INFO,
>>>>>>>> +                   "Device with port_id=%"PRIu16" link update not
>>>>>> supported\n",
>>>>>>>> +                   port_id);
>>>>>>>> +            return -ENOTSUP;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>       ret = rte_eth_dev_info_get(port_id, &dev_info);
>>>>>>>>       if (ret != 0)
>>>>>>>>           return ret;
>>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>           eth_dev_mac_restore(dev, &dev_info);
>>>>>>>>
>>>>>>>>       diag = (*dev->dev_ops->dev_start)(dev);
>>>>>>>> -    if (diag == 0)
>>>>>>>> -        dev->data->dev_started = 1;
>>>>>>>> -    else
>>>>>>>> +    if (diag != 0)
>>>>>>>>           return eth_err(port_id, diag);
>>>>>>>>
>>>>>>>>       ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ -1611,16
>>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>           return ret;
>>>>>>>>       }
>>>>>>>>
>>>>>>>> -    if (dev->data->dev_conf.intr_conf.lsc == 0) {
>>>>>>>> -        if (*dev->dev_ops->link_update == NULL)
>>>>>>>> -            return -ENOTSUP;
>>>>>>>> -        (*dev->dev_ops->link_update)(dev, 0);
>>>>>>>> -    }
>>>>>>>> -
>>>>>>>>       /* expose selection of PMD fast-path functions */
>>>>>>>>       eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
>>>>>>>>
>>>>>>>> +    /* ensure state is set before marking device ready */
>>>>>>>> +    rte_smp_wmb();
>>>>>>>> +
>>>>>>>>       rte_ethdev_trace_start(port_id);
>>>>>>>> +
>>>>>>>> +    /* Update current link state */
>>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0)
>>>>>>>> +        (*dev->dev_ops->link_update)(dev, 0);
>>>>>>>> +
>>>>>>>>       return 0;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>
>>>>>>>> .
>>>>>>>>
>>>>>
> 
> .
  
Konstantin Ananyev Feb. 27, 2023, 7:08 p.m. UTC | #18
> >
> >
> >>>
> >>> BTW,  I don't see any implementation for RTE_ETH_EVENT_ERR_RECOVERING within
> >>> either testpmd or any other example apps.
> >>> Am I missing something?
> >>
> >> Currently it just promote the event.
> >
> >
> > Ok, can I suggest then to add a proper usage for into in testpmd?
> 
> our team will do that, thanks.

That's great, thanks for that.
  
Honnappa Nagarahalli Feb. 28, 2023, 11:57 p.m. UTC | #19
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Thursday, February 23, 2023 7:31 AM
> To: Fengchengwen <fengchengwen@huawei.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Stephen Hemminger
> <stephen@networkplumber.org>; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> Ajit Khaparde (ajit.khaparde@broadcom.com)
> <ajit.khaparde@broadcom.com>
> Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> thomas@monjalon.net; dev@dpdk.org; s.v.naga.harish.k@intel.com;
> erik.g.carrillo@intel.com; abhinandan.gujjar@intel.com; stable@dpdk.org; nd
> <nd@arm.com>
> Subject: RE: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> 
> 
> 
> > >>>>>>> If ethdev enqueue or dequeue function is called during
> > >>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> > >>>>>>> the function pointers, but before setting the pointer to port data.
> > >>>>>>> In this case the newly registered enqueue/dequeue function
> > >>>>>>> will use dummy port data and end up in seg fault.
> > >>>>>>>
> > >>>>>>> This patch moves the updation of each data pointers before
> > >>>>>>> updating corresponding function pointers.
> > >>>>>>>
> > >>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
> > >>>>>>> structure")
> > >>>>>>> Cc: stable@dpdk.org
> > >>>>
> > >>>> Why is something calling enqueue/dequeue when device is not fully
> > >> started.
> > >>>> A correctly written application would not call rx/tx burst until
> > >>>> after ethdev start had finished.
> > >>>
> > >>> Please refer the eb0d471a894 (ethdev: add proactive error handling
> > >>> mode), when driver recover itself, the application may still
> > >>> invoke
> > >> enqueue/dequeue API.
> > >>
> > >> Right now DPDK ethdev layer *does not* provide synchronization
> > >> mechanisms between data-path and control-path functions.
> > >> That was a deliberate deisgn choice. If we want to change that
> > >> rule, then I suppose we need a community consensus for it.
> > >> I think that if the driver wants to provide some sort of error
> > >> recovery procedure, then it has to provide some synchronization
> > >> mechanism inside it between data-path and control-path functions.
> > >> Actually looking at eb0d471a894 (ethdev: add proactive error
> > >> handling mode), and following patches I wonder how it creeped in?
> > >> It seems we just introduced a loophole for race condition with this
> > >> approach...
> >
> > Could you try to describe the specific scenario of loophole ?
> 
> Ok, as I understand the existing mechanism:
> 
> When PMD wants to start a recovery it has to:
>  - invoke  rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>    That supposed to call user provided callback. After callback is finished PMD
> assumes
>    that user is aware that recovery is about to start and should make some
> precautions.
> - when recovery is finished it invokes another callback:
>   RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either can
> continue to
>   use port or have to treat is as faulty.
> 
> The idea is ok in principle, but there is a problem.
> 
> lib/ethdev/rte_ethdev.h:
> 
>          /** Port recovering from a hardware or firmware error.
>          * If PMD supports proactive error recovery,
>          * it should trigger this event to notify application
>          * that it detected an error and the recovery is being started.
> 
> <<< !!!!!
>          * Upon receiving the event, the application should not invoke any control
> path API
>          * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
>          * RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED event.
>          * The PMD will set the data path pointers to dummy functions,
>          * and re-set the data path pointers to non-dummy functions
>          * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> <<< !!!!!
> 
> That part is just wrong I believe.
> It should be:
> Upon receiving the event, the application should not invoke any *both control
> and data-path* API until receiving  RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED event.
> Resetting data path pointers to dummy functions by PMD *before* invoking
> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> introduces a race-condition with data-path threads, as such thread could
> already be inside RX/TX function or can already read RX/TX function/data
> pointers and be about to use them.
> And right now rte_ethdev layer doesn't provide any mechanism to check it or
> wait when they'll finish, etc.
> 
> So, probably the simplest way to fix it with existing DPDK design:
> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return only
> after it ensures that *all*
>   application threads (and processes) stopped using either control or data-path
> functions for that port
>   (yes it means that application that wants to use this feature has to provide its
> own synchronization mechanism
>   around data-path functions (RX/TX) that it is going to use).
Does this mean the application does not call either control plane or data plane APIs after the callback returns?
If the application can do this in the call back function, can it do the same outside of the call back function?

Correct me if I am wrong, I believe the call back is called in the context of the EAL thread. There could be multiple threads using the same port. There is a possibility that all these threads might call the call back function. So, who owns the responsibility to ensure the call back function is executed only once? PMD or the call back function? 

> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
Why is this required if the control plane and data plane threads are not going to call any APIs? Resetting all these pointers is not at atomic operation, does it introduce any problems? For ex: if the application has to call function1 and function2 in sequence, what happens if function1 was not null but function2 became NULL by the time it is called?


How about a more simpler approach?
It should be possible to return an error code from the rte_eth_rx_burst API. The responsibility to stop calling any control plane and data plane APIs (this requires a simple synchronization mechanism. The cost of that should be less when there are no errors. I see applications like VPP already implement them) can be left to the application. The application can call the recovery API and release all the threads if the recovery was successful.

> 
> And message to all PMD developers:
> *please stop updating rte_eth_fp_ops[] on your own*.
> That's a bad practice and it is not supposed to do things that way.
> There is a special API provided for these purposes:
> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
> 
> BTW,  I don't see any implementation for RTE_ETH_EVENT_ERR_RECOVERING
> within either testpmd or any other example apps.
> Am I missing something?
> If not, then probably it could be a good starting point - let's incorporate it inside
> testpmd (new forwarding engine probably) so everyone can test/try it.
> 
>          * It means that the application cannot send or receive any packets
>          * during this period.
>          * @note Before the PMD reports the recovery result,
>          * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event
> again,
>          * because a larger error may occur during the recovery.
>          */
>         RTE_ETH_EVENT_ERR_RECOVERING,
> 
> > >> It probably needs to be either deprecated or reworked.
> > > Looking at the commit, it does not say anything about the data plane
> > > functions which probably means, the error recovery is
> > happening within the data plane thread. What happens to other data
> > plane threads that are polling the same port on which the error recovery is
> happening?
> >
> > The commit log says: "the PMD sets the data path pointers to dummy
> functions".
> >
> > So the data plane threads will receive non-packet and send zero with port
> which in error recovery.
> >
> > >
> > > Also, the commit log says that while the error recovery is under
> > > progress, the application should not call any control plane APIs.
> > > Does
> > that mean, the application has to check for error condition every time it calls a
> control plane API?
> >
> > If application has not register event (RTE_ETH_EVENT_ERR_RECOVERING)
> > callback, it could calls control plane API, but it will return failed.
> > If application has register above callback, it can wait for recovery result, or
> direct call without wait but this will return failed.
> >
> > >
> > > The commit message also says that "PMD makes sure the control path
> > > operations failed with retcode -EBUSY". It does not say how it
> > does this. But, any communication from the PMD thread to control plane
> thread may introduce race conditions if not done correctly.
> >
> > First there are no PMD thread, do you mean eal-intr-thread ?
> >
> > As for this question, you can see PMDs which already implement it, they both
> provides mutual exclusion protection.
> >
> > >
> > >>
> > >>>
> > >>>>
> > >>>> Would something like this work better?
> > >>>>
> > >>>> Note: there is another bug in current code. The check for link
> > >>>> state interrupt and link_ops could return -ENOTSUP and leave
> > >>>> device in
> > >> indeterminate state.
> > >>>> The check should be done before calling PMD.
> > >>>>
> > >>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > >>>> index
> > >>>> 0266cc82acb6..d6c163ed85e7 100644
> > >>>> --- a/lib/ethdev/rte_ethdev.c
> > >>>> +++ b/lib/ethdev/rte_ethdev.c
> > >>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> > >>>>  		return 0;
> > >>>>  	}
> > >>>>
> > >>>> +	if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> > >>>> +	    dev->dev_ops->link_update == NULL) {
> > >>>> +		RTE_ETHDEV_LOG(INFO,
> > >>>> +			       "Device with port_id=%"PRIu16" link
> update not
> > >> supported\n",
> > >>>> +			       port_id);
> > >>>> +			return -ENOTSUP;
> > >>>> +	}
> > >>>> +
> > >>>>  	ret = rte_eth_dev_info_get(port_id, &dev_info);
> > >>>>  	if (ret != 0)
> > >>>>  		return ret;
> > >>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> > >>>>  		eth_dev_mac_restore(dev, &dev_info);
> > >>>>
> > >>>>  	diag = (*dev->dev_ops->dev_start)(dev);
> > >>>> -	if (diag == 0)
> > >>>> -		dev->data->dev_started = 1;
> > >>>> -	else
> > >>>> +	if (diag != 0)
> > >>>>  		return eth_err(port_id, diag);
> > >>>>
> > >>>>  	ret = eth_dev_config_restore(dev, &dev_info, port_id); @@
> > >>>> -1611,16
> > >>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> > >>>>  		return ret;
> > >>>>  	}
> > >>>>
> > >>>> -	if (dev->data->dev_conf.intr_conf.lsc == 0) {
> > >>>> -		if (*dev->dev_ops->link_update == NULL)
> > >>>> -			return -ENOTSUP;
> > >>>> -		(*dev->dev_ops->link_update)(dev, 0);
> > >>>> -	}
> > >>>> -
> > >>>>  	/* expose selection of PMD fast-path functions */
> > >>>>  	eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> > >>>>
> > >>>> +	/* ensure state is set before marking device ready */
> > >>>> +	rte_smp_wmb();
> > >>>> +
> > >>>>  	rte_ethdev_trace_start(port_id);
> > >>>> +
> > >>>> +	/* Update current link state */
> > >>>> +	if (dev->data->dev_conf.intr_conf.lsc == 0)
> > >>>> +		(*dev->dev_ops->link_update)(dev, 0);
> > >>>> +
> > >>>>  	return 0;
> > >>>>  }
> > >>>>
> > >>>>
> > >>>> .
> > >>>>
> > >
  
Ferruh Yigit March 3, 2023, 4:49 p.m. UTC | #20
On 2/20/2023 6:08 AM, Ashok Kaladi wrote:
> If ethdev enqueue or dequeue function is called during
> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> the function pointers, but before setting the pointer to port data.
> In this case the newly registered enqueue/dequeue function will use
> dummy port data and end up in seg fault.
> 
> This patch moves the updation of each data pointers before updating
> corresponding function pointers.
> 
> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate structure")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Ashok Kaladi <ashok.k.kaladi@intel.com>
> 
> diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
> index 48090c879a..a0232c669f 100644
> --- a/lib/ethdev/ethdev_private.c
> +++ b/lib/ethdev/ethdev_private.c
> @@ -270,17 +270,17 @@ void
>  eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
>  		const struct rte_eth_dev *dev)
>  {
> +	fpo->rxq.data = dev->data->rx_queues;
>  	fpo->rx_pkt_burst = dev->rx_pkt_burst;
> +	fpo->txq.data = dev->data->tx_queues;
>  	fpo->tx_pkt_burst = dev->tx_pkt_burst;
>  	fpo->tx_pkt_prepare = dev->tx_pkt_prepare;
>  	fpo->rx_queue_count = dev->rx_queue_count;
>  	fpo->rx_descriptor_status = dev->rx_descriptor_status;
>  	fpo->tx_descriptor_status = dev->tx_descriptor_status;
>  
> -	fpo->rxq.data = dev->data->rx_queues;
>  	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
>  
> -	fpo->txq.data = dev->data->tx_queues;
>  	fpo->txq.clbk = (void **)(uintptr_t)dev->pre_tx_burst_cbs;
>  }
>  

Hi Ashok,

The discussion evolved to proactive recovery, but I wonder if that was
your concern or use case?
If not can you please describe your use case more?
  
Ferruh Yigit March 3, 2023, 5:19 p.m. UTC | #21
On 2/26/2023 5:22 PM, Konstantin Ananyev wrote:
> 
>>>>>>>>>>> If ethdev enqueue or dequeue function is called during
>>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
>>>>>>>>>>> function pointers, but before setting the pointer to port data.
>>>>>>>>>>> In this case the newly registered enqueue/dequeue function will
>>>>>>>>>>> use dummy port data and end up in seg fault.
>>>>>>>>>>>
>>>>>>>>>>> This patch moves the updation of each data pointers before
>>>>>>>>>>> updating corresponding function pointers.
>>>>>>>>>>>
>>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
>>>>>>>>>>> structure")
>>>>>>>>>>> Cc: stable@dpdk.org
>>>>>>>>
>>>>>>>> Why is something calling enqueue/dequeue when device is not fully
>>>>>> started.
>>>>>>>> A correctly written application would not call rx/tx burst until
>>>>>>>> after ethdev start had finished.
>>>>>>>
>>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error handling
>>>>>>> mode), when driver recover itself, the application may still invoke
>>>>>> enqueue/dequeue API.
>>>>>>
>>>>>> Right now DPDK ethdev layer *does not* provide synchronization
>>>>>> mechanisms between data-path and control-path functions.
>>>>>> That was a deliberate deisgn choice. If we want to change that
>>>>>> rule, then I
>>>>>> suppose we need a community consensus for it.
>>>>>> I think that if the driver wants to provide some sort of error
>>>>>> recovery
>>>>>> procedure, then it has to provide some synchronization mechanism
>>>>>> inside it
>>>>>> between data-path and control-path functions.
>>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error handling
>>>>>> mode), and following patches I wonder how it creeped in?
>>>>>> It seems we just introduced a loophole for race condition with this
>>>>>> approach...
>>>>
>>>> Could you try to describe the specific scenario of loophole ?
>>>
>>> Ok, as I understand the existing mechanism:
>>>
>>> When PMD wants to start a recovery it has to:
>>>   - invoke  rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>>>     That supposed to call user provided callback. After callback is
>>> finished PMD assumes
>>>     that user is aware that recovery is about to start and should
>>> make some precautions.
>>> - when recovery is finished it invokes another callback:
>>>    RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either
>>> can continue to
>>>    use port or have to treat is as faulty.
>>>
>>> The idea is ok in principle, but there is a problem.
>>>
>>> lib/ethdev/rte_ethdev.h:
>>>             /** Port recovering from a hardware or firmware error.
>>>           * If PMD supports proactive error recovery,
>>>           * it should trigger this event to notify application
>>>           * that it detected an error and the recovery is being started.
>>>
>>> <<< !!!!!
>>>           * Upon receiving the event, the application should not
>>> invoke any control path API
>>>           * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until
>>> receiving
>>>           * RTE_ETH_EVENT_RECOVERY_SUCCESS or
>>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>>           * The PMD will set the data path pointers to dummy functions,
>>>           * and re-set the data path pointers to non-dummy functions
>>>           * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>>> <<< !!!!!
>>>
>>> That part is just wrong I believe.
>>> It should be:
>>> Upon receiving the event, the application should not invoke any *both
>>> control and data-path* API
>>> until receiving  RTE_ETH_EVENT_RECOVERY_SUCCESS or
>>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>> Resetting data path pointers to dummy functions by PMD *before* invoking
>>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>>> introduces a race-condition with data-path threads, as such thread
>>> could already be inside RX/TX function
>>> or can already read RX/TX function/data pointers and be about to use
>>> them.
>>
>> Current practices: the PMDs already add some delay after set Rx/Tx
>> callback to dummy, and plus the DPDK
>> worker thread is busypolling, the probability of occurence in reality
>> is zero. But in theoretically exist
>> the above race-condition.
> 
> 
> Adding delay might make a problem a bit less reproducible,
> but it doesn't fix it.
> The bug is still there.
> 
> 
>>
>>> And right now rte_ethdev layer doesn't provide any mechanism to check
>>> it or wait when they'll finish, etc.
>>
>> Yes
>>
>>>
>>> So, probably the simplest way to fix it with existing DPDK design:
>>> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return
>>> only after it ensures that *all*
>>>    application threads (and processes) stopped using either control
>>> or data-path functions for that port
>>
>> Agree
>>
>>>    (yes it means that application that wants to use this feature has
>>> to provide its own synchronization mechanism
>>>    around data-path functions (RX/TX) that it is going to use).
>>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
>>>
>>> And message to all PMD developers:
>>> *please stop updating rte_eth_fp_ops[] on your own*.
>>> That's a bad practice and it is not supposed to do things that way.
>>> There is a special API provided for these purposes:
>>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
>>
>> This two function is in private.h, so it should be expose to public
>> header file.
> 
> You mean we need to move these functions declarations into ethdev_driver.h?
> If so, then yes, I think we probably do.
> 
> 


What about making slightly different version available to drivers, which
only updates function pointers, but not  'fpo->rxq' / 'fpo->txq'.

This way driver can switch to between dummy and real burst function
without worrying Rx/Tx queue validity.

@Chengwen, @Ruifeng, can this solve the issue for relaxed memory
ordering systems?



>>>
>>> BTW,  I don't see any implementation for RTE_ETH_EVENT_ERR_RECOVERING
>>> within
>>> either testpmd or any other example apps.
>>> Am I missing something?
>>
>> Currently it just promote the event.
> 
> 
> Ok, can I suggest then to add a proper usage for into in testpmd?
> It looks really strange that we add new feature into ethdev (and 2 PMDs),
> but didn't provide any way for users to test it.
> 
>>
>>> If not, then probably it could be a good starting point - let's
>>> incorporate it inside testpmd
>>> (new forwarding engine probably) so everyone can test/try it.
>>>
>>>           * It means that the application cannot send or receive any
>>> packets
>>>           * during this period.
>>>           * @note Before the PMD reports the recovery result,
>>>           * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event
>>> again,
>>>           * because a larger error may occur during the recovery.
>>>           */
>>>          RTE_ETH_EVENT_ERR_RECOVERING,
>>>
>>>>>> It probably needs to be either deprecated or reworked.
>>>>> Looking at the commit, it does not say anything about the data
>>>>> plane functions which probably means, the error recovery is
>>>> happening within the data plane thread. What happens to other data
>>>> plane threads that are polling the same port on which the error
>>>> recovery is happening?
>>>>
>>>> The commit log says: "the PMD sets the data path pointers to dummy
>>>> functions".
>>>>
>>>> So the data plane threads will receive non-packet and send zero with
>>>> port which in error recovery.
>>>>
>>>>>
>>>>> Also, the commit log says that while the error recovery is under
>>>>> progress, the application should not call any control plane APIs. Does
>>>> that mean, the application has to check for error condition every
>>>> time it calls a control plane API?
>>>>
>>>> If application has not register event (RTE_ETH_EVENT_ERR_RECOVERING)
>>>> callback, it could calls control plane API, but it will return
>>>> failed.
>>>> If application has register above callback, it can wait for recovery
>>>> result, or direct call without wait but this will return failed.
>>>>
>>>>>
>>>>> The commit message also says that "PMD makes sure the control path
>>>>> operations failed with retcode -EBUSY". It does not say how it
>>>> does this. But, any communication from the PMD thread to control
>>>> plane thread may introduce race conditions if not done correctly.
>>>>
>>>> First there are no PMD thread, do you mean eal-intr-thread ?
>>>>
>>>> As for this question, you can see PMDs which already implement it,
>>>> they both provides mutual exclusion protection.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Would something like this work better?
>>>>>>>>
>>>>>>>> Note: there is another bug in current code. The check for link
>>>>>>>> state
>>>>>>>> interrupt and link_ops could return -ENOTSUP and leave device in
>>>>>> indeterminate state.
>>>>>>>> The check should be done before calling PMD.
>>>>>>>>
>>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
>>>>>>>> index
>>>>>>>> 0266cc82acb6..d6c163ed85e7 100644
>>>>>>>> --- a/lib/ethdev/rte_ethdev.c
>>>>>>>> +++ b/lib/ethdev/rte_ethdev.c
>>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>           return 0;
>>>>>>>>       }
>>>>>>>>
>>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0 &&
>>>>>>>> +        dev->dev_ops->link_update == NULL) {
>>>>>>>> +        RTE_ETHDEV_LOG(INFO,
>>>>>>>> +                   "Device with port_id=%"PRIu16" link update not
>>>>>> supported\n",
>>>>>>>> +                   port_id);
>>>>>>>> +            return -ENOTSUP;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>       ret = rte_eth_dev_info_get(port_id, &dev_info);
>>>>>>>>       if (ret != 0)
>>>>>>>>           return ret;
>>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>           eth_dev_mac_restore(dev, &dev_info);
>>>>>>>>
>>>>>>>>       diag = (*dev->dev_ops->dev_start)(dev);
>>>>>>>> -    if (diag == 0)
>>>>>>>> -        dev->data->dev_started = 1;
>>>>>>>> -    else
>>>>>>>> +    if (diag != 0)
>>>>>>>>           return eth_err(port_id, diag);
>>>>>>>>
>>>>>>>>       ret = eth_dev_config_restore(dev, &dev_info, port_id); @@
>>>>>>>> -1611,16
>>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>           return ret;
>>>>>>>>       }
>>>>>>>>
>>>>>>>> -    if (dev->data->dev_conf.intr_conf.lsc == 0) {
>>>>>>>> -        if (*dev->dev_ops->link_update == NULL)
>>>>>>>> -            return -ENOTSUP;
>>>>>>>> -        (*dev->dev_ops->link_update)(dev, 0);
>>>>>>>> -    }
>>>>>>>> -
>>>>>>>>       /* expose selection of PMD fast-path functions */
>>>>>>>>       eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
>>>>>>>>
>>>>>>>> +    /* ensure state is set before marking device ready */
>>>>>>>> +    rte_smp_wmb();
>>>>>>>> +
>>>>>>>>       rte_ethdev_trace_start(port_id);
>>>>>>>> +
>>>>>>>> +    /* Update current link state */
>>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0)
>>>>>>>> +        (*dev->dev_ops->link_update)(dev, 0);
>>>>>>>> +
>>>>>>>>       return 0;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>
>>>>>>>> .
>>>>>>>>
>>>>>
>
  
Chengwen Feng March 6, 2023, 1:57 a.m. UTC | #22
On 2023/3/4 1:19, Ferruh Yigit wrote:
> On 2/26/2023 5:22 PM, Konstantin Ananyev wrote:
>>
>>>>>>>>>>>> If ethdev enqueue or dequeue function is called during
>>>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting the
>>>>>>>>>>>> function pointers, but before setting the pointer to port data.
>>>>>>>>>>>> In this case the newly registered enqueue/dequeue function will
>>>>>>>>>>>> use dummy port data and end up in seg fault.
>>>>>>>>>>>>
>>>>>>>>>>>> This patch moves the updation of each data pointers before
>>>>>>>>>>>> updating corresponding function pointers.
>>>>>>>>>>>>
>>>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
>>>>>>>>>>>> structure")
>>>>>>>>>>>> Cc: stable@dpdk.org
>>>>>>>>>
>>>>>>>>> Why is something calling enqueue/dequeue when device is not fully
>>>>>>> started.
>>>>>>>>> A correctly written application would not call rx/tx burst until
>>>>>>>>> after ethdev start had finished.
>>>>>>>>
>>>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error handling
>>>>>>>> mode), when driver recover itself, the application may still invoke
>>>>>>> enqueue/dequeue API.
>>>>>>>
>>>>>>> Right now DPDK ethdev layer *does not* provide synchronization
>>>>>>> mechanisms between data-path and control-path functions.
>>>>>>> That was a deliberate deisgn choice. If we want to change that
>>>>>>> rule, then I
>>>>>>> suppose we need a community consensus for it.
>>>>>>> I think that if the driver wants to provide some sort of error
>>>>>>> recovery
>>>>>>> procedure, then it has to provide some synchronization mechanism
>>>>>>> inside it
>>>>>>> between data-path and control-path functions.
>>>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error handling
>>>>>>> mode), and following patches I wonder how it creeped in?
>>>>>>> It seems we just introduced a loophole for race condition with this
>>>>>>> approach...
>>>>>
>>>>> Could you try to describe the specific scenario of loophole ?
>>>>
>>>> Ok, as I understand the existing mechanism:
>>>>
>>>> When PMD wants to start a recovery it has to:
>>>>   - invoke  rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>>>>     That supposed to call user provided callback. After callback is
>>>> finished PMD assumes
>>>>     that user is aware that recovery is about to start and should
>>>> make some precautions.
>>>> - when recovery is finished it invokes another callback:
>>>>    RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either
>>>> can continue to
>>>>    use port or have to treat is as faulty.
>>>>
>>>> The idea is ok in principle, but there is a problem.
>>>>
>>>> lib/ethdev/rte_ethdev.h:
>>>>             /** Port recovering from a hardware or firmware error.
>>>>           * If PMD supports proactive error recovery,
>>>>           * it should trigger this event to notify application
>>>>           * that it detected an error and the recovery is being started.
>>>>
>>>> <<< !!!!!
>>>>           * Upon receiving the event, the application should not
>>>> invoke any control path API
>>>>           * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until
>>>> receiving
>>>>           * RTE_ETH_EVENT_RECOVERY_SUCCESS or
>>>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>>>           * The PMD will set the data path pointers to dummy functions,
>>>>           * and re-set the data path pointers to non-dummy functions
>>>>           * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>>>> <<< !!!!!
>>>>
>>>> That part is just wrong I believe.
>>>> It should be:
>>>> Upon receiving the event, the application should not invoke any *both
>>>> control and data-path* API
>>>> until receiving  RTE_ETH_EVENT_RECOVERY_SUCCESS or
>>>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>>> Resetting data path pointers to dummy functions by PMD *before* invoking
>>>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>>>> introduces a race-condition with data-path threads, as such thread
>>>> could already be inside RX/TX function
>>>> or can already read RX/TX function/data pointers and be about to use
>>>> them.
>>>
>>> Current practices: the PMDs already add some delay after set Rx/Tx
>>> callback to dummy, and plus the DPDK
>>> worker thread is busypolling, the probability of occurence in reality
>>> is zero. But in theoretically exist
>>> the above race-condition.
>>
>>
>> Adding delay might make a problem a bit less reproducible,
>> but it doesn't fix it.
>> The bug is still there.
>>
>>
>>>
>>>> And right now rte_ethdev layer doesn't provide any mechanism to check
>>>> it or wait when they'll finish, etc.
>>>
>>> Yes
>>>
>>>>
>>>> So, probably the simplest way to fix it with existing DPDK design:
>>>> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return
>>>> only after it ensures that *all*
>>>>    application threads (and processes) stopped using either control
>>>> or data-path functions for that port
>>>
>>> Agree
>>>
>>>>    (yes it means that application that wants to use this feature has
>>>> to provide its own synchronization mechanism
>>>>    around data-path functions (RX/TX) that it is going to use).
>>>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
>>>>
>>>> And message to all PMD developers:
>>>> *please stop updating rte_eth_fp_ops[] on your own*.
>>>> That's a bad practice and it is not supposed to do things that way.
>>>> There is a special API provided for these purposes:
>>>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
>>>
>>> This two function is in private.h, so it should be expose to public
>>> header file.
>>
>> You mean we need to move these functions declarations into ethdev_driver.h?
>> If so, then yes, I think we probably do.
>>
>>
> 
> 
> What about making slightly different version available to drivers, which
> only updates function pointers, but not  'fpo->rxq' / 'fpo->txq'.
> 
> This way driver can switch to between dummy and real burst function
> without worrying Rx/Tx queue validity.
> 
> @Chengwen, @Ruifeng, can this solve the issue for relaxed memory
> ordering systems?

For the problem described in this commit, I think it's OK for solve the RMO.

> 
> 
> 
>>>>
>>>> BTW,  I don't see any implementation for RTE_ETH_EVENT_ERR_RECOVERING
>>>> within
>>>> either testpmd or any other example apps.
>>>> Am I missing something?
>>>
>>> Currently it just promote the event.
>>
>>
>> Ok, can I suggest then to add a proper usage for into in testpmd?
>> It looks really strange that we add new feature into ethdev (and 2 PMDs),
>> but didn't provide any way for users to test it.
>>
>>>
>>>> If not, then probably it could be a good starting point - let's
>>>> incorporate it inside testpmd
>>>> (new forwarding engine probably) so everyone can test/try it.
>>>>
>>>>           * It means that the application cannot send or receive any
>>>> packets
>>>>           * during this period.
>>>>           * @note Before the PMD reports the recovery result,
>>>>           * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event
>>>> again,
>>>>           * because a larger error may occur during the recovery.
>>>>           */
>>>>          RTE_ETH_EVENT_ERR_RECOVERING,
>>>>
>>>>>>> It probably needs to be either deprecated or reworked.
>>>>>> Looking at the commit, it does not say anything about the data
>>>>>> plane functions which probably means, the error recovery is
>>>>> happening within the data plane thread. What happens to other data
>>>>> plane threads that are polling the same port on which the error
>>>>> recovery is happening?
>>>>>
>>>>> The commit log says: "the PMD sets the data path pointers to dummy
>>>>> functions".
>>>>>
>>>>> So the data plane threads will receive non-packet and send zero with
>>>>> port which in error recovery.
>>>>>
>>>>>>
>>>>>> Also, the commit log says that while the error recovery is under
>>>>>> progress, the application should not call any control plane APIs. Does
>>>>> that mean, the application has to check for error condition every
>>>>> time it calls a control plane API?
>>>>>
>>>>> If application has not register event (RTE_ETH_EVENT_ERR_RECOVERING)
>>>>> callback, it could calls control plane API, but it will return
>>>>> failed.
>>>>> If application has register above callback, it can wait for recovery
>>>>> result, or direct call without wait but this will return failed.
>>>>>
>>>>>>
>>>>>> The commit message also says that "PMD makes sure the control path
>>>>>> operations failed with retcode -EBUSY". It does not say how it
>>>>> does this. But, any communication from the PMD thread to control
>>>>> plane thread may introduce race conditions if not done correctly.
>>>>>
>>>>> First there are no PMD thread, do you mean eal-intr-thread ?
>>>>>
>>>>> As for this question, you can see PMDs which already implement it,
>>>>> they both provides mutual exclusion protection.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Would something like this work better?
>>>>>>>>>
>>>>>>>>> Note: there is another bug in current code. The check for link
>>>>>>>>> state
>>>>>>>>> interrupt and link_ops could return -ENOTSUP and leave device in
>>>>>>> indeterminate state.
>>>>>>>>> The check should be done before calling PMD.
>>>>>>>>>
>>>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
>>>>>>>>> index
>>>>>>>>> 0266cc82acb6..d6c163ed85e7 100644
>>>>>>>>> --- a/lib/ethdev/rte_ethdev.c
>>>>>>>>> +++ b/lib/ethdev/rte_ethdev.c
>>>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>>           return 0;
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0 &&
>>>>>>>>> +        dev->dev_ops->link_update == NULL) {
>>>>>>>>> +        RTE_ETHDEV_LOG(INFO,
>>>>>>>>> +                   "Device with port_id=%"PRIu16" link update not
>>>>>>> supported\n",
>>>>>>>>> +                   port_id);
>>>>>>>>> +            return -ENOTSUP;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>>       ret = rte_eth_dev_info_get(port_id, &dev_info);
>>>>>>>>>       if (ret != 0)
>>>>>>>>>           return ret;
>>>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>>           eth_dev_mac_restore(dev, &dev_info);
>>>>>>>>>
>>>>>>>>>       diag = (*dev->dev_ops->dev_start)(dev);
>>>>>>>>> -    if (diag == 0)
>>>>>>>>> -        dev->data->dev_started = 1;
>>>>>>>>> -    else
>>>>>>>>> +    if (diag != 0)
>>>>>>>>>           return eth_err(port_id, diag);
>>>>>>>>>
>>>>>>>>>       ret = eth_dev_config_restore(dev, &dev_info, port_id); @@
>>>>>>>>> -1611,16
>>>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>>           return ret;
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>> -    if (dev->data->dev_conf.intr_conf.lsc == 0) {
>>>>>>>>> -        if (*dev->dev_ops->link_update == NULL)
>>>>>>>>> -            return -ENOTSUP;
>>>>>>>>> -        (*dev->dev_ops->link_update)(dev, 0);
>>>>>>>>> -    }
>>>>>>>>> -
>>>>>>>>>       /* expose selection of PMD fast-path functions */
>>>>>>>>>       eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
>>>>>>>>>
>>>>>>>>> +    /* ensure state is set before marking device ready */
>>>>>>>>> +    rte_smp_wmb();
>>>>>>>>> +
>>>>>>>>>       rte_ethdev_trace_start(port_id);
>>>>>>>>> +
>>>>>>>>> +    /* Update current link state */
>>>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0)
>>>>>>>>> +        (*dev->dev_ops->link_update)(dev, 0);
>>>>>>>>> +
>>>>>>>>>       return 0;
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> .
>>>>>>>>>
>>>>>>
>>
> 
> .
>
  
Ruifeng Wang March 6, 2023, 6:13 a.m. UTC | #23
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@amd.com>
> Sent: Saturday, March 4, 2023 1:19 AM
> To: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>; dev@dpdk.org; fengchengwen
> <fengchengwen@huawei.com>; Konstantin Ananyev <konstantin.ananyev@huawei.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Stephen Hemminger <stephen@networkplumber.org>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; Ajit Khaparde (ajit.khaparde@broadcom.com)
> <ajit.khaparde@broadcom.com>
> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> 
> On 2/26/2023 5:22 PM, Konstantin Ananyev wrote:
> >
> >>>>>>>>>>> If ethdev enqueue or dequeue function is called during
> >>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> >>>>>>>>>>> the function pointers, but before setting the pointer to port data.
> >>>>>>>>>>> In this case the newly registered enqueue/dequeue function
> >>>>>>>>>>> will use dummy port data and end up in seg fault.
> >>>>>>>>>>>
> >>>>>>>>>>> This patch moves the updation of each data pointers before
> >>>>>>>>>>> updating corresponding function pointers.
> >>>>>>>>>>>
> >>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into
> >>>>>>>>>>> separate
> >>>>>>>>>>> structure")
> >>>>>>>>>>> Cc: stable@dpdk.org
> >>>>>>>>
> >>>>>>>> Why is something calling enqueue/dequeue when device is not
> >>>>>>>> fully
> >>>>>> started.
> >>>>>>>> A correctly written application would not call rx/tx burst
> >>>>>>>> until after ethdev start had finished.
> >>>>>>>
> >>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error
> >>>>>>> handling mode), when driver recover itself, the application may
> >>>>>>> still invoke
> >>>>>> enqueue/dequeue API.
> >>>>>>
> >>>>>> Right now DPDK ethdev layer *does not* provide synchronization
> >>>>>> mechanisms between data-path and control-path functions.
> >>>>>> That was a deliberate deisgn choice. If we want to change that
> >>>>>> rule, then I suppose we need a community consensus for it.
> >>>>>> I think that if the driver wants to provide some sort of error
> >>>>>> recovery procedure, then it has to provide some synchronization
> >>>>>> mechanism inside it between data-path and control-path functions.
> >>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error
> >>>>>> handling mode), and following patches I wonder how it creeped in?
> >>>>>> It seems we just introduced a loophole for race condition with
> >>>>>> this approach...
> >>>>
> >>>> Could you try to describe the specific scenario of loophole ?
> >>>
> >>> Ok, as I understand the existing mechanism:
> >>>
> >>> When PMD wants to start a recovery it has to:
> >>>   - invoke
> >>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> >>>     That supposed to call user provided callback. After callback is
> >>> finished PMD assumes
> >>>     that user is aware that recovery is about to start and should
> >>> make some precautions.
> >>> - when recovery is finished it invokes another callback:
> >>>    RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either
> >>> can continue to
> >>>    use port or have to treat is as faulty.
> >>>
> >>> The idea is ok in principle, but there is a problem.
> >>>
> >>> lib/ethdev/rte_ethdev.h:
> >>>             /** Port recovering from a hardware or firmware error.
> >>>           * If PMD supports proactive error recovery,
> >>>           * it should trigger this event to notify application
> >>>           * that it detected an error and the recovery is being started.
> >>>
> >>> <<< !!!!!
> >>>           * Upon receiving the event, the application should not
> >>> invoke any control path API
> >>>           * (such as rte_eth_dev_configure/rte_eth_dev_stop...)
> >>> until receiving
> >>>           * RTE_ETH_EVENT_RECOVERY_SUCCESS or
> >>> RTE_ETH_EVENT_RECOVERY_FAILED event.
> >>>           * The PMD will set the data path pointers to dummy
> >>> functions,
> >>>           * and re-set the data path pointers to non-dummy functions
> >>>           * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> >>> <<< !!!!!
> >>>
> >>> That part is just wrong I believe.
> >>> It should be:
> >>> Upon receiving the event, the application should not invoke any
> >>> *both control and data-path* API until receiving
> >>> RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
> >>> event.
> >>> Resetting data path pointers to dummy functions by PMD *before*
> >>> invoking rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> >>> introduces a race-condition with data-path threads, as such thread
> >>> could already be inside RX/TX function or can already read RX/TX
> >>> function/data pointers and be about to use them.
> >>
> >> Current practices: the PMDs already add some delay after set Rx/Tx
> >> callback to dummy, and plus the DPDK worker thread is busypolling,
> >> the probability of occurence in reality is zero. But in theoretically
> >> exist the above race-condition.
> >
> >
> > Adding delay might make a problem a bit less reproducible, but it
> > doesn't fix it.
> > The bug is still there.
> >
> >
> >>
> >>> And right now rte_ethdev layer doesn't provide any mechanism to
> >>> check it or wait when they'll finish, etc.
> >>
> >> Yes
> >>
> >>>
> >>> So, probably the simplest way to fix it with existing DPDK design:
> >>> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return
> >>> only after it ensures that *all*
> >>>    application threads (and processes) stopped using either control
> >>> or data-path functions for that port
> >>
> >> Agree
> >>
> >>>    (yes it means that application that wants to use this feature has
> >>> to provide its own synchronization mechanism
> >>>    around data-path functions (RX/TX) that it is going to use).
> >>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
> >>>
> >>> And message to all PMD developers:
> >>> *please stop updating rte_eth_fp_ops[] on your own*.
> >>> That's a bad practice and it is not supposed to do things that way.
> >>> There is a special API provided for these purposes:
> >>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
> >>
> >> This two function is in private.h, so it should be expose to public
> >> header file.
> >
> > You mean we need to move these functions declarations into ethdev_driver.h?
> > If so, then yes, I think we probably do.
> >
> >
> 
> 
> What about making slightly different version available to drivers, which only updates
> function pointers, but not  'fpo->rxq' / 'fpo->txq'.
> 
> This way driver can switch to between dummy and real burst function without worrying Rx/Tx
> queue validity.
> 
> @Chengwen, @Ruifeng, can this solve the issue for relaxed memory ordering systems?

Yes, updating only function pointers removes the synchronization requirement between function
pointer and qdata. 

> 
> 
> 
> >>>
> >>> BTW,  I don't see any implementation for
> >>> RTE_ETH_EVENT_ERR_RECOVERING within either testpmd or any other
> >>> example apps.
> >>> Am I missing something?
> >>
> >> Currently it just promote the event.
> >
> >
> > Ok, can I suggest then to add a proper usage for into in testpmd?
> > It looks really strange that we add new feature into ethdev (and 2
> > PMDs), but didn't provide any way for users to test it.
> >
> >>
> >>> If not, then probably it could be a good starting point - let's
> >>> incorporate it inside testpmd (new forwarding engine probably) so
> >>> everyone can test/try it.
> >>>
> >>>           * It means that the application cannot send or receive any
> >>> packets
> >>>           * during this period.
> >>>           * @note Before the PMD reports the recovery result,
> >>>           * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING
> >>> event again,
> >>>           * because a larger error may occur during the recovery.
> >>>           */
> >>>          RTE_ETH_EVENT_ERR_RECOVERING,
> >>>
> >>>>>> It probably needs to be either deprecated or reworked.
> >>>>> Looking at the commit, it does not say anything about the data
> >>>>> plane functions which probably means, the error recovery is
> >>>> happening within the data plane thread. What happens to other data
> >>>> plane threads that are polling the same port on which the error
> >>>> recovery is happening?
> >>>>
> >>>> The commit log says: "the PMD sets the data path pointers to dummy
> >>>> functions".
> >>>>
> >>>> So the data plane threads will receive non-packet and send zero
> >>>> with port which in error recovery.
> >>>>
> >>>>>
> >>>>> Also, the commit log says that while the error recovery is under
> >>>>> progress, the application should not call any control plane APIs.
> >>>>> Does
> >>>> that mean, the application has to check for error condition every
> >>>> time it calls a control plane API?
> >>>>
> >>>> If application has not register event
> >>>> (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control
> >>>> plane API, but it will return failed.
> >>>> If application has register above callback, it can wait for
> >>>> recovery result, or direct call without wait but this will return failed.
> >>>>
> >>>>>
> >>>>> The commit message also says that "PMD makes sure the control path
> >>>>> operations failed with retcode -EBUSY". It does not say how it
> >>>> does this. But, any communication from the PMD thread to control
> >>>> plane thread may introduce race conditions if not done correctly.
> >>>>
> >>>> First there are no PMD thread, do you mean eal-intr-thread ?
> >>>>
> >>>> As for this question, you can see PMDs which already implement it,
> >>>> they both provides mutual exclusion protection.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Would something like this work better?
> >>>>>>>>
> >>>>>>>> Note: there is another bug in current code. The check for link
> >>>>>>>> state interrupt and link_ops could return -ENOTSUP and leave
> >>>>>>>> device in
> >>>>>> indeterminate state.
> >>>>>>>> The check should be done before calling PMD.
> >>>>>>>>
> >>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> >>>>>>>> index
> >>>>>>>> 0266cc82acb6..d6c163ed85e7 100644
> >>>>>>>> --- a/lib/ethdev/rte_ethdev.c
> >>>>>>>> +++ b/lib/ethdev/rte_ethdev.c
> >>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> >>>>>>>>           return 0;
> >>>>>>>>       }
> >>>>>>>>
> >>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> >>>>>>>> +        dev->dev_ops->link_update == NULL) {
> >>>>>>>> +        RTE_ETHDEV_LOG(INFO,
> >>>>>>>> +                   "Device with port_id=%"PRIu16" link update
> >>>>>>>> +not
> >>>>>> supported\n",
> >>>>>>>> +                   port_id);
> >>>>>>>> +            return -ENOTSUP;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>>       ret = rte_eth_dev_info_get(port_id, &dev_info);
> >>>>>>>>       if (ret != 0)
> >>>>>>>>           return ret;
> >>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> >>>>>>>>           eth_dev_mac_restore(dev, &dev_info);
> >>>>>>>>
> >>>>>>>>       diag = (*dev->dev_ops->dev_start)(dev);
> >>>>>>>> -    if (diag == 0)
> >>>>>>>> -        dev->data->dev_started = 1;
> >>>>>>>> -    else
> >>>>>>>> +    if (diag != 0)
> >>>>>>>>           return eth_err(port_id, diag);
> >>>>>>>>
> >>>>>>>>       ret = eth_dev_config_restore(dev, &dev_info, port_id); @@
> >>>>>>>> -1611,16
> >>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> >>>>>>>>           return ret;
> >>>>>>>>       }
> >>>>>>>>
> >>>>>>>> -    if (dev->data->dev_conf.intr_conf.lsc == 0) {
> >>>>>>>> -        if (*dev->dev_ops->link_update == NULL)
> >>>>>>>> -            return -ENOTSUP;
> >>>>>>>> -        (*dev->dev_ops->link_update)(dev, 0);
> >>>>>>>> -    }
> >>>>>>>> -
> >>>>>>>>       /* expose selection of PMD fast-path functions */
> >>>>>>>>       eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> >>>>>>>>
> >>>>>>>> +    /* ensure state is set before marking device ready */
> >>>>>>>> +    rte_smp_wmb();
> >>>>>>>> +
> >>>>>>>>       rte_ethdev_trace_start(port_id);
> >>>>>>>> +
> >>>>>>>> +    /* Update current link state */
> >>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0)
> >>>>>>>> +        (*dev->dev_ops->link_update)(dev, 0);
> >>>>>>>> +
> >>>>>>>>       return 0;
> >>>>>>>>   }
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> .
> >>>>>>>>
> >>>>>
> >
  
Konstantin Ananyev March 6, 2023, 10:32 a.m. UTC | #24
> > -----Original Message-----
> > From: Ferruh Yigit <ferruh.yigit@amd.com>
> > Sent: Saturday, March 4, 2023 1:19 AM
> > To: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>; dev@dpdk.org; fengchengwen
> > <fengchengwen@huawei.com>; Konstantin Ananyev <konstantin.ananyev@huawei.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; Stephen Hemminger <stephen@networkplumber.org>;
> > Ruifeng Wang <Ruifeng.Wang@arm.com>; Ajit Khaparde (ajit.khaparde@broadcom.com)
> > <ajit.khaparde@broadcom.com>
> > Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> >
> > On 2/26/2023 5:22 PM, Konstantin Ananyev wrote:
> > >
> > >>>>>>>>>>> If ethdev enqueue or dequeue function is called during
> > >>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> > >>>>>>>>>>> the function pointers, but before setting the pointer to port data.
> > >>>>>>>>>>> In this case the newly registered enqueue/dequeue function
> > >>>>>>>>>>> will use dummy port data and end up in seg fault.
> > >>>>>>>>>>>
> > >>>>>>>>>>> This patch moves the updation of each data pointers before
> > >>>>>>>>>>> updating corresponding function pointers.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into
> > >>>>>>>>>>> separate
> > >>>>>>>>>>> structure")
> > >>>>>>>>>>> Cc: stable@dpdk.org
> > >>>>>>>>
> > >>>>>>>> Why is something calling enqueue/dequeue when device is not
> > >>>>>>>> fully
> > >>>>>> started.
> > >>>>>>>> A correctly written application would not call rx/tx burst
> > >>>>>>>> until after ethdev start had finished.
> > >>>>>>>
> > >>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error
> > >>>>>>> handling mode), when driver recover itself, the application may
> > >>>>>>> still invoke
> > >>>>>> enqueue/dequeue API.
> > >>>>>>
> > >>>>>> Right now DPDK ethdev layer *does not* provide synchronization
> > >>>>>> mechanisms between data-path and control-path functions.
> > >>>>>> That was a deliberate deisgn choice. If we want to change that
> > >>>>>> rule, then I suppose we need a community consensus for it.
> > >>>>>> I think that if the driver wants to provide some sort of error
> > >>>>>> recovery procedure, then it has to provide some synchronization
> > >>>>>> mechanism inside it between data-path and control-path functions.
> > >>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error
> > >>>>>> handling mode), and following patches I wonder how it creeped in?
> > >>>>>> It seems we just introduced a loophole for race condition with
> > >>>>>> this approach...
> > >>>>
> > >>>> Could you try to describe the specific scenario of loophole ?
> > >>>
> > >>> Ok, as I understand the existing mechanism:
> > >>>
> > >>> When PMD wants to start a recovery it has to:
> > >>>   - invoke
> > >>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> > >>>     That supposed to call user provided callback. After callback is
> > >>> finished PMD assumes
> > >>>     that user is aware that recovery is about to start and should
> > >>> make some precautions.
> > >>> - when recovery is finished it invokes another callback:
> > >>>    RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either
> > >>> can continue to
> > >>>    use port or have to treat is as faulty.
> > >>>
> > >>> The idea is ok in principle, but there is a problem.
> > >>>
> > >>> lib/ethdev/rte_ethdev.h:
> > >>>             /** Port recovering from a hardware or firmware error.
> > >>>           * If PMD supports proactive error recovery,
> > >>>           * it should trigger this event to notify application
> > >>>           * that it detected an error and the recovery is being started.
> > >>>
> > >>> <<< !!!!!
> > >>>           * Upon receiving the event, the application should not
> > >>> invoke any control path API
> > >>>           * (such as rte_eth_dev_configure/rte_eth_dev_stop...)
> > >>> until receiving
> > >>>           * RTE_ETH_EVENT_RECOVERY_SUCCESS or
> > >>> RTE_ETH_EVENT_RECOVERY_FAILED event.
> > >>>           * The PMD will set the data path pointers to dummy
> > >>> functions,
> > >>>           * and re-set the data path pointers to non-dummy functions
> > >>>           * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> > >>> <<< !!!!!
> > >>>
> > >>> That part is just wrong I believe.
> > >>> It should be:
> > >>> Upon receiving the event, the application should not invoke any
> > >>> *both control and data-path* API until receiving
> > >>> RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
> > >>> event.
> > >>> Resetting data path pointers to dummy functions by PMD *before*
> > >>> invoking rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> > >>> introduces a race-condition with data-path threads, as such thread
> > >>> could already be inside RX/TX function or can already read RX/TX
> > >>> function/data pointers and be about to use them.
> > >>
> > >> Current practices: the PMDs already add some delay after set Rx/Tx
> > >> callback to dummy, and plus the DPDK worker thread is busypolling,
> > >> the probability of occurence in reality is zero. But in theoretically
> > >> exist the above race-condition.
> > >
> > >
> > > Adding delay might make a problem a bit less reproducible, but it
> > > doesn't fix it.
> > > The bug is still there.
> > >
> > >
> > >>
> > >>> And right now rte_ethdev layer doesn't provide any mechanism to
> > >>> check it or wait when they'll finish, etc.
> > >>
> > >> Yes
> > >>
> > >>>
> > >>> So, probably the simplest way to fix it with existing DPDK design:
> > >>> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return
> > >>> only after it ensures that *all*
> > >>>    application threads (and processes) stopped using either control
> > >>> or data-path functions for that port
> > >>
> > >> Agree
> > >>
> > >>>    (yes it means that application that wants to use this feature has
> > >>> to provide its own synchronization mechanism
> > >>>    around data-path functions (RX/TX) that it is going to use).
> > >>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
> > >>>
> > >>> And message to all PMD developers:
> > >>> *please stop updating rte_eth_fp_ops[] on your own*.
> > >>> That's a bad practice and it is not supposed to do things that way.
> > >>> There is a special API provided for these purposes:
> > >>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
> > >>
> > >> This two function is in private.h, so it should be expose to public
> > >> header file.
> > >
> > > You mean we need to move these functions declarations into ethdev_driver.h?
> > > If so, then yes, I think we probably do.
> > >
> > >
> >
> >
> > What about making slightly different version available to drivers, which only updates
> > function pointers, but not  'fpo->rxq' / 'fpo->txq'.
> >
> > This way driver can switch to between dummy and real burst function without worrying Rx/Tx
> > queue validity.
> >
> > @Chengwen, @Ruifeng, can this solve the issue for relaxed memory ordering systems?
> 
> Yes, updating only function pointers removes the synchronization requirement between function
> pointer and qdata.

Lads, that wouldn't work anyway.
The race between recovery procedure and data-path persists:
Recovery still has no idea is at given moment any thread doing RX/TX or not, and there is no
way for it to know when such thread will finish.
We do need some synchronization mechanism between control(recovery) and data-path threads.
I believe it is unavoidable.   

> >
> >
> >
> > >>>
> > >>> BTW,  I don't see any implementation for
> > >>> RTE_ETH_EVENT_ERR_RECOVERING within either testpmd or any other
> > >>> example apps.
> > >>> Am I missing something?
> > >>
> > >> Currently it just promote the event.
> > >
> > >
> > > Ok, can I suggest then to add a proper usage for into in testpmd?
> > > It looks really strange that we add new feature into ethdev (and 2
> > > PMDs), but didn't provide any way for users to test it.
> > >
> > >>
> > >>> If not, then probably it could be a good starting point - let's
> > >>> incorporate it inside testpmd (new forwarding engine probably) so
> > >>> everyone can test/try it.
> > >>>
> > >>>           * It means that the application cannot send or receive any
> > >>> packets
> > >>>           * during this period.
> > >>>           * @note Before the PMD reports the recovery result,
> > >>>           * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING
> > >>> event again,
> > >>>           * because a larger error may occur during the recovery.
> > >>>           */
> > >>>          RTE_ETH_EVENT_ERR_RECOVERING,
> > >>>
> > >>>>>> It probably needs to be either deprecated or reworked.
> > >>>>> Looking at the commit, it does not say anything about the data
> > >>>>> plane functions which probably means, the error recovery is
> > >>>> happening within the data plane thread. What happens to other data
> > >>>> plane threads that are polling the same port on which the error
> > >>>> recovery is happening?
> > >>>>
> > >>>> The commit log says: "the PMD sets the data path pointers to dummy
> > >>>> functions".
> > >>>>
> > >>>> So the data plane threads will receive non-packet and send zero
> > >>>> with port which in error recovery.
> > >>>>
> > >>>>>
> > >>>>> Also, the commit log says that while the error recovery is under
> > >>>>> progress, the application should not call any control plane APIs.
> > >>>>> Does
> > >>>> that mean, the application has to check for error condition every
> > >>>> time it calls a control plane API?
> > >>>>
> > >>>> If application has not register event
> > >>>> (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control
> > >>>> plane API, but it will return failed.
> > >>>> If application has register above callback, it can wait for
> > >>>> recovery result, or direct call without wait but this will return failed.
> > >>>>
> > >>>>>
> > >>>>> The commit message also says that "PMD makes sure the control path
> > >>>>> operations failed with retcode -EBUSY". It does not say how it
> > >>>> does this. But, any communication from the PMD thread to control
> > >>>> plane thread may introduce race conditions if not done correctly.
> > >>>>
> > >>>> First there are no PMD thread, do you mean eal-intr-thread ?
> > >>>>
> > >>>> As for this question, you can see PMDs which already implement it,
> > >>>> they both provides mutual exclusion protection.
> > >>>>
> > >>>>>
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>> Would something like this work better?
> > >>>>>>>>
> > >>>>>>>> Note: there is another bug in current code. The check for link
> > >>>>>>>> state interrupt and link_ops could return -ENOTSUP and leave
> > >>>>>>>> device in
> > >>>>>> indeterminate state.
> > >>>>>>>> The check should be done before calling PMD.
> > >>>>>>>>
> > >>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > >>>>>>>> index
> > >>>>>>>> 0266cc82acb6..d6c163ed85e7 100644
> > >>>>>>>> --- a/lib/ethdev/rte_ethdev.c
> > >>>>>>>> +++ b/lib/ethdev/rte_ethdev.c
> > >>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> > >>>>>>>>           return 0;
> > >>>>>>>>       }
> > >>>>>>>>
> > >>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> > >>>>>>>> +        dev->dev_ops->link_update == NULL) {
> > >>>>>>>> +        RTE_ETHDEV_LOG(INFO,
> > >>>>>>>> +                   "Device with port_id=%"PRIu16" link update
> > >>>>>>>> +not
> > >>>>>> supported\n",
> > >>>>>>>> +                   port_id);
> > >>>>>>>> +            return -ENOTSUP;
> > >>>>>>>> +    }
> > >>>>>>>> +
> > >>>>>>>>       ret = rte_eth_dev_info_get(port_id, &dev_info);
> > >>>>>>>>       if (ret != 0)
> > >>>>>>>>           return ret;
> > >>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> > >>>>>>>>           eth_dev_mac_restore(dev, &dev_info);
> > >>>>>>>>
> > >>>>>>>>       diag = (*dev->dev_ops->dev_start)(dev);
> > >>>>>>>> -    if (diag == 0)
> > >>>>>>>> -        dev->data->dev_started = 1;
> > >>>>>>>> -    else
> > >>>>>>>> +    if (diag != 0)
> > >>>>>>>>           return eth_err(port_id, diag);
> > >>>>>>>>
> > >>>>>>>>       ret = eth_dev_config_restore(dev, &dev_info, port_id); @@
> > >>>>>>>> -1611,16
> > >>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> > >>>>>>>>           return ret;
> > >>>>>>>>       }
> > >>>>>>>>
> > >>>>>>>> -    if (dev->data->dev_conf.intr_conf.lsc == 0) {
> > >>>>>>>> -        if (*dev->dev_ops->link_update == NULL)
> > >>>>>>>> -            return -ENOTSUP;
> > >>>>>>>> -        (*dev->dev_ops->link_update)(dev, 0);
> > >>>>>>>> -    }
> > >>>>>>>> -
> > >>>>>>>>       /* expose selection of PMD fast-path functions */
> > >>>>>>>>       eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> > >>>>>>>>
> > >>>>>>>> +    /* ensure state is set before marking device ready */
> > >>>>>>>> +    rte_smp_wmb();
> > >>>>>>>> +
> > >>>>>>>>       rte_ethdev_trace_start(port_id);
> > >>>>>>>> +
> > >>>>>>>> +    /* Update current link state */
> > >>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0)
> > >>>>>>>> +        (*dev->dev_ops->link_update)(dev, 0);
> > >>>>>>>> +
> > >>>>>>>>       return 0;
> > >>>>>>>>   }
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> .
> > >>>>>>>>
> > >>>>>
> > >
  
Ajit Khaparde March 6, 2023, 11:17 a.m. UTC | #25
On Mon, Mar 6, 2023 at 2:33 AM Konstantin Ananyev
<konstantin.ananyev@huawei.com> wrote:
>
>
>
> > > -----Original Message-----
> > > From: Ferruh Yigit <ferruh.yigit@amd.com>
> > > Sent: Saturday, March 4, 2023 1:19 AM
> > > To: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>; dev@dpdk.org; fengchengwen
> > > <fengchengwen@huawei.com>; Konstantin Ananyev <konstantin.ananyev@huawei.com>; Honnappa
> > > Nagarahalli <Honnappa.Nagarahalli@arm.com>; Stephen Hemminger <stephen@networkplumber.org>;
> > > Ruifeng Wang <Ruifeng.Wang@arm.com>; Ajit Khaparde (ajit.khaparde@broadcom.com)
> > > <ajit.khaparde@broadcom.com>
> > > Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> > >
> > > On 2/26/2023 5:22 PM, Konstantin Ananyev wrote:
> > > >
> > > >>>>>>>>>>> If ethdev enqueue or dequeue function is called during
> > > >>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> > > >>>>>>>>>>> the function pointers, but before setting the pointer to port data.
> > > >>>>>>>>>>> In this case the newly registered enqueue/dequeue function
> > > >>>>>>>>>>> will use dummy port data and end up in seg fault.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> This patch moves the updation of each data pointers before
> > > >>>>>>>>>>> updating corresponding function pointers.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into
> > > >>>>>>>>>>> separate
> > > >>>>>>>>>>> structure")
> > > >>>>>>>>>>> Cc: stable@dpdk.org
> > > >>>>>>>>
> > > >>>>>>>> Why is something calling enqueue/dequeue when device is not
> > > >>>>>>>> fully
> > > >>>>>> started.
> > > >>>>>>>> A correctly written application would not call rx/tx burst
> > > >>>>>>>> until after ethdev start had finished.
> > > >>>>>>>
> > > >>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error
> > > >>>>>>> handling mode), when driver recover itself, the application may
> > > >>>>>>> still invoke
> > > >>>>>> enqueue/dequeue API.
> > > >>>>>>
> > > >>>>>> Right now DPDK ethdev layer *does not* provide synchronization
> > > >>>>>> mechanisms between data-path and control-path functions.
> > > >>>>>> That was a deliberate deisgn choice. If we want to change that
> > > >>>>>> rule, then I suppose we need a community consensus for it.
> > > >>>>>> I think that if the driver wants to provide some sort of error
> > > >>>>>> recovery procedure, then it has to provide some synchronization
> > > >>>>>> mechanism inside it between data-path and control-path functions.
> > > >>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error
> > > >>>>>> handling mode), and following patches I wonder how it creeped in?
> > > >>>>>> It seems we just introduced a loophole for race condition with
> > > >>>>>> this approach...
> > > >>>>
> > > >>>> Could you try to describe the specific scenario of loophole ?
> > > >>>
> > > >>> Ok, as I understand the existing mechanism:
> > > >>>
> > > >>> When PMD wants to start a recovery it has to:
> > > >>>   - invoke
> > > >>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> > > >>>     That supposed to call user provided callback. After callback is
> > > >>> finished PMD assumes
> > > >>>     that user is aware that recovery is about to start and should
> > > >>> make some precautions.
> > > >>> - when recovery is finished it invokes another callback:
> > > >>>    RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either
> > > >>> can continue to
> > > >>>    use port or have to treat is as faulty.
> > > >>>
> > > >>> The idea is ok in principle, but there is a problem.
> > > >>>
> > > >>> lib/ethdev/rte_ethdev.h:
> > > >>>             /** Port recovering from a hardware or firmware error.
> > > >>>           * If PMD supports proactive error recovery,
> > > >>>           * it should trigger this event to notify application
> > > >>>           * that it detected an error and the recovery is being started.
> > > >>>
> > > >>> <<< !!!!!
> > > >>>           * Upon receiving the event, the application should not
> > > >>> invoke any control path API
> > > >>>           * (such as rte_eth_dev_configure/rte_eth_dev_stop...)
> > > >>> until receiving
> > > >>>           * RTE_ETH_EVENT_RECOVERY_SUCCESS or
> > > >>> RTE_ETH_EVENT_RECOVERY_FAILED event.
> > > >>>           * The PMD will set the data path pointers to dummy
> > > >>> functions,
> > > >>>           * and re-set the data path pointers to non-dummy functions
> > > >>>           * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> > > >>> <<< !!!!!
> > > >>>
> > > >>> That part is just wrong I believe.
> > > >>> It should be:
> > > >>> Upon receiving the event, the application should not invoke any
> > > >>> *both control and data-path* API until receiving
> > > >>> RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
> > > >>> event.
> > > >>> Resetting data path pointers to dummy functions by PMD *before*
> > > >>> invoking rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> > > >>> introduces a race-condition with data-path threads, as such thread
> > > >>> could already be inside RX/TX function or can already read RX/TX
> > > >>> function/data pointers and be about to use them.
> > > >>
> > > >> Current practices: the PMDs already add some delay after set Rx/Tx
> > > >> callback to dummy, and plus the DPDK worker thread is busypolling,
> > > >> the probability of occurence in reality is zero. But in theoretically
> > > >> exist the above race-condition.
> > > >
> > > >
> > > > Adding delay might make a problem a bit less reproducible, but it
> > > > doesn't fix it.
> > > > The bug is still there.
> > > >
> > > >
> > > >>
> > > >>> And right now rte_ethdev layer doesn't provide any mechanism to
> > > >>> check it or wait when they'll finish, etc.
> > > >>
> > > >> Yes
> > > >>
> > > >>>
> > > >>> So, probably the simplest way to fix it with existing DPDK design:
> > > >>> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return
> > > >>> only after it ensures that *all*
> > > >>>    application threads (and processes) stopped using either control
> > > >>> or data-path functions for that port
> > > >>
> > > >> Agree
> > > >>
> > > >>>    (yes it means that application that wants to use this feature has
> > > >>> to provide its own synchronization mechanism
> > > >>>    around data-path functions (RX/TX) that it is going to use).
> > > >>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
> > > >>>
> > > >>> And message to all PMD developers:
> > > >>> *please stop updating rte_eth_fp_ops[] on your own*.
> > > >>> That's a bad practice and it is not supposed to do things that way.
> > > >>> There is a special API provided for these purposes:
> > > >>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
> > > >>
> > > >> This two function is in private.h, so it should be expose to public
> > > >> header file.
> > > >
> > > > You mean we need to move these functions declarations into ethdev_driver.h?
> > > > If so, then yes, I think we probably do.
> > > >
> > > >
> > >
> > >
> > > What about making slightly different version available to drivers, which only updates
> > > function pointers, but not  'fpo->rxq' / 'fpo->txq'.
> > >
> > > This way driver can switch to between dummy and real burst function without worrying Rx/Tx
> > > queue validity.
> > >
> > > @Chengwen, @Ruifeng, can this solve the issue for relaxed memory ordering systems?
> >
> > Yes, updating only function pointers removes the synchronization requirement between function
> > pointer and qdata.
>
> Lads, that wouldn't work anyway.
> The race between recovery procedure and data-path persists:
> Recovery still has no idea is at given moment any thread doing RX/TX or not, and there is no
> way for it to know when such thread will finish.
> We do need some synchronization mechanism between control(recovery) and data-path threads.
> I believe it is unavoidable.
+1

>
> > >
> > >
> > >
> > > >>>
> > > >>> BTW,  I don't see any implementation for
> > > >>> RTE_ETH_EVENT_ERR_RECOVERING within either testpmd or any other
> > > >>> example apps.
> > > >>> Am I missing something?
> > > >>
> > > >> Currently it just promote the event.
> > > >
> > > >
> > > > Ok, can I suggest then to add a proper usage for into in testpmd?
> > > > It looks really strange that we add new feature into ethdev (and 2
> > > > PMDs), but didn't provide any way for users to test it.
> > > >
> > > >>
> > > >>> If not, then probably it could be a good starting point - let's
> > > >>> incorporate it inside testpmd (new forwarding engine probably) so
> > > >>> everyone can test/try it.
> > > >>>
> > > >>>           * It means that the application cannot send or receive any
> > > >>> packets
> > > >>>           * during this period.
> > > >>>           * @note Before the PMD reports the recovery result,
> > > >>>           * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING
> > > >>> event again,
> > > >>>           * because a larger error may occur during the recovery.
> > > >>>           */
> > > >>>          RTE_ETH_EVENT_ERR_RECOVERING,
> > > >>>
> > > >>>>>> It probably needs to be either deprecated or reworked.
> > > >>>>> Looking at the commit, it does not say anything about the data
> > > >>>>> plane functions which probably means, the error recovery is
> > > >>>> happening within the data plane thread. What happens to other data
> > > >>>> plane threads that are polling the same port on which the error
> > > >>>> recovery is happening?
> > > >>>>
> > > >>>> The commit log says: "the PMD sets the data path pointers to dummy
> > > >>>> functions".
> > > >>>>
> > > >>>> So the data plane threads will receive non-packet and send zero
> > > >>>> with port which in error recovery.
> > > >>>>
> > > >>>>>
> > > >>>>> Also, the commit log says that while the error recovery is under
> > > >>>>> progress, the application should not call any control plane APIs.
> > > >>>>> Does
> > > >>>> that mean, the application has to check for error condition every
> > > >>>> time it calls a control plane API?
> > > >>>>
> > > >>>> If application has not register event
> > > >>>> (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control
> > > >>>> plane API, but it will return failed.
> > > >>>> If application has register above callback, it can wait for
> > > >>>> recovery result, or direct call without wait but this will return failed.
> > > >>>>
> > > >>>>>
> > > >>>>> The commit message also says that "PMD makes sure the control path
> > > >>>>> operations failed with retcode -EBUSY". It does not say how it
> > > >>>> does this. But, any communication from the PMD thread to control
> > > >>>> plane thread may introduce race conditions if not done correctly.
> > > >>>>
> > > >>>> First there are no PMD thread, do you mean eal-intr-thread ?
> > > >>>>
> > > >>>> As for this question, you can see PMDs which already implement it,
> > > >>>> they both provides mutual exclusion protection.
> > > >>>>
> > > >>>>>
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> Would something like this work better?
> > > >>>>>>>>
> > > >>>>>>>> Note: there is another bug in current code. The check for link
> > > >>>>>>>> state interrupt and link_ops could return -ENOTSUP and leave
> > > >>>>>>>> device in
> > > >>>>>> indeterminate state.
> > > >>>>>>>> The check should be done before calling PMD.
> > > >>>>>>>>
> > > >>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > >>>>>>>> index
> > > >>>>>>>> 0266cc82acb6..d6c163ed85e7 100644
> > > >>>>>>>> --- a/lib/ethdev/rte_ethdev.c
> > > >>>>>>>> +++ b/lib/ethdev/rte_ethdev.c
> > > >>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> > > >>>>>>>>           return 0;
> > > >>>>>>>>       }
> > > >>>>>>>>
> > > >>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> > > >>>>>>>> +        dev->dev_ops->link_update == NULL) {
> > > >>>>>>>> +        RTE_ETHDEV_LOG(INFO,
> > > >>>>>>>> +                   "Device with port_id=%"PRIu16" link update
> > > >>>>>>>> +not
> > > >>>>>> supported\n",
> > > >>>>>>>> +                   port_id);
> > > >>>>>>>> +            return -ENOTSUP;
> > > >>>>>>>> +    }
> > > >>>>>>>> +
> > > >>>>>>>>       ret = rte_eth_dev_info_get(port_id, &dev_info);
> > > >>>>>>>>       if (ret != 0)
> > > >>>>>>>>           return ret;
> > > >>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> > > >>>>>>>>           eth_dev_mac_restore(dev, &dev_info);
> > > >>>>>>>>
> > > >>>>>>>>       diag = (*dev->dev_ops->dev_start)(dev);
> > > >>>>>>>> -    if (diag == 0)
> > > >>>>>>>> -        dev->data->dev_started = 1;
> > > >>>>>>>> -    else
> > > >>>>>>>> +    if (diag != 0)
> > > >>>>>>>>           return eth_err(port_id, diag);
> > > >>>>>>>>
> > > >>>>>>>>       ret = eth_dev_config_restore(dev, &dev_info, port_id); @@
> > > >>>>>>>> -1611,16
> > > >>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> > > >>>>>>>>           return ret;
> > > >>>>>>>>       }
> > > >>>>>>>>
> > > >>>>>>>> -    if (dev->data->dev_conf.intr_conf.lsc == 0) {
> > > >>>>>>>> -        if (*dev->dev_ops->link_update == NULL)
> > > >>>>>>>> -            return -ENOTSUP;
> > > >>>>>>>> -        (*dev->dev_ops->link_update)(dev, 0);
> > > >>>>>>>> -    }
> > > >>>>>>>> -
> > > >>>>>>>>       /* expose selection of PMD fast-path functions */
> > > >>>>>>>>       eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> > > >>>>>>>>
> > > >>>>>>>> +    /* ensure state is set before marking device ready */
> > > >>>>>>>> +    rte_smp_wmb();
> > > >>>>>>>> +
> > > >>>>>>>>       rte_ethdev_trace_start(port_id);
> > > >>>>>>>> +
> > > >>>>>>>> +    /* Update current link state */
> > > >>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0)
> > > >>>>>>>> +        (*dev->dev_ops->link_update)(dev, 0);
> > > >>>>>>>> +
> > > >>>>>>>>       return 0;
> > > >>>>>>>>   }
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> .
> > > >>>>>>>>
> > > >>>>>
> > > >
>
  
Ferruh Yigit March 6, 2023, 11:57 a.m. UTC | #26
On 3/6/2023 10:32 AM, Konstantin Ananyev wrote:
> 
> 
>>> -----Original Message-----
>>> From: Ferruh Yigit <ferruh.yigit@amd.com>
>>> Sent: Saturday, March 4, 2023 1:19 AM
>>> To: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>; dev@dpdk.org; fengchengwen
>>> <fengchengwen@huawei.com>; Konstantin Ananyev <konstantin.ananyev@huawei.com>; Honnappa
>>> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Stephen Hemminger <stephen@networkplumber.org>;
>>> Ruifeng Wang <Ruifeng.Wang@arm.com>; Ajit Khaparde (ajit.khaparde@broadcom.com)
>>> <ajit.khaparde@broadcom.com>
>>> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
>>>
>>> On 2/26/2023 5:22 PM, Konstantin Ananyev wrote:
>>>>
>>>>>>>>>>>>>> If ethdev enqueue or dequeue function is called during
>>>>>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting
>>>>>>>>>>>>>> the function pointers, but before setting the pointer to port data.
>>>>>>>>>>>>>> In this case the newly registered enqueue/dequeue function
>>>>>>>>>>>>>> will use dummy port data and end up in seg fault.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This patch moves the updation of each data pointers before
>>>>>>>>>>>>>> updating corresponding function pointers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into
>>>>>>>>>>>>>> separate
>>>>>>>>>>>>>> structure")
>>>>>>>>>>>>>> Cc: stable@dpdk.org
>>>>>>>>>>>
>>>>>>>>>>> Why is something calling enqueue/dequeue when device is not
>>>>>>>>>>> fully
>>>>>>>>> started.
>>>>>>>>>>> A correctly written application would not call rx/tx burst
>>>>>>>>>>> until after ethdev start had finished.
>>>>>>>>>>
>>>>>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error
>>>>>>>>>> handling mode), when driver recover itself, the application may
>>>>>>>>>> still invoke
>>>>>>>>> enqueue/dequeue API.
>>>>>>>>>
>>>>>>>>> Right now DPDK ethdev layer *does not* provide synchronization
>>>>>>>>> mechanisms between data-path and control-path functions.
>>>>>>>>> That was a deliberate deisgn choice. If we want to change that
>>>>>>>>> rule, then I suppose we need a community consensus for it.
>>>>>>>>> I think that if the driver wants to provide some sort of error
>>>>>>>>> recovery procedure, then it has to provide some synchronization
>>>>>>>>> mechanism inside it between data-path and control-path functions.
>>>>>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error
>>>>>>>>> handling mode), and following patches I wonder how it creeped in?
>>>>>>>>> It seems we just introduced a loophole for race condition with
>>>>>>>>> this approach...
>>>>>>>
>>>>>>> Could you try to describe the specific scenario of loophole ?
>>>>>>
>>>>>> Ok, as I understand the existing mechanism:
>>>>>>
>>>>>> When PMD wants to start a recovery it has to:
>>>>>>   - invoke
>>>>>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>>>>>>     That supposed to call user provided callback. After callback is
>>>>>> finished PMD assumes
>>>>>>     that user is aware that recovery is about to start and should
>>>>>> make some precautions.
>>>>>> - when recovery is finished it invokes another callback:
>>>>>>    RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either
>>>>>> can continue to
>>>>>>    use port or have to treat is as faulty.
>>>>>>
>>>>>> The idea is ok in principle, but there is a problem.
>>>>>>
>>>>>> lib/ethdev/rte_ethdev.h:
>>>>>>             /** Port recovering from a hardware or firmware error.
>>>>>>           * If PMD supports proactive error recovery,
>>>>>>           * it should trigger this event to notify application
>>>>>>           * that it detected an error and the recovery is being started.
>>>>>>
>>>>>> <<< !!!!!
>>>>>>           * Upon receiving the event, the application should not
>>>>>> invoke any control path API
>>>>>>           * (such as rte_eth_dev_configure/rte_eth_dev_stop...)
>>>>>> until receiving
>>>>>>           * RTE_ETH_EVENT_RECOVERY_SUCCESS or
>>>>>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>>>>>           * The PMD will set the data path pointers to dummy
>>>>>> functions,
>>>>>>           * and re-set the data path pointers to non-dummy functions
>>>>>>           * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>>>>>> <<< !!!!!
>>>>>>
>>>>>> That part is just wrong I believe.
>>>>>> It should be:
>>>>>> Upon receiving the event, the application should not invoke any
>>>>>> *both control and data-path* API until receiving
>>>>>> RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
>>>>>> event.
>>>>>> Resetting data path pointers to dummy functions by PMD *before*
>>>>>> invoking rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
>>>>>> introduces a race-condition with data-path threads, as such thread
>>>>>> could already be inside RX/TX function or can already read RX/TX
>>>>>> function/data pointers and be about to use them.
>>>>>
>>>>> Current practices: the PMDs already add some delay after set Rx/Tx
>>>>> callback to dummy, and plus the DPDK worker thread is busypolling,
>>>>> the probability of occurence in reality is zero. But in theoretically
>>>>> exist the above race-condition.
>>>>
>>>>
>>>> Adding delay might make a problem a bit less reproducible, but it
>>>> doesn't fix it.
>>>> The bug is still there.
>>>>
>>>>
>>>>>
>>>>>> And right now rte_ethdev layer doesn't provide any mechanism to
>>>>>> check it or wait when they'll finish, etc.
>>>>>
>>>>> Yes
>>>>>
>>>>>>
>>>>>> So, probably the simplest way to fix it with existing DPDK design:
>>>>>> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return
>>>>>> only after it ensures that *all*
>>>>>>    application threads (and processes) stopped using either control
>>>>>> or data-path functions for that port
>>>>>
>>>>> Agree
>>>>>
>>>>>>    (yes it means that application that wants to use this feature has
>>>>>> to provide its own synchronization mechanism
>>>>>>    around data-path functions (RX/TX) that it is going to use).
>>>>>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
>>>>>>
>>>>>> And message to all PMD developers:
>>>>>> *please stop updating rte_eth_fp_ops[] on your own*.
>>>>>> That's a bad practice and it is not supposed to do things that way.
>>>>>> There is a special API provided for these purposes:
>>>>>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
>>>>>
>>>>> This two function is in private.h, so it should be expose to public
>>>>> header file.
>>>>
>>>> You mean we need to move these functions declarations into ethdev_driver.h?
>>>> If so, then yes, I think we probably do.
>>>>
>>>>
>>>
>>>
>>> What about making slightly different version available to drivers, which only updates
>>> function pointers, but not  'fpo->rxq' / 'fpo->txq'.
>>>
>>> This way driver can switch to between dummy and real burst function without worrying Rx/Tx
>>> queue validity.
>>>
>>> @Chengwen, @Ruifeng, can this solve the issue for relaxed memory ordering systems?
>>
>> Yes, updating only function pointers removes the synchronization requirement between function
>> pointer and qdata.
> 
> Lads, that wouldn't work anyway.
> The race between recovery procedure and data-path persists:
> Recovery still has no idea is at given moment any thread doing RX/TX or not, and there is no
> way for it to know when such thread will finish.


Yes race condition persists, but as long as data (rxq/txq) stays valid,
does it cause a trouble? At lest this fixes the potential crash I think.


> We do need some synchronization mechanism between control(recovery) and data-path threads.
> I believe it is unavoidable.   
> 
>>>
>>>
>>>
>>>>>>
>>>>>> BTW,  I don't see any implementation for
>>>>>> RTE_ETH_EVENT_ERR_RECOVERING within either testpmd or any other
>>>>>> example apps.
>>>>>> Am I missing something?
>>>>>
>>>>> Currently it just promote the event.
>>>>
>>>>
>>>> Ok, can I suggest then to add a proper usage for into in testpmd?
>>>> It looks really strange that we add new feature into ethdev (and 2
>>>> PMDs), but didn't provide any way for users to test it.
>>>>
>>>>>
>>>>>> If not, then probably it could be a good starting point - let's
>>>>>> incorporate it inside testpmd (new forwarding engine probably) so
>>>>>> everyone can test/try it.
>>>>>>
>>>>>>           * It means that the application cannot send or receive any
>>>>>> packets
>>>>>>           * during this period.
>>>>>>           * @note Before the PMD reports the recovery result,
>>>>>>           * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING
>>>>>> event again,
>>>>>>           * because a larger error may occur during the recovery.
>>>>>>           */
>>>>>>          RTE_ETH_EVENT_ERR_RECOVERING,
>>>>>>
>>>>>>>>> It probably needs to be either deprecated or reworked.
>>>>>>>> Looking at the commit, it does not say anything about the data
>>>>>>>> plane functions which probably means, the error recovery is
>>>>>>> happening within the data plane thread. What happens to other data
>>>>>>> plane threads that are polling the same port on which the error
>>>>>>> recovery is happening?
>>>>>>>
>>>>>>> The commit log says: "the PMD sets the data path pointers to dummy
>>>>>>> functions".
>>>>>>>
>>>>>>> So the data plane threads will receive non-packet and send zero
>>>>>>> with port which in error recovery.
>>>>>>>
>>>>>>>>
>>>>>>>> Also, the commit log says that while the error recovery is under
>>>>>>>> progress, the application should not call any control plane APIs.
>>>>>>>> Does
>>>>>>> that mean, the application has to check for error condition every
>>>>>>> time it calls a control plane API?
>>>>>>>
>>>>>>> If application has not register event
>>>>>>> (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control
>>>>>>> plane API, but it will return failed.
>>>>>>> If application has register above callback, it can wait for
>>>>>>> recovery result, or direct call without wait but this will return failed.
>>>>>>>
>>>>>>>>
>>>>>>>> The commit message also says that "PMD makes sure the control path
>>>>>>>> operations failed with retcode -EBUSY". It does not say how it
>>>>>>> does this. But, any communication from the PMD thread to control
>>>>>>> plane thread may introduce race conditions if not done correctly.
>>>>>>>
>>>>>>> First there are no PMD thread, do you mean eal-intr-thread ?
>>>>>>>
>>>>>>> As for this question, you can see PMDs which already implement it,
>>>>>>> they both provides mutual exclusion protection.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Would something like this work better?
>>>>>>>>>>>
>>>>>>>>>>> Note: there is another bug in current code. The check for link
>>>>>>>>>>> state interrupt and link_ops could return -ENOTSUP and leave
>>>>>>>>>>> device in
>>>>>>>>> indeterminate state.
>>>>>>>>>>> The check should be done before calling PMD.
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
>>>>>>>>>>> index
>>>>>>>>>>> 0266cc82acb6..d6c163ed85e7 100644
>>>>>>>>>>> --- a/lib/ethdev/rte_ethdev.c
>>>>>>>>>>> +++ b/lib/ethdev/rte_ethdev.c
>>>>>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>>>>           return 0;
>>>>>>>>>>>       }
>>>>>>>>>>>
>>>>>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0 &&
>>>>>>>>>>> +        dev->dev_ops->link_update == NULL) {
>>>>>>>>>>> +        RTE_ETHDEV_LOG(INFO,
>>>>>>>>>>> +                   "Device with port_id=%"PRIu16" link update
>>>>>>>>>>> +not
>>>>>>>>> supported\n",
>>>>>>>>>>> +                   port_id);
>>>>>>>>>>> +            return -ENOTSUP;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>>       ret = rte_eth_dev_info_get(port_id, &dev_info);
>>>>>>>>>>>       if (ret != 0)
>>>>>>>>>>>           return ret;
>>>>>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>>>>           eth_dev_mac_restore(dev, &dev_info);
>>>>>>>>>>>
>>>>>>>>>>>       diag = (*dev->dev_ops->dev_start)(dev);
>>>>>>>>>>> -    if (diag == 0)
>>>>>>>>>>> -        dev->data->dev_started = 1;
>>>>>>>>>>> -    else
>>>>>>>>>>> +    if (diag != 0)
>>>>>>>>>>>           return eth_err(port_id, diag);
>>>>>>>>>>>
>>>>>>>>>>>       ret = eth_dev_config_restore(dev, &dev_info, port_id); @@
>>>>>>>>>>> -1611,16
>>>>>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
>>>>>>>>>>>           return ret;
>>>>>>>>>>>       }
>>>>>>>>>>>
>>>>>>>>>>> -    if (dev->data->dev_conf.intr_conf.lsc == 0) {
>>>>>>>>>>> -        if (*dev->dev_ops->link_update == NULL)
>>>>>>>>>>> -            return -ENOTSUP;
>>>>>>>>>>> -        (*dev->dev_ops->link_update)(dev, 0);
>>>>>>>>>>> -    }
>>>>>>>>>>> -
>>>>>>>>>>>       /* expose selection of PMD fast-path functions */
>>>>>>>>>>>       eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
>>>>>>>>>>>
>>>>>>>>>>> +    /* ensure state is set before marking device ready */
>>>>>>>>>>> +    rte_smp_wmb();
>>>>>>>>>>> +
>>>>>>>>>>>       rte_ethdev_trace_start(port_id);
>>>>>>>>>>> +
>>>>>>>>>>> +    /* Update current link state */
>>>>>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0)
>>>>>>>>>>> +        (*dev->dev_ops->link_update)(dev, 0);
>>>>>>>>>>> +
>>>>>>>>>>>       return 0;
>>>>>>>>>>>   }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> .
>>>>>>>>>>>
>>>>>>>>
>>>>
>
  
Konstantin Ananyev March 6, 2023, 12:36 p.m. UTC | #27
> On 3/6/2023 10:32 AM, Konstantin Ananyev wrote:
> >
> >
> >>> -----Original Message-----
> >>> From: Ferruh Yigit <ferruh.yigit@amd.com>
> >>> Sent: Saturday, March 4, 2023 1:19 AM
> >>> To: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>; dev@dpdk.org; fengchengwen
> >>> <fengchengwen@huawei.com>; Konstantin Ananyev <konstantin.ananyev@huawei.com>; Honnappa
> >>> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Stephen Hemminger <stephen@networkplumber.org>;
> >>> Ruifeng Wang <Ruifeng.Wang@arm.com>; Ajit Khaparde (ajit.khaparde@broadcom.com)
> >>> <ajit.khaparde@broadcom.com>
> >>> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
> >>>
> >>> On 2/26/2023 5:22 PM, Konstantin Ananyev wrote:
> >>>>
> >>>>>>>>>>>>>> If ethdev enqueue or dequeue function is called during
> >>>>>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> >>>>>>>>>>>>>> the function pointers, but before setting the pointer to port data.
> >>>>>>>>>>>>>> In this case the newly registered enqueue/dequeue function
> >>>>>>>>>>>>>> will use dummy port data and end up in seg fault.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This patch moves the updation of each data pointers before
> >>>>>>>>>>>>>> updating corresponding function pointers.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into
> >>>>>>>>>>>>>> separate
> >>>>>>>>>>>>>> structure")
> >>>>>>>>>>>>>> Cc: stable@dpdk.org
> >>>>>>>>>>>
> >>>>>>>>>>> Why is something calling enqueue/dequeue when device is not
> >>>>>>>>>>> fully
> >>>>>>>>> started.
> >>>>>>>>>>> A correctly written application would not call rx/tx burst
> >>>>>>>>>>> until after ethdev start had finished.
> >>>>>>>>>>
> >>>>>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error
> >>>>>>>>>> handling mode), when driver recover itself, the application may
> >>>>>>>>>> still invoke
> >>>>>>>>> enqueue/dequeue API.
> >>>>>>>>>
> >>>>>>>>> Right now DPDK ethdev layer *does not* provide synchronization
> >>>>>>>>> mechanisms between data-path and control-path functions.
> >>>>>>>>> That was a deliberate deisgn choice. If we want to change that
> >>>>>>>>> rule, then I suppose we need a community consensus for it.
> >>>>>>>>> I think that if the driver wants to provide some sort of error
> >>>>>>>>> recovery procedure, then it has to provide some synchronization
> >>>>>>>>> mechanism inside it between data-path and control-path functions.
> >>>>>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error
> >>>>>>>>> handling mode), and following patches I wonder how it creeped in?
> >>>>>>>>> It seems we just introduced a loophole for race condition with
> >>>>>>>>> this approach...
> >>>>>>>
> >>>>>>> Could you try to describe the specific scenario of loophole ?
> >>>>>>
> >>>>>> Ok, as I understand the existing mechanism:
> >>>>>>
> >>>>>> When PMD wants to start a recovery it has to:
> >>>>>>   - invoke
> >>>>>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> >>>>>>     That supposed to call user provided callback. After callback is
> >>>>>> finished PMD assumes
> >>>>>>     that user is aware that recovery is about to start and should
> >>>>>> make some precautions.
> >>>>>> - when recovery is finished it invokes another callback:
> >>>>>>    RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either
> >>>>>> can continue to
> >>>>>>    use port or have to treat is as faulty.
> >>>>>>
> >>>>>> The idea is ok in principle, but there is a problem.
> >>>>>>
> >>>>>> lib/ethdev/rte_ethdev.h:
> >>>>>>             /** Port recovering from a hardware or firmware error.
> >>>>>>           * If PMD supports proactive error recovery,
> >>>>>>           * it should trigger this event to notify application
> >>>>>>           * that it detected an error and the recovery is being started.
> >>>>>>
> >>>>>> <<< !!!!!
> >>>>>>           * Upon receiving the event, the application should not
> >>>>>> invoke any control path API
> >>>>>>           * (such as rte_eth_dev_configure/rte_eth_dev_stop...)
> >>>>>> until receiving
> >>>>>>           * RTE_ETH_EVENT_RECOVERY_SUCCESS or
> >>>>>> RTE_ETH_EVENT_RECOVERY_FAILED event.
> >>>>>>           * The PMD will set the data path pointers to dummy
> >>>>>> functions,
> >>>>>>           * and re-set the data path pointers to non-dummy functions
> >>>>>>           * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> >>>>>> <<< !!!!!
> >>>>>>
> >>>>>> That part is just wrong I believe.
> >>>>>> It should be:
> >>>>>> Upon receiving the event, the application should not invoke any
> >>>>>> *both control and data-path* API until receiving
> >>>>>> RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
> >>>>>> event.
> >>>>>> Resetting data path pointers to dummy functions by PMD *before*
> >>>>>> invoking rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> >>>>>> introduces a race-condition with data-path threads, as such thread
> >>>>>> could already be inside RX/TX function or can already read RX/TX
> >>>>>> function/data pointers and be about to use them.
> >>>>>
> >>>>> Current practices: the PMDs already add some delay after set Rx/Tx
> >>>>> callback to dummy, and plus the DPDK worker thread is busypolling,
> >>>>> the probability of occurence in reality is zero. But in theoretically
> >>>>> exist the above race-condition.
> >>>>
> >>>>
> >>>> Adding delay might make a problem a bit less reproducible, but it
> >>>> doesn't fix it.
> >>>> The bug is still there.
> >>>>
> >>>>
> >>>>>
> >>>>>> And right now rte_ethdev layer doesn't provide any mechanism to
> >>>>>> check it or wait when they'll finish, etc.
> >>>>>
> >>>>> Yes
> >>>>>
> >>>>>>
> >>>>>> So, probably the simplest way to fix it with existing DPDK design:
> >>>>>> - user level callback  RTE_ETH_EVENT_ERR_RECOVERING should return
> >>>>>> only after it ensures that *all*
> >>>>>>    application threads (and processes) stopped using either control
> >>>>>> or data-path functions for that port
> >>>>>
> >>>>> Agree
> >>>>>
> >>>>>>    (yes it means that application that wants to use this feature has
> >>>>>> to provide its own synchronization mechanism
> >>>>>>    around data-path functions (RX/TX) that it is going to use).
> >>>>>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
> >>>>>>
> >>>>>> And message to all PMD developers:
> >>>>>> *please stop updating rte_eth_fp_ops[] on your own*.
> >>>>>> That's a bad practice and it is not supposed to do things that way.
> >>>>>> There is a special API provided for these purposes:
> >>>>>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
> >>>>>
> >>>>> This two function is in private.h, so it should be expose to public
> >>>>> header file.
> >>>>
> >>>> You mean we need to move these functions declarations into ethdev_driver.h?
> >>>> If so, then yes, I think we probably do.
> >>>>
> >>>>
> >>>
> >>>
> >>> What about making slightly different version available to drivers, which only updates
> >>> function pointers, but not  'fpo->rxq' / 'fpo->txq'.
> >>>
> >>> This way driver can switch to between dummy and real burst function without worrying Rx/Tx
> >>> queue validity.
> >>>
> >>> @Chengwen, @Ruifeng, can this solve the issue for relaxed memory ordering systems?
> >>
> >> Yes, updating only function pointers removes the synchronization requirement between function
> >> pointer and qdata.
> >
> > Lads, that wouldn't work anyway.
> > The race between recovery procedure and data-path persists:
> > Recovery still has no idea is at given moment any thread doing RX/TX or not, and there is no
> > way for it to know when such thread will finish.
> 
> 
> Yes race condition persists, but as long as data (rxq/txq) stays valid,
> does it cause a trouble? At lest this fixes the potential crash I think.

Yes, I believe it still would cause the trouble.
We still have control thread and RX/TX threads simultaneously accessing rxq/txq data and
probably trying to access/modify the same HW registers.
With current ethdev design (no sync between control and daya-path) 
dev_fp_ops_setup()  and RX/TX functions should not be called simultaneously.

> 
> > We do need some synchronization mechanism between control(recovery) and data-path threads.
> > I believe it is unavoidable.
> >
> >>>
> >>>
> >>>
> >>>>>>
> >>>>>> BTW,  I don't see any implementation for
> >>>>>> RTE_ETH_EVENT_ERR_RECOVERING within either testpmd or any other
> >>>>>> example apps.
> >>>>>> Am I missing something?
> >>>>>
> >>>>> Currently it just promote the event.
> >>>>
> >>>>
> >>>> Ok, can I suggest then to add a proper usage for into in testpmd?
> >>>> It looks really strange that we add new feature into ethdev (and 2
> >>>> PMDs), but didn't provide any way for users to test it.
> >>>>
> >>>>>
> >>>>>> If not, then probably it could be a good starting point - let's
> >>>>>> incorporate it inside testpmd (new forwarding engine probably) so
> >>>>>> everyone can test/try it.
> >>>>>>
> >>>>>>           * It means that the application cannot send or receive any
> >>>>>> packets
> >>>>>>           * during this period.
> >>>>>>           * @note Before the PMD reports the recovery result,
> >>>>>>           * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING
> >>>>>> event again,
> >>>>>>           * because a larger error may occur during the recovery.
> >>>>>>           */
> >>>>>>          RTE_ETH_EVENT_ERR_RECOVERING,
> >>>>>>
> >>>>>>>>> It probably needs to be either deprecated or reworked.
> >>>>>>>> Looking at the commit, it does not say anything about the data
> >>>>>>>> plane functions which probably means, the error recovery is
> >>>>>>> happening within the data plane thread. What happens to other data
> >>>>>>> plane threads that are polling the same port on which the error
> >>>>>>> recovery is happening?
> >>>>>>>
> >>>>>>> The commit log says: "the PMD sets the data path pointers to dummy
> >>>>>>> functions".
> >>>>>>>
> >>>>>>> So the data plane threads will receive non-packet and send zero
> >>>>>>> with port which in error recovery.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Also, the commit log says that while the error recovery is under
> >>>>>>>> progress, the application should not call any control plane APIs.
> >>>>>>>> Does
> >>>>>>> that mean, the application has to check for error condition every
> >>>>>>> time it calls a control plane API?
> >>>>>>>
> >>>>>>> If application has not register event
> >>>>>>> (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control
> >>>>>>> plane API, but it will return failed.
> >>>>>>> If application has register above callback, it can wait for
> >>>>>>> recovery result, or direct call without wait but this will return failed.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> The commit message also says that "PMD makes sure the control path
> >>>>>>>> operations failed with retcode -EBUSY". It does not say how it
> >>>>>>> does this. But, any communication from the PMD thread to control
> >>>>>>> plane thread may introduce race conditions if not done correctly.
> >>>>>>>
> >>>>>>> First there are no PMD thread, do you mean eal-intr-thread ?
> >>>>>>>
> >>>>>>> As for this question, you can see PMDs which already implement it,
> >>>>>>> they both provides mutual exclusion protection.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Would something like this work better?
> >>>>>>>>>>>
> >>>>>>>>>>> Note: there is another bug in current code. The check for link
> >>>>>>>>>>> state interrupt and link_ops could return -ENOTSUP and leave
> >>>>>>>>>>> device in
> >>>>>>>>> indeterminate state.
> >>>>>>>>>>> The check should be done before calling PMD.
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> >>>>>>>>>>> index
> >>>>>>>>>>> 0266cc82acb6..d6c163ed85e7 100644
> >>>>>>>>>>> --- a/lib/ethdev/rte_ethdev.c
> >>>>>>>>>>> +++ b/lib/ethdev/rte_ethdev.c
> >>>>>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> >>>>>>>>>>>           return 0;
> >>>>>>>>>>>       }
> >>>>>>>>>>>
> >>>>>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> >>>>>>>>>>> +        dev->dev_ops->link_update == NULL) {
> >>>>>>>>>>> +        RTE_ETHDEV_LOG(INFO,
> >>>>>>>>>>> +                   "Device with port_id=%"PRIu16" link update
> >>>>>>>>>>> +not
> >>>>>>>>> supported\n",
> >>>>>>>>>>> +                   port_id);
> >>>>>>>>>>> +            return -ENOTSUP;
> >>>>>>>>>>> +    }
> >>>>>>>>>>> +
> >>>>>>>>>>>       ret = rte_eth_dev_info_get(port_id, &dev_info);
> >>>>>>>>>>>       if (ret != 0)
> >>>>>>>>>>>           return ret;
> >>>>>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> >>>>>>>>>>>           eth_dev_mac_restore(dev, &dev_info);
> >>>>>>>>>>>
> >>>>>>>>>>>       diag = (*dev->dev_ops->dev_start)(dev);
> >>>>>>>>>>> -    if (diag == 0)
> >>>>>>>>>>> -        dev->data->dev_started = 1;
> >>>>>>>>>>> -    else
> >>>>>>>>>>> +    if (diag != 0)
> >>>>>>>>>>>           return eth_err(port_id, diag);
> >>>>>>>>>>>
> >>>>>>>>>>>       ret = eth_dev_config_restore(dev, &dev_info, port_id); @@
> >>>>>>>>>>> -1611,16
> >>>>>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> >>>>>>>>>>>           return ret;
> >>>>>>>>>>>       }
> >>>>>>>>>>>
> >>>>>>>>>>> -    if (dev->data->dev_conf.intr_conf.lsc == 0) {
> >>>>>>>>>>> -        if (*dev->dev_ops->link_update == NULL)
> >>>>>>>>>>> -            return -ENOTSUP;
> >>>>>>>>>>> -        (*dev->dev_ops->link_update)(dev, 0);
> >>>>>>>>>>> -    }
> >>>>>>>>>>> -
> >>>>>>>>>>>       /* expose selection of PMD fast-path functions */
> >>>>>>>>>>>       eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> >>>>>>>>>>>
> >>>>>>>>>>> +    /* ensure state is set before marking device ready */
> >>>>>>>>>>> +    rte_smp_wmb();
> >>>>>>>>>>> +
> >>>>>>>>>>>       rte_ethdev_trace_start(port_id);
> >>>>>>>>>>> +
> >>>>>>>>>>> +    /* Update current link state */
> >>>>>>>>>>> +    if (dev->data->dev_conf.intr_conf.lsc == 0)
> >>>>>>>>>>> +        (*dev->dev_ops->link_update)(dev, 0);
> >>>>>>>>>>> +
> >>>>>>>>>>>       return 0;
> >>>>>>>>>>>   }
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> .
> >>>>>>>>>>>
> >>>>>>>>
> >>>>
> >
  

Patch

diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 48090c879a..a0232c669f 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -270,17 +270,17 @@  void
 eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
 		const struct rte_eth_dev *dev)
 {
+	fpo->rxq.data = dev->data->rx_queues;
 	fpo->rx_pkt_burst = dev->rx_pkt_burst;
+	fpo->txq.data = dev->data->tx_queues;
 	fpo->tx_pkt_burst = dev->tx_pkt_burst;
 	fpo->tx_pkt_prepare = dev->tx_pkt_prepare;
 	fpo->rx_queue_count = dev->rx_queue_count;
 	fpo->rx_descriptor_status = dev->rx_descriptor_status;
 	fpo->tx_descriptor_status = dev->tx_descriptor_status;
 
-	fpo->rxq.data = dev->data->rx_queues;
 	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
 
-	fpo->txq.data = dev->data->tx_queues;
 	fpo->txq.clbk = (void **)(uintptr_t)dev->pre_tx_burst_cbs;
 }