[v8,1/4] ethdev: support device error recovery notification

Message ID 20220616094122.1909-2-fengchengwen@huawei.com (mailing list archive)
State Changes Requested, archived
Delegated to: Andrew Rybchenko
Headers
Series ethdev: support error recovery notification |

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Chengwen Feng June 16, 2022, 9:41 a.m. UTC
  From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try
to recover from the errors. In this process, the PMD sets the data path
pointers to dummy functions (which will prevent the crash), and also
make sure the control path operations failed with retcode -EBUSY.

Also in this process, from the perspective of application, services are
affected. For example, the Rx/Tx bust APIs cannot receive and send
packets, and the control plane API return failure.

In some service scenarios, application needs to be aware of the event
to determine whether to migrate services. So three events were
introduced:

1. RTE_ETH_EVENT_ERR_RECOVERING: the PMD must trigger this event to
notify the application that it detected a hardware or firmware error
and tries to recover.
2. RTE_ETH_EVENT_RECOVER_SUCCESS: the PMD must trigger this event to
notify the application that it has recovered from the error. And PMD
already re-configures the port to the state prior to the error.
3. RTE_ETH_EVENT_RECOVER_FAILED: the PMD must trigger this event to
notify the application that it has failed to recover from the error.
The port may not be usable anymore.

Note: the error recovery of these events is mainly performed by the
PMD. Unlike the RTE_ETH_EVENT_INTR_RESET which the error recovery is
performed by the application. The PMD must ensure that the above two
error handling methods cannot be used at the same time.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 doc/guides/prog_guide/poll_mode_drv.rst | 32 +++++++++++++++++++++++++
 doc/guides/rel_notes/release_22_07.rst  | 11 +++++++++
 lib/ethdev/rte_ethdev.h                 |  6 +++++
 3 files changed, 49 insertions(+)
  

Comments

Thomas Monjalon June 20, 2022, 5:42 p.m. UTC | #1
16/06/2022 11:41, Chengwen Feng:
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -3928,6 +3928,12 @@ enum rte_eth_event_type {
>  	 * @see rte_eth_rx_avail_thresh_set()
>  	 */
>  	RTE_ETH_EVENT_RX_AVAIL_THRESH,
> +	/** Port recovering from a hardware or firmware error */
> +	RTE_ETH_EVENT_ERR_RECOVERING,
> +	/** Port recovers successful from the error */
> +	RTE_ETH_EVENT_RECOVER_SUCCESS,
> +	/** Port recovers failed from the error */
> +	RTE_ETH_EVENT_RECOVER_FAILED,
>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>  };

The descriptions here are not enough.
We cannot understand what has changed on the port,
and which action must be taken.

Note: I believe that's the same problem from the beginning:
we need a generic handling which is described in details.
  
Chengwen Feng June 21, 2022, 1:38 a.m. UTC | #2
Hi Thomas,

On 2022/6/21 1:42, Thomas Monjalon wrote:
> 16/06/2022 11:41, Chengwen Feng:
>> --- a/lib/ethdev/rte_ethdev.h
>> +++ b/lib/ethdev/rte_ethdev.h
>> @@ -3928,6 +3928,12 @@ enum rte_eth_event_type {
>>  	 * @see rte_eth_rx_avail_thresh_set()
>>  	 */
>>  	RTE_ETH_EVENT_RX_AVAIL_THRESH,
>> +	/** Port recovering from a hardware or firmware error */
>> +	RTE_ETH_EVENT_ERR_RECOVERING,
>> +	/** Port recovers successful from the error */
>> +	RTE_ETH_EVENT_RECOVER_SUCCESS,
>> +	/** Port recovers failed from the error */
>> +	RTE_ETH_EVENT_RECOVER_FAILED,
>>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>>  };
> 
> The descriptions here are not enough.
> We cannot understand what has changed on the port,
> and which action must be taken.

There are detail descriptions in /doc/guides/prog_guide/poll_mode_drv.rst,
I will add your review in poll_mode_drv.rst.

Another question: do we need to add a detail description here as well? I think the poll_mode_drv.rst is enough.

> 
> Note: I believe that's the same problem from the beginning:
> we need a generic handling which is described in details.
> 
> 
> 
> .
>
  
Thomas Monjalon June 21, 2022, 7:04 a.m. UTC | #3
21/06/2022 03:38, fengchengwen:
> Hi Thomas,
> 
> On 2022/6/21 1:42, Thomas Monjalon wrote:
> > 16/06/2022 11:41, Chengwen Feng:
> >> --- a/lib/ethdev/rte_ethdev.h
> >> +++ b/lib/ethdev/rte_ethdev.h
> >> @@ -3928,6 +3928,12 @@ enum rte_eth_event_type {
> >>  	 * @see rte_eth_rx_avail_thresh_set()
> >>  	 */
> >>  	RTE_ETH_EVENT_RX_AVAIL_THRESH,
> >> +	/** Port recovering from a hardware or firmware error */
> >> +	RTE_ETH_EVENT_ERR_RECOVERING,
> >> +	/** Port recovers successful from the error */
> >> +	RTE_ETH_EVENT_RECOVER_SUCCESS,
> >> +	/** Port recovers failed from the error */
> >> +	RTE_ETH_EVENT_RECOVER_FAILED,
> >>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
> >>  };
> > 
> > The descriptions here are not enough.
> > We cannot understand what has changed on the port,
> > and which action must be taken.
> 
> There are detail descriptions in /doc/guides/prog_guide/poll_mode_drv.rst,
> I will add your review in poll_mode_drv.rst.
> 
> Another question: do we need to add a detail description here as well? I think the poll_mode_drv.rst is enough.

It is the opposite: the RST guide is to give the overview,
while the doxygen comments are the precise API documentation.
You need to explain what is the state of the device.
Is it the same as after a call to rte_eth_dev_stop() ?
The application needs to know what must be reconfigured.
  
Ray Kinsella June 23, 2022, 3:58 p.m. UTC | #4
Chengwen Feng <fengchengwen@huawei.com> writes:

> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>
> Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try
> to recover from the errors. In this process, the PMD sets the data path
> pointers to dummy functions (which will prevent the crash), and also
> make sure the control path operations failed with retcode -EBUSY.
>
> Also in this process, from the perspective of application, services are
> affected. For example, the Rx/Tx bust APIs cannot receive and send
> packets, and the control plane API return failure.
>
> In some service scenarios, application needs to be aware of the event
> to determine whether to migrate services. So three events were
> introduced:
>
> 1. RTE_ETH_EVENT_ERR_RECOVERING: the PMD must trigger this event to
> notify the application that it detected a hardware or firmware error
> and tries to recover.
> 2. RTE_ETH_EVENT_RECOVER_SUCCESS: the PMD must trigger this event to
> notify the application that it has recovered from the error. And PMD
> already re-configures the port to the state prior to the error.
> 3. RTE_ETH_EVENT_RECOVER_FAILED: the PMD must trigger this event to
> notify the application that it has failed to recover from the error.
> The port may not be usable anymore.
>
> Note: the error recovery of these events is mainly performed by the
> PMD. Unlike the RTE_ETH_EVENT_INTR_RESET which the error recovery is
> performed by the application. The PMD must ensure that the above two
> error handling methods cannot be used at the same time.
>
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> ---
>  doc/guides/prog_guide/poll_mode_drv.rst | 32 +++++++++++++++++++++++++
>  doc/guides/rel_notes/release_22_07.rst  | 11 +++++++++
>  lib/ethdev/rte_ethdev.h                 |  6 +++++
>  3 files changed, 49 insertions(+)
>
> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
> index 9d081b1cba..6398917485 100644
> --- a/doc/guides/prog_guide/poll_mode_drv.rst
> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
> @@ -627,3 +627,35 @@ by application.
>  The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
>  the application to handle reset event. It is duty of application to
>  handle all synchronization before it calls rte_eth_dev_reset().
> +
> +Error Recovery Notification
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try to
> +recover from the errors. In this process, the PMD sets the data path pointers
> +to dummy functions (which will prevent the crash), and also make sure the
> +control path operations failed with retcode -EBUSY.
> +
> +Also in this process, from the perspective of application, services are
> +affected. For example, the Rx/Tx bust APIs cannot receive and send packets,
> +and the control plane API return failure.
> +
> +In some service scenarios, application needs to be aware of the event to
> +determine whether to migrate services. So three events was introduced.
> +
> +The PMD must trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
> +application that it detected a hardware or firmware error and tries to recover.
> +
> +The PMD must trigger RTE_ETH_EVENT_RECOVER_SUCCESS event to notify the
> +application that it has recovered from the error. And PMD already re-configures
> +the port to the state prior to the error.
> +
> +The PMD must trigger RTE_ETH_EVENT_RECOVER_FAILED event to notify the
> +application that it has failed to recover from the error. The port may not be
> +usable anymore.
> +
> +.. note::
> +        The error recovery of these events is mainly performed by the PMD.
> +        Unlike the RTE_ETH_EVENT_INTR_RESET which the error recovery is
> +        performed by the application. The PMD must ensure that the above two
> +        error handling methods cannot be used at the same time.
> diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
> index 6fc044edaa..b237bd3303 100644
> --- a/doc/guides/rel_notes/release_22_07.rst
> +++ b/doc/guides/rel_notes/release_22_07.rst
> @@ -108,6 +108,17 @@ New Features
>  
>    Added an API which can get the device type of vDPA device.
>  
> +* **Added error recover notification.**
> +
> +  Added error recover notification to application including:
> +
> +  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
> +    that the port is recovering from an error.
> +  * Added new event: ``RTE_ETH_EVENT_RECOVER_SUCCESS`` for the PMD to report
> +    that the port recover successful from an error.

RTE_ETH_EVENT_RECOVERY_SUCCESS

> +  * Added new event: ``RTE_ETH_EVENT_RECOVER_FAILED`` for the PMD to
> report

RTE_ETH_EVENT_RECOVERY_FAILED

> +    that the prot recover failed from an error

to report that port recovery failed

> +
>  * **Updated Amazon ena driver.**
>  
>    The new driver version (v2.7.0) includes:
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 045ee64747..6998f6f0be 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -3928,6 +3928,12 @@ enum rte_eth_event_type {
>  	 * @see rte_eth_rx_avail_thresh_set()
>  	 */
>  	RTE_ETH_EVENT_RX_AVAIL_THRESH,
> +	/** Port recovering from a hardware or firmware error */
> +	RTE_ETH_EVENT_ERR_RECOVERING,
> +	/** Port recovers successful from the error */
> +	RTE_ETH_EVENT_RECOVER_SUCCESS,

RTE_ETH_EVENT_RECOVERY_SUCCESS

> +	/** Port recovers failed from the error */
> +	RTE_ETH_EVENT_RECOVER_FAILED,
RTE_ETH_EVENT_RECOVERY_FAILED

>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>  };
  
Chengwen Feng Sept. 22, 2022, 7:53 a.m. UTC | #5
Hi Thomas,

On 2022/6/21 15:04, Thomas Monjalon wrote:
> 21/06/2022 03:38, fengchengwen:
>> Hi Thomas,
>>
>> On 2022/6/21 1:42, Thomas Monjalon wrote:
>>> 16/06/2022 11:41, Chengwen Feng:
>>>> --- a/lib/ethdev/rte_ethdev.h
>>>> +++ b/lib/ethdev/rte_ethdev.h
>>>> @@ -3928,6 +3928,12 @@ enum rte_eth_event_type {
>>>>  	 * @see rte_eth_rx_avail_thresh_set()
>>>>  	 */
>>>>  	RTE_ETH_EVENT_RX_AVAIL_THRESH,
>>>> +	/** Port recovering from a hardware or firmware error */
>>>> +	RTE_ETH_EVENT_ERR_RECOVERING,
>>>> +	/** Port recovers successful from the error */
>>>> +	RTE_ETH_EVENT_RECOVER_SUCCESS,
>>>> +	/** Port recovers failed from the error */
>>>> +	RTE_ETH_EVENT_RECOVER_FAILED,
>>>>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>>>>  };
>>>
>>> The descriptions here are not enough.
>>> We cannot understand what has changed on the port,
>>> and which action must be taken.
>>
>> There are detail descriptions in /doc/guides/prog_guide/poll_mode_drv.rst,
>> I will add your review in poll_mode_drv.rst.
>>
>> Another question: do we need to add a detail description here as well? I think the poll_mode_drv.rst is enough.
> 
> It is the opposite: the RST guide is to give the overview,
> while the doxygen comments are the precise API documentation.
> You need to explain what is the state of the device.
> Is it the same as after a call to rte_eth_dev_stop() ?
> The application needs to know what must be reconfigured.

V9 already sent to address these, please take time to review, thanks.

> 
> 
> 
> 
> .
>
  

Patch

diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 9d081b1cba..6398917485 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -627,3 +627,35 @@  by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+Error Recovery Notification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try to
+recover from the errors. In this process, the PMD sets the data path pointers
+to dummy functions (which will prevent the crash), and also make sure the
+control path operations failed with retcode -EBUSY.
+
+Also in this process, from the perspective of application, services are
+affected. For example, the Rx/Tx bust APIs cannot receive and send packets,
+and the control plane API return failure.
+
+In some service scenarios, application needs to be aware of the event to
+determine whether to migrate services. So three events was introduced.
+
+The PMD must trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
+application that it detected a hardware or firmware error and tries to recover.
+
+The PMD must trigger RTE_ETH_EVENT_RECOVER_SUCCESS event to notify the
+application that it has recovered from the error. And PMD already re-configures
+the port to the state prior to the error.
+
+The PMD must trigger RTE_ETH_EVENT_RECOVER_FAILED event to notify the
+application that it has failed to recover from the error. The port may not be
+usable anymore.
+
+.. note::
+        The error recovery of these events is mainly performed by the PMD.
+        Unlike the RTE_ETH_EVENT_INTR_RESET which the error recovery is
+        performed by the application. The PMD must ensure that the above two
+        error handling methods cannot be used at the same time.
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 6fc044edaa..b237bd3303 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -108,6 +108,17 @@  New Features
 
   Added an API which can get the device type of vDPA device.
 
+* **Added error recover notification.**
+
+  Added error recover notification to application including:
+
+  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
+    that the port is recovering from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVER_SUCCESS`` for the PMD to report
+    that the port recover successful from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVER_FAILED`` for the PMD to report
+    that the prot recover failed from an error.
+
 * **Updated Amazon ena driver.**
 
   The new driver version (v2.7.0) includes:
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 045ee64747..6998f6f0be 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -3928,6 +3928,12 @@  enum rte_eth_event_type {
 	 * @see rte_eth_rx_avail_thresh_set()
 	 */
 	RTE_ETH_EVENT_RX_AVAIL_THRESH,
+	/** Port recovering from a hardware or firmware error */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+	/** Port recovers successful from the error */
+	RTE_ETH_EVENT_RECOVER_SUCCESS,
+	/** Port recovers failed from the error */
+	RTE_ETH_EVENT_RECOVER_FAILED,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };