[v6,1/3] ethdev: support device reset and recovery events

Message ID 20201009034832.10302-2-kalesh-anakkur.purayil@broadcom.com (mailing list archive)
State Changes Requested, archived
Delegated to: Ferruh Yigit
Headers
Series librte_ethdev: error recovery support |

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Kalesh A P Oct. 9, 2020, 3:48 a.m. UTC
  From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Adding support for device reset and recovery events in the
rte_eth_event framework. FW error and FW reset conditions would be
managed internally by PMD without needing application intervention.
In such cases, PMD would need reset/recovery events to notify application
that PMD is undergoing a reset.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Asaf Penso <asafp@nvidia.com>
---
 doc/guides/prog_guide/poll_mode_drv.rst | 18 ++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h          | 17 +++++++++++++++++
 2 files changed, 35 insertions(+)
  

Comments

Thomas Monjalon Oct. 11, 2020, 9:29 p.m. UTC | #1
09/10/2020 05:48, Kalesh A P:
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> 
> Adding support for device reset and recovery events in the
> rte_eth_event framework. FW error and FW reset conditions would be
> managed internally by PMD without needing application intervention.
> In such cases, PMD would need reset/recovery events to notify application
> that PMD is undergoing a reset.
> 
> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> Reviewed-by: Asaf Penso <asafp@nvidia.com>

The ethdev maintainers are not Cc'ed.
Please use the option --cc-cmd devtools/get-maintainer.sh


> +Error recovery support
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> +When the PMD detects a FW reset or error condition, it will try to recover
> +from the error without needing the application intervention. In such cases,
> +PMD would need events to notify the application that it is undergoing
> +an error recovery.
> +
> +The PMD will trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
> +application that PMD detected a FW reset or FW error condition. PMD will
> +try to recover from the error by itself. Data path will be halted and
> +control path operations would fail during the recovery period.
> +
> +The PMD will trigger RTE_ETH_EVENT_RECOVERED event to notify the application
> +that the it has recovered from the error condition. Control path and data path
> +are up now. Since the device undergone a reset, flow rules offloaded prior to
> +the reset will be lost and the application has to recreate the rules again.
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 9759f13..9b4b015 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -3207,6 +3207,23 @@ enum rte_eth_event_type {
>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>  	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
> +	RTE_ETH_EVENT_ERR_RECOVERING,
> +			/**< port recovering from an error
> +			 *
> +			 * PMD detected a FW reset or error condition.
> +			 * PMD will try to recover from the error.
> +			 * Data path will be halted and Control path operations
> +			 * would fail at this time.
> +			 */

Does it mean the application has nothing to do when receiving this event?
I think the app should stop polling at least.

> +	RTE_ETH_EVENT_RECOVERED,
> +			/**< port recovered from an error
> +			 *
> +			 * PMD has recovered from the error condition.
> +			 * Control path and Data path are up now.
> +			 * Since the device undergone a reset, flow rules
> +			 * offloaded prior to the reset will be lost and
> +			 * the application has to recreate the rules again.
> +			 */

Please be more precise.
Should the app re-configure the port, setup the queues, start the port?
  
Andrew Rybchenko Oct. 12, 2020, 8:09 a.m. UTC | #2
On 10/12/20 12:29 AM, Thomas Monjalon wrote:
> 09/10/2020 05:48, Kalesh A P:
>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>
>> Adding support for device reset and recovery events in the
>> rte_eth_event framework. FW error and FW reset conditions would be
>> managed internally by PMD without needing application intervention.
>> In such cases, PMD would need reset/recovery events to notify application
>> that PMD is undergoing a reset.
>>
>> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
>> Reviewed-by: Asaf Penso <asafp@nvidia.com>
> 
> The ethdev maintainers are not Cc'ed.
> Please use the option --cc-cmd devtools/get-maintainer.sh
> 
> 
>> +Error recovery support
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +When the PMD detects a FW reset or error condition, it will try to recover
>> +from the error without needing the application intervention. In such cases,
>> +PMD would need events to notify the application that it is undergoing
>> +an error recovery.
>> +
>> +The PMD will trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
>> +application that PMD detected a FW reset or FW error condition. PMD will
>> +try to recover from the error by itself. Data path will be halted and
>> +control path operations would fail during the recovery period.
>> +
>> +The PMD will trigger RTE_ETH_EVENT_RECOVERED event to notify the application
>> +that the it has recovered from the error condition. Control path and data path
>> +are up now. Since the device undergone a reset, flow rules offloaded prior to
>> +the reset will be lost and the application has to recreate the rules again.

What should be done if the state is not recoverable?

>> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
>> index 9759f13..9b4b015 100644
>> --- a/lib/librte_ethdev/rte_ethdev.h
>> +++ b/lib/librte_ethdev/rte_ethdev.h
>> @@ -3207,6 +3207,23 @@ enum rte_eth_event_type {
>>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>>  	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
>> +	RTE_ETH_EVENT_ERR_RECOVERING,
>> +			/**< port recovering from an error
>> +			 *
>> +			 * PMD detected a FW reset or error condition.
>> +			 * PMD will try to recover from the error.
>> +			 * Data path will be halted and Control path operations
>> +			 * would fail at this time.
>> +			 */
> 
> Does it mean the application has nothing to do when receiving this event?
> I think the app should stop polling at least.
> 
>> +	RTE_ETH_EVENT_RECOVERED,
>> +			/**< port recovered from an error
>> +			 *
>> +			 * PMD has recovered from the error condition.
>> +			 * Control path and Data path are up now.
>> +			 * Since the device undergone a reset, flow rules
>> +			 * offloaded prior to the reset will be lost and
>> +			 * the application has to recreate the rules again.
>> +			 */
> 
> Please be more precise.
> Should the app re-configure the port, setup the queues, start the port?
> 
>
  
Ferruh Yigit Feb. 18, 2021, 3:32 p.m. UTC | #3
On 10/12/2020 9:09 AM, Andrew Rybchenko wrote:
> On 10/12/20 12:29 AM, Thomas Monjalon wrote:
>> 09/10/2020 05:48, Kalesh A P:
>>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>>
>>> Adding support for device reset and recovery events in the
>>> rte_eth_event framework. FW error and FW reset conditions would be
>>> managed internally by PMD without needing application intervention.
>>> In such cases, PMD would need reset/recovery events to notify application
>>> that PMD is undergoing a reset.
>>>
>>> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
>>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
>>> Reviewed-by: Asaf Penso <asafp@nvidia.com>
>>
>> The ethdev maintainers are not Cc'ed.
>> Please use the option --cc-cmd devtools/get-maintainer.sh
>>
>>
>>> +Error recovery support
>>> +~~~~~~~~~~~~~~~~~~~~~~
>>> +
>>> +When the PMD detects a FW reset or error condition, it will try to recover
>>> +from the error without needing the application intervention. In such cases,
>>> +PMD would need events to notify the application that it is undergoing
>>> +an error recovery.
>>> +
>>> +The PMD will trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
>>> +application that PMD detected a FW reset or FW error condition. PMD will
>>> +try to recover from the error by itself. Data path will be halted and
>>> +control path operations would fail during the recovery period.
>>> +
>>> +The PMD will trigger RTE_ETH_EVENT_RECOVERED event to notify the application
>>> +that the it has recovered from the error condition. Control path and data path
>>> +are up now. Since the device undergone a reset, flow rules offloaded prior to
>>> +the reset will be lost and the application has to recreate the rules again.
> 
> What should be done if the state is not recoverable?
> 
>>> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
>>> index 9759f13..9b4b015 100644
>>> --- a/lib/librte_ethdev/rte_ethdev.h
>>> +++ b/lib/librte_ethdev/rte_ethdev.h
>>> @@ -3207,6 +3207,23 @@ enum rte_eth_event_type {
>>>   	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>>>   	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>>>   	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
>>> +	RTE_ETH_EVENT_ERR_RECOVERING,
>>> +			/**< port recovering from an error
>>> +			 *
>>> +			 * PMD detected a FW reset or error condition.
>>> +			 * PMD will try to recover from the error.
>>> +			 * Data path will be halted and Control path operations
>>> +			 * would fail at this time.
>>> +			 */
>>
>> Does it mean the application has nothing to do when receiving this event?
>> I think the app should stop polling at least.
>>
>>> +	RTE_ETH_EVENT_RECOVERED,
>>> +			/**< port recovered from an error
>>> +			 *
>>> +			 * PMD has recovered from the error condition.
>>> +			 * Control path and Data path are up now.
>>> +			 * Since the device undergone a reset, flow rules
>>> +			 * offloaded prior to the reset will be lost and
>>> +			 * the application has to recreate the rules again.
>>> +			 */
>>
>> Please be more precise.
>> Should the app re-configure the port, setup the queues, start the port?
>>
>>
> 

Hi Kalesh Anakkur,

The mechanics of notifying the application looks good, but the concerns seems 
more about what application should do with this information.

PMD notifies the application on the FW/HW reset and pushes some 
tasks/responsibilities to the application, but for this to be useful, these 
tasks should be clear to application.

Think yourself in a situation that you are developing an application and you 
received these events from a device that you don't know its internals, what will 
you do?

Both Thomas and Andrew put cases that needs more clarification for application. 
Can you please send a new version with those clarifications?


Thanks,
ferruh
  

Patch

diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 86e0a14..c03f0ef 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -615,3 +615,21 @@  by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+Error recovery support
+~~~~~~~~~~~~~~~~~~~~~~
+
+When the PMD detects a FW reset or error condition, it will try to recover
+from the error without needing the application intervention. In such cases,
+PMD would need events to notify the application that it is undergoing
+an error recovery.
+
+The PMD will trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
+application that PMD detected a FW reset or FW error condition. PMD will
+try to recover from the error by itself. Data path will be halted and
+control path operations would fail during the recovery period.
+
+The PMD will trigger RTE_ETH_EVENT_RECOVERED event to notify the application
+that the it has recovered from the error condition. Control path and data path
+are up now. Since the device undergone a reset, flow rules offloaded prior to
+the reset will be lost and the application has to recreate the rules again.
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 9759f13..9b4b015 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3207,6 +3207,23 @@  enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+			/**< port recovering from an error
+			 *
+			 * PMD detected a FW reset or error condition.
+			 * PMD will try to recover from the error.
+			 * Data path will be halted and Control path operations
+			 * would fail at this time.
+			 */
+	RTE_ETH_EVENT_RECOVERED,
+			/**< port recovered from an error
+			 *
+			 * PMD has recovered from the error condition.
+			 * Control path and Data path are up now.
+			 * Since the device undergone a reset, flow rules
+			 * offloaded prior to the reset will be lost and
+			 * the application has to recreate the rules again.
+			 */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };