From patchwork Thu Oct 13 12:42:24 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: fengchengwen
X-Patchwork-Id: 118159
X-Patchwork-Delegate: andrew.rybchenko@oktetlabs.ru
From: Chengwen Feng
Subject: [PATCH v13 2/5] ethdev: support proactive error handling mode
Date: Thu, 13 Oct 2022 12:42:24 +0000
Message-ID: <20221013124227.40123-3-fengchengwen@huawei.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20221013124227.40123-1-fengchengwen@huawei.com>
References: <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com>
 <20221013124227.40123-1-fengchengwen@huawei.com>
List-Id: DPDK patches and discussions

From: Kalesh AP

Some PMDs (e.g. hns3) can detect hardware or firmware errors. One error
recovery mode is to report the RTE_ETH_EVENT_INTR_RESET event and wait
for the application to invoke rte_eth_dev_reset() to recover the port.
However, this mode has the following weaknesses:
1) Due to different hardware and software designs, the recovery process
   of some NIC ports requires multiple handshakes with the firmware and
   the PF (when the port is a VF). It takes a long time to complete the
   entire operation for one port. If multiple ports (for example,
   multiple VFs of a PF) are reset at the same time, other VFs may fail
   to be reset, because the reset processing is serial: the earlier VFs
   must be processed before the later VFs.
2) The impact on the application layer is significant: it must stop the
   working queues, stop calling the Rx and Tx functions, call
   rte_eth_dev_reset(), and then set everything up again.

This patch introduces a proactive error handling mode, in which the PMD
tries to recover from the errors itself. During this process, the PMD
sets the data path pointers to dummy functions (which prevents crashes)
and makes sure that control path operations fail with return code
-EBUSY.

Because the PMD recovers automatically, the application only senses
that the data flow is disconnected for a while and that the control API
returns an error during this period.

In order to sense the error happening/recovering, three events are
introduced:
1) RTE_ETH_EVENT_ERR_RECOVERING: notifies the application that the PMD
   detected an error and that recovery is starting. Upon receiving the
   event, the application should not invoke any control path APIs until
   it receives the RTE_ETH_EVENT_RECOVERY_SUCCESS or
   RTE_ETH_EVENT_RECOVERY_FAILED event.
2) RTE_ETH_EVENT_RECOVERY_SUCCESS: notifies the application that the
   PMD recovered successfully from the error. The PMD has already
   re-configured the port, and the effect is the same as that of a
   restart operation.
3) RTE_ETH_EVENT_RECOVERY_FAILED: notifies the application that the PMD
   failed to recover from the error and that the port is no longer
   usable. The application should close the port.

Signed-off-by: Kalesh AP
Signed-off-by: Somnath Kotur
Signed-off-by: Chengwen Feng
Reviewed-by: Ajit Khaparde
Acked-by: Andrew Rybchenko
---
 app/test-pmd/config.c                   |  3 ++
 doc/guides/prog_guide/poll_mode_drv.rst | 41 +++++++++++++++++
 doc/guides/rel_notes/release_22_11.rst  | 12 +++++
 lib/ethdev/rte_ethdev.h                 | 59 +++++++++++++++++++++++++
 4 files changed, 115 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 4cddcd0bf7..0f7dbd698f 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -929,6 +929,9 @@ port_infos_display(portid_t port_id)
 	case RTE_ETH_ERROR_HANDLE_MODE_PASSIVE:
 		printf("passive\n");
 		break;
+	case RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE:
+		printf("proactive\n");
+		break;
 	default:
 		printf("unknown\n");
 		break;
diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 9d081b1cba..7a9c43d1cb 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -627,3 +627,44 @@ by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
+
+Proactive Error Handling Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``. Unlike PASSIVE
+mode, in which the application invokes the recovery, in PROACTIVE mode the PMD
+recovers from the error automatically, and only a small amount of work is
+required from the application.
+
+During error detection and automatic recovery, the PMD sets the data path
+pointers to dummy functions (which prevents crashes) and makes sure that
+control path operations fail with return code -EBUSY.
+
+Because the PMD recovers automatically, the application only senses that the
+data flow is disconnected for a while and that the control API returns an error
+during this period.
+
+In order to sense the error happening/recovering, as well as to restore some
+additional configuration, three events are introduced:
+
+* RTE_ETH_EVENT_ERR_RECOVERING: notifies the application that the PMD detected
+  an error and that recovery is starting. Upon receiving the event, the
+  application should not invoke any control path APIs until it receives the
+  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
+
+* RTE_ETH_EVENT_RECOVERY_SUCCESS: notifies the application that the PMD
+  recovered successfully from the error. The PMD has already re-configured the
+  port, and the effect is the same as that of a restart operation.
+
+* RTE_ETH_EVENT_RECOVERY_FAILED: notifies the application that the PMD failed
+  to recover from the error and that the port is no longer usable. The
+  application should close the port.
+
+.. note::
+   * Before the PMD reports the recovery result, the PMD may report the
+     ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
+     may occur during the recovery.
+   * The error handling mode supported by the PMD can be reported through
+     the ``rte_eth_dev_info_get`` API.
diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
index 2da8bc9661..a3700bbb34 100644
--- a/doc/guides/rel_notes/release_22_11.rst
+++ b/doc/guides/rel_notes/release_22_11.rst
@@ -124,6 +124,18 @@ New Features
   Added new flow action which allows application to re-route packets
   directly to the kernel without software involvement.
 
+* **Added proactive error handling mode for ethdev.**
+
+  Added proactive error handling mode for ethdev, and three events were
+  introduced:
+
+  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
+    that the port is recovering from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_SUCCESS`` for the PMD to report
+    that the port recovered successfully from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERY_FAILED`` for the PMD to report
+    that the port failed to recover from an error.
+
 * **Updated AF_XDP driver.**
 
   * Made compatible with libbpf v0.8.0 (when used with libxdp).
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 5de8e13866..46ecc9a0fe 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1700,6 +1700,12 @@ enum rte_eth_err_handle_mode {
 	 * application invoke @see rte_eth_dev_reset to recover the port.
 	 */
 	RTE_ETH_ERROR_HANDLE_MODE_PASSIVE,
+	/** Proactive error handling: after the PMD detects that a reset is
+	 * required, it reports the @see RTE_ETH_EVENT_ERR_RECOVERING event,
+	 * performs the recovery internally, and finally reports the recovery
+	 * result event (@see RTE_ETH_EVENT_RECOVERY_*).
+	 */
+	RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE,
 };
 
 /**
@@ -3886,6 +3892,59 @@ enum rte_eth_event_type {
 	 * @see rte_eth_rx_avail_thresh_set()
 	 */
 	RTE_ETH_EVENT_RX_AVAIL_THRESH,
+	/** Port recovering from a hardware or firmware error.
+	 * If the PMD supports proactive error recovery, it should trigger this
+	 * event to notify the application that it detected an error and that
+	 * recovery is starting. Upon receiving the event, the application
+	 * should not invoke any control path APIs (such as
+	 * rte_eth_dev_configure/rte_eth_dev_stop...) until receiving the
+	 * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
+	 * event.
+	 * The PMD sets the data path pointers to dummy functions, and restores
+	 * them to the non-dummy functions before reporting the
+	 * RTE_ETH_EVENT_RECOVERY_SUCCESS event.
+	 * The application cannot send or receive packets during this period.
+	 * @note Before the PMD reports the recovery result, the PMD may report
+	 * the RTE_ETH_EVENT_ERR_RECOVERING event again, because a larger error
+	 * may occur during the recovery.
+	 */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+	/** Port recovered successfully from the error.
+	 * The PMD has already re-configured the port, and the effect is the
+	 * same as that of a restart operation.
+	 * a) The following configuration is retained (alphabetically):
+	 *    - DCB configuration
+	 *    - FEC configuration
+	 *    - Flow control configuration
+	 *    - LRO configuration
+	 *    - LSC configuration
+	 *    - MTU
+	 *    - MAC address (default and those supplied by MAC address array)
+	 *    - Promiscuous and allmulticast mode
+	 *    - PTP configuration
+	 *    - Queue (Rx/Tx) settings
+	 *    - Queue statistics mappings
+	 *    - RSS configuration by rte_eth_dev_rss_xxx() family
+	 *    - Rx checksum configuration
+	 *    - Rx interrupt settings
+	 *    - Traffic management configuration
+	 *    - VLAN configuration (including filtering, tpid, strip, pvid)
+	 *    - VMDq configuration
+	 * b) The following configuration may or may not be retained, depending
+	 *    on the device capabilities:
+	 *    - flow rules
+	 *      @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
+	 *    - shared flow objects
+	 *      @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
+	 * c) Any other configuration is not stored and needs to be
+	 *    re-configured.
+	 */
+	RTE_ETH_EVENT_RECOVERY_SUCCESS,
+	/** Port failed to recover from the error.
+	 * It means that the port is no longer usable. The application
+	 * should close the port.
+	 */
+	RTE_ETH_EVENT_RECOVERY_FAILED,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
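
Illustration only, not part of the patch: the event sequence described
above implies a small state machine on the application side. The sketch
below models it in plain C so that it builds without DPDK. All names in
it (app_port, app_port_event, app_port_on_event and so on) are
hypothetical stand-ins; a real application would use
enum rte_eth_event_type from rte_ethdev.h instead of the local enum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the three event types added by this patch.
 * A real application would use enum rte_eth_event_type instead. */
enum app_port_event {
	APP_EVENT_ERR_RECOVERING,   /* mirrors RTE_ETH_EVENT_ERR_RECOVERING */
	APP_EVENT_RECOVERY_SUCCESS, /* mirrors RTE_ETH_EVENT_RECOVERY_SUCCESS */
	APP_EVENT_RECOVERY_FAILED,  /* mirrors RTE_ETH_EVENT_RECOVERY_FAILED */
};

/* Hypothetical per-port state kept by the application. */
enum app_port_state {
	APP_PORT_RUNNING,    /* control path may be used */
	APP_PORT_RECOVERING, /* PMD is recovering; avoid control path APIs */
	APP_PORT_DEAD,       /* recovery failed; port should be closed */
};

struct app_port {
	enum app_port_state state;
	bool needs_reconfig; /* re-apply configuration the PMD did not retain */
	bool needs_close;    /* application should close the port */
};

/* Update the application's view of the port when an event arrives. */
static void
app_port_on_event(struct app_port *port, enum app_port_event event)
{
	assert(port != NULL);

	switch (event) {
	case APP_EVENT_ERR_RECOVERING:
		/* May be reported more than once before the result arrives. */
		port->state = APP_PORT_RECOVERING;
		break;
	case APP_EVENT_RECOVERY_SUCCESS:
		port->state = APP_PORT_RUNNING;
		port->needs_reconfig = true;
		break;
	case APP_EVENT_RECOVERY_FAILED:
		port->state = APP_PORT_DEAD;
		port->needs_close = true;
		break;
	}
}

/* The application checks this before invoking any control path API. */
static bool
app_port_control_allowed(const struct app_port *port)
{
	assert(port != NULL);
	return port->state == APP_PORT_RUNNING;
}
```

In a real application, a handler like this would presumably be called
from a callback registered with rte_eth_dev_callback_register() for each
of the three events, and the deferred work (re-applying configuration or
closing the port) would run outside the callback.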