Message ID: 20230104073043.1120168-1-feifei.wang2@arm.com (mailing list archive)
Headers:
Return-Path: <dev-bounces@dpdk.org>
From: Feifei Wang <feifei.wang2@arm.com>
Cc: dev@dpdk.org, konstantin.v.ananyev@yandex.ru, nd@arm.com,
 Feifei Wang <feifei.wang2@arm.com>
Subject: [PATCH v3 0/3] Direct re-arming of buffers on receive side
Date: Wed, 4 Jan 2023 15:30:40 +0800
Message-Id: <20230104073043.1120168-1-feifei.wang2@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220420081650.2043183-1-feifei.wang2@arm.com>
References: <20220420081650.2043183-1-feifei.wang2@arm.com>
List-Id: DPDK patches and discussions <dev.dpdk.org>
Series: Direct re-arming of buffers on receive side
Message
Feifei Wang
Jan. 4, 2023, 7:30 a.m. UTC
Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers, resulting in 32*8=256B of stores to the
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of loads from the lcore cache.

This patch proposes a mechanism to avoid freeing to/allocating from the
lcore cache, i.e. the receive side will free the buffers from the
transmit side directly into its software ring. This avoids the 256B of
loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache.

However, this solution poses several constraints:

1) The receive queue needs to know which transmit queue it should take
the buffers from. The application logic decides which transmit port to
use to send out the packets. In many use cases the NIC might have a
single port ([1], [2], [3]), in which case a given transmit queue is
always mapped to a single receive queue (1:1 Rx queue : Tx queue). This
is easy to configure.

If the NIC has 2 ports (there are several references), then we will have
a 1:2 (Rx queue : Tx queue) mapping, which is still easy to configure.
However, if this is generalized to 'N' ports, the configuration can be
long. Moreover, the PMD would have to scan a list of transmit queues to
pull the buffers from.

2) The other factor that needs to be considered is the
'run-to-completion' vs 'pipeline' model. In the run-to-completion model,
the receive side and the transmit side run on the same lcore serially.
In the pipeline model, the receive side and the transmit side might run
on different lcores in parallel. This requires locking, which is not
supported at this point.

3) Tx and Rx buffers must be from the same mempool, and the number of Tx
buffers freed must equal the number of Rx buffers re-armed. Thus,
'tx_next_dd' can be updated correctly in direct-rearm mode.
This is because 'tx_next_dd' is a variable used to compute the Tx
sw-ring free location. Its value will be one round ahead of the position
where the next free starts.

Current status in this patch:

1) Two APIs are added for users to enable direct-rearm mode:

In the control plane, users can call 'rte_eth_rx_queue_rearm_data_get'
to get the Rx sw_ring pointer and its rxq_info (this avoids the Tx side
loading Rx data directly).

In the data plane, users can call 'rte_eth_dev_direct_rearm' to re-arm
Rx buffers and free Tx buffers at the same time. Specifically, inside
this API there are two separate APIs for Rx and Tx:
For Tx, 'rte_eth_tx_fill_sw_ring' can fill a given sw_ring with Tx freed
buffers.
For Rx, 'rte_eth_rx_flush_descriptor' can flush its descriptors based on
the re-armed buffers.
This separates the Rx and Tx operations, and the user can even re-arm an
Rx queue not from the same driver's Tx queue, but from different sources
too.

-----------------------------------------------------------------------
control plane:
	rte_eth_rx_queue_rearm_data_get(*rxq_rearm_data);
data plane:
	loop {
		rte_eth_dev_direct_rearm(*rxq_rearm_data) {
			rte_eth_tx_fill_sw_ring {
				for (i = 0; i < 32; i++) {
					sw_ring.mbuf[i] = tx.mbuf[i];
				}
			}
			rte_eth_rx_flush_descriptor {
				for (i = 0; i < 32; i++) {
					flush descs[i];
				}
			}
		}
		rte_eth_rx_burst;
		rte_eth_tx_burst;
	}
-----------------------------------------------------------------------

2) The i40e driver is changed to do the direct re-arm of the receive
side.
3) The ixgbe driver is changed to do the direct re-arm of the receive
side.
Testing status:
(1) dpdk l3fwd test with multiple drivers:
    port 0: 82599 NIC   port 1: XL710 NIC
    -------------------------------------------------------------
                    Without fast free   With fast free
    Thunderx2:      +9.44%              +7.14%
    -------------------------------------------------------------

(2) dpdk l3fwd test with same driver:
    port 0 && 1: XL710 NIC
    -------------------------------------------------------------
    *Direct rearm with exposing rx_sw_ring:
                    Without fast free   With fast free
    Ampere altra:   +14.98%             +15.77%
    n1sdp:          +6.47%              +0.52%
    -------------------------------------------------------------

(3) VPP test with same driver:
    port 0 && 1: XL710 NIC
    -------------------------------------------------------------
    *Direct rearm with exposing rx_sw_ring:
    Ampere altra:   +4.59%
    n1sdp:          +5.4%
    -------------------------------------------------------------

Reference:
[1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
[2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
[3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g

V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)

V3:
1. Separate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
2. Delete l3fwd change for direct rearm (Jerin)
3. Enable direct rearm in ixgbe driver on Arm

Feifei Wang (3):
  ethdev: enable direct rearm with separate API
  net/i40e: enable direct rearm with separate API
  net/ixgbe: enable direct rearm with separate API

 drivers/net/i40e/i40e_ethdev.c            |   1 +
 drivers/net/i40e/i40e_ethdev.h            |   2 +
 drivers/net/i40e/i40e_rxtx.c              |  19 +++
 drivers/net/i40e/i40e_rxtx.h              |   4 +
 drivers/net/i40e/i40e_rxtx_vec_common.h   |  54 +++++++
 drivers/net/i40e/i40e_rxtx_vec_neon.c     |  42 ++++++
 drivers/net/ixgbe/ixgbe_ethdev.c          |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h          |   3 +
 drivers/net/ixgbe/ixgbe_rxtx.c            |  19 +++
 drivers/net/ixgbe/ixgbe_rxtx.h            |   4 +
 drivers/net/ixgbe/ixgbe_rxtx_vec_common.h |  48 ++++++
 drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c   |  52 +++++++
 lib/ethdev/ethdev_driver.h                |  10 ++
 lib/ethdev/ethdev_private.c               |   2 +
 lib/ethdev/rte_ethdev.c                   |  52 +++++++
 lib/ethdev/rte_ethdev.h                   | 174 ++++++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h              |  11 ++
 lib/ethdev/version.map                    |   6 +
 18 files changed, 504 insertions(+)
Comments
+ping Konstantin,

Would you please give some comments on this patch series?
Thanks very much.

Best Regards
Feifei
Hi Feifei,

> +ping Konstantin,
>
> Would you please give some comments on this patch series?
> Thanks very much.

Sure, will have a look in the next few days.
Apologies for the delay.
That's all right. Thanks very much for your attention~

> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Sent: Wednesday, February 1, 2023 9:11 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>
> Subject: Re: RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
>
> Hi Feifei,
>
> > +ping Konstantin,
> >
> > Would you please give some comments on this patch series?
> > Thanks very much.
>
> Sure, will have a look in the next few days.
> Apologies for the delay.
> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Wednesday, 4 January 2023 08.31
>
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
>
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into its software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache.

I am starting to wonder if we have been adding unnecessary feature creep
in order to make this feature too generic.

Could you please describe some of the most important high-volume use
cases from real life? It would help setting the scope correctly.

> [...]
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, March 22, 2023 7:57 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> konstantin.ananyev@huawei.com; Yuying Zhang <Yuying.Zhang@intel.com>;
> Beilei Xing <beilei.xing@intel.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
>
> > [...]
>
> I am starting to wonder if we have been adding unnecessary feature creep
> in order to make this feature too generic.

Can you please elaborate on the feature creep you are thinking of? The
features have been the same since the first implementation, but it is
made more generic.

> Could you please describe some of the most important high-volume use
> cases from real life? It would help setting the scope correctly.

The use cases have been discussed several times already.

> [...]
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 22 March 2023 14.42
>
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Wednesday, March 22, 2023 7:57 AM
> >
> > [...]
> >
> > I am starting to wonder if we have been adding unnecessary feature
> > creep in order to make this feature too generic.
> Can you please elaborate on the feature creep you are thinking of? The
> features have been the same since the first implementation, but it is
> made more generic.

Maybe not "features" as such; but the API has evolved, and perhaps we
could simplify both the API and the implementation if we narrowed the
scope.

I'm not saying that what we have is bad or too complex; I'm only asking
to consider if there are opportunities to take a step back and simplify
some things.

> > Could you please describe some of the most important high-volume use
> > cases from real life? It would help setting the scope correctly.
> The use cases have been discussed several times already.

Yes, but they should be mentioned in the patch cover letter - and later
on in the documentation. It will help limit the scope while developing
this feature. And it will make it easier for application developers to
relate to the feature and determine if it is relevant for their
application.

> [...]