[v1,0/5] Direct re-arming of buffers on receive side

Message ID 20220420081650.2043183-1-feifei.wang2@arm.com (mailing list archive)
Message

Feifei Wang April 20, 2022, 8:16 a.m. UTC
  Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of loads from the lcore cache.

This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache, i.e. the receive side takes the buffers freed by the
transmit side directly into its software ring. This avoids the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache.
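
As a rough illustration of the traffic described above, here is a small
self-contained C model (not DPDK code). The types 'struct lcore_cache'
and 'struct traffic' and both functions are hypothetical stand-ins; the
only figures taken from the text are the burst of 32 buffers and the
8-byte pointer size on 64-bit targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BURST 32	/* buffers freed/allocated per burst, as in the text */

struct mbuf;		/* opaque stand-in for struct rte_mbuf */

/* Hypothetical stand-in for the per-lcore mempool cache. */
struct lcore_cache {
	struct mbuf *objs[BURST];
	size_t len;
};

/* Bytes of buffer-pointer traffic generated by one re-arm of BURST buffers. */
struct traffic {
	size_t cache_store_bytes;
	size_t cache_load_bytes;
	size_t ring_store_bytes;
};

/* Baseline: Tx frees into the lcore cache, then Rx allocates from it. */
static struct traffic
rearm_via_cache(struct mbuf **tx_sw_ring, struct mbuf **rx_sw_ring,
		struct lcore_cache *cache)
{
	struct traffic t = {0};

	/* Tx side: store BURST pointers into the lcore cache. */
	memcpy(cache->objs, tx_sw_ring, BURST * sizeof(*tx_sw_ring));
	cache->len = BURST;
	t.cache_store_bytes = BURST * sizeof(*tx_sw_ring);

	/* Rx side: load them back and store them into the Rx software ring. */
	memcpy(rx_sw_ring, cache->objs, BURST * sizeof(*rx_sw_ring));
	cache->len = 0;
	t.cache_load_bytes = BURST * sizeof(*rx_sw_ring);
	t.ring_store_bytes = BURST * sizeof(*rx_sw_ring);
	return t;
}

/* Direct re-arm: Tx buffers go straight into the Rx software ring. */
static struct traffic
rearm_direct(struct mbuf **tx_sw_ring, struct mbuf **rx_sw_ring)
{
	struct traffic t = {0};

	memcpy(rx_sw_ring, tx_sw_ring, BURST * sizeof(*rx_sw_ring));
	t.ring_store_bytes = BURST * sizeof(*rx_sw_ring);
	return t;
}
```

With 8-byte pointers the baseline reports 256B of cache stores and 256B
of cache loads per burst, while the direct path reports none; only the
stores into the Rx software ring remain in both cases.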

However, this solution poses several constraints:

1) The receive queue needs to know which transmit queue it should take
the buffers from. The application logic decides which transmit port to
use to send out the packets. In many use cases the NIC might have a
single port ([1], [2], [3]), in which case a given transmit queue is
always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
is easy to configure.

If the NIC has 2 ports (there are several references), then we will have
a 1:2 (Rx queue: Tx queue) mapping, which is still easy to configure.
However, if this is generalized to 'N' ports, the configuration can be
long. Moreover, the PMD would have to scan a list of transmit queues to
pull the buffers from.

2) The other factor that needs to be considered is the 'run-to-completion'
vs 'pipeline' model. In the run-to-completion model, the receive side and
the transmit side run on the same lcore serially. In the pipeline
model, the receive side and transmit side might run on different
lcores in parallel. This requires locking, which is not supported at this
point.

3) Tx and Rx buffers must come from the same mempool. We must also
ensure that the number of buffers freed on the Tx side equals the number
of buffers re-armed on the Rx side:
(txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
so that 'tx_next_dd' is updated correctly in direct-rearm mode. This is
because tx_next_dd is the variable used to compute the free location in
the Tx sw-ring. Its value is one round ahead of the position where the
next free starts.
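
A minimal sketch of how constraint 3 might be checked, written as a
standalone C model rather than driver code. 'struct txq_model',
'struct rxq_model', 'direct_rearm_allowed' and 'txq_advance_after_free'
are hypothetical names, and REARM_THRESH = 32 is an assumed value
standing in for RTE_I40E_RXQ_REARM_THRESH. The wrap rule for tx_next_dd
is also an assumption about how the driver advances it, not something
stated in this cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REARM_THRESH 32	/* assumed stand-in for RTE_I40E_RXQ_REARM_THRESH */

struct mempool;		/* opaque stand-in for struct rte_mempool */

/* Hypothetical, reduced views of a Tx and an Rx queue. */
struct txq_model {
	uint16_t nb_tx_desc;		/* ring size */
	uint16_t tx_rs_thresh;		/* buffers freed per round */
	uint16_t tx_next_dd;		/* descriptor checked for completion */
	const struct mempool *buf_pool;	/* pool the Tx buffers came from */
};

struct rxq_model {
	const struct mempool *mp;	/* pool the Rx queue is re-armed from */
};

/* Both conditions from constraint 3 must hold. */
static bool
direct_rearm_allowed(const struct txq_model *txq, const struct rxq_model *rxq)
{
	if (txq->buf_pool != rxq->mp)
		return false;
	return txq->tx_rs_thresh == REARM_THRESH;
}

/*
 * Advance tx_next_dd by one round after a batch was handed to Rx.
 * Assumed rule: step by tx_rs_thresh and wrap back to the last
 * descriptor of the first round.
 */
static void
txq_advance_after_free(struct txq_model *txq)
{
	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
	if (txq->tx_next_dd >= txq->nb_tx_desc)
		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
}
```

If the two thresholds differed, the Rx side would consume a different
number of buffers than the step applied to tx_next_dd, which is why the
check rejects such a pairing in this model.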

Current status in this RFC:
1) An API is added to allow for mapping a Tx queue to an Rx queue.
   Currently it supports 1:1 mapping.
2) The i40e driver is changed to do the direct re-arm of the receive
   side.
3) The L3fwd application is modified to do the direct-rearm mapping
   automatically without user configuration. The rule is that the
   thread maps a Tx queue to an Rx queue based on the destination port
   of the first received packet.
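
The automatic mapping rule in item 3 could look roughly like the
following standalone C model. This is only a sketch: 'struct rearm_map',
'rearm_map_init' and 'rearm_map_learn' are hypothetical names and do not
correspond to the API added by this series or to the l3fwd change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_RXQ		8		/* hypothetical table size */
#define UNMAPPED	UINT16_MAX	/* marker: no Tx queue chosen yet */

/* Hypothetical per-lcore table: which Tx port/queue feeds each Rx queue. */
struct rearm_map {
	uint16_t tx_port[MAX_RXQ];
	uint16_t tx_queue[MAX_RXQ];
};

static void
rearm_map_init(struct rearm_map *m)
{
	for (int i = 0; i < MAX_RXQ; i++) {
		m->tx_port[i] = UNMAPPED;
		m->tx_queue[i] = UNMAPPED;
	}
}

/*
 * Called with the destination port of a received packet. Only the first
 * packet seen on an Rx queue establishes the 1:1 mapping; later packets
 * leave it unchanged. Returns true when a new mapping was created.
 */
static bool
rearm_map_learn(struct rearm_map *m, uint16_t rx_queue,
		uint16_t dst_port, uint16_t tx_queue)
{
	assert(rx_queue < MAX_RXQ);
	if (m->tx_port[rx_queue] != UNMAPPED)
		return false;
	m->tx_port[rx_queue] = dst_port;
	m->tx_queue[rx_queue] = tx_queue;
	return true;
}
```

In this model a later packet with a different destination port does not
change the mapping, which matches the 1:1 restriction stated above.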

Testing status:
1. The testing results for L3fwd are as follows:
-------------------------------------------------------------------
enabled direct rearm
-------------------------------------------------------------------
Arm:
N1SDP(neon path):
without fast-free mode		with fast-free mode
	+14.1%				+7.0%

Ampere Altra(neon path):
without fast-free mode		with fast-free mode
	+17.1%				+14.0%

X86:
Dell-8268(limit frequency):
sse path:
without fast-free mode		with fast-free mode
	+6.96%				+2.02%
avx2 path:
without fast-free mode		with fast-free mode
	+9.04%				+7.75%
avx512 path:
without fast-free mode		with fast-free mode
	+5.43%				+1.57%
-------------------------------------------------------------------
This patch does not affect the base performance of normal mode.
Furthermore, the CPU frequency is limited on the Dell-8268 because
it hits the i40e NIC bottleneck at maximum
frequency.

2. The testing results for VPP-L3fwd are as follows:
-------------------------------------------------------------------
Arm:
N1SDP(neon path):
with direct re-arm mode enabled
	+7.0%
-------------------------------------------------------------------
For Ampere Altra and X86, the VPP-L3fwd test has not been done.

Reference:
[1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
[2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
[3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g

Feifei Wang (5):
  net/i40e: remove redundant Dtype initialization
  net/i40e: enable direct rearm mode
  ethdev: add API for direct rearm mode
  net/i40e: add direct rearm mode internal API
  examples/l3fwd: enable direct rearm mode

 drivers/net/i40e/i40e_ethdev.c          |  34 +++
 drivers/net/i40e/i40e_rxtx.c            |   4 -
 drivers/net/i40e/i40e_rxtx.h            |   4 +
 drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx_vec_avx2.c   |  14 +-
 drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
 drivers/net/i40e/i40e_rxtx_vec_neon.c   | 141 ++++++++++++-
 drivers/net/i40e/i40e_rxtx_vec_sse.c    | 170 ++++++++++++++-
 examples/l3fwd/l3fwd_lpm.c              |  16 +-
 lib/ethdev/ethdev_driver.h              |  15 ++
 lib/ethdev/rte_ethdev.c                 |  14 ++
 lib/ethdev/rte_ethdev.h                 |  31 +++
 lib/ethdev/version.map                  |   1 +
 13 files changed, 949 insertions(+), 13 deletions(-)
  

Comments

Feifei Wang Aug. 22, 2023, 7:33 a.m. UTC | #1
Hi, Ferruh

Would you please give some comments on these patches?
If there are no comments, could mbufs recycle mode be merged into the dpdk-next branch?
Thanks very much.

Best Regards
Feifei

> -----Original Message-----
> From: Feifei Wang <feifei.wang2@arm.com>
> Sent: Tuesday, August 22, 2023 3:27 PM
> Cc: dev@dpdk.org; nd <nd@arm.com>; Feifei Wang
> <Feifei.Wang2@arm.com>
> Subject: [PATCH v11 0/4] Recycle mbufs from Tx queue into Rx queue
> 
>   Currently, the transmit side frees the buffers into the lcore cache and the
> receive side allocates buffers from the lcore cache. The transmit side typically
> frees 32 buffers resulting in 32*8=256B of stores to lcore cache. The receive
> side allocates 32 buffers and stores them in the receive side software ring,
> resulting in 32*8=256B of stores and 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from the lcore
> cache. i.e. the receive side will free the buffers from transmit side directly into
> its software ring. This will avoid the 256B of loads and stores introduced by
> the lcore cache. It also frees up the cache lines used by the lcore cache. And we
> can call this mode as mbufs recycle mode.
> 
> In the latest version, mbufs recycle mode is packaged as a separate API.
> This allows for the users to change rxq/txq pairing in real time in data plane,
> according to the analysis of the packet flow by the application, for example:
> -----------------------------------------------------------------------
> Step 1: upper application analyse the flow direction
> Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid)
> Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid, recycle_rxq_info);
> Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
> Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
> -----------------------------------------------------------------------
> Above can support user to change rxq/txq pairing  at run-time and user does
> not need to know the direction of flow in advance. This can effectively expand
> mbufs recycle mode's use scenarios.
> 
> Furthermore, mbufs recycle mode is no longer limited to the same pmd, it can
> support moving mbufs between different vendor pmds, even can put the
> mbufs anywhere into your Rx mbuf ring as long as the address of the mbuf
> ring can be provided.
> In the latest version, we enable mbufs recycle mode in i40e pmd and ixgbe
> pmd, and also try to use i40e driver in Rx, ixgbe driver in Tx, and then achieve
> 7-9% performance improvement by mbufs recycle mode.
> 
> Difference between mbuf recycle, ZC API used in mempool and general path
> For general path:
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
> For ZC API used in mempool:
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
>                 Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
> For mbufs recycle:
>                 Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
> Thus we can see in the one loop, compared to general path, mbufs recycle mode reduces 32+32=64 pkts memcpy;
> Compared to ZC API used in mempool, we can see mbufs recycle mode reduce 32 pkts memcpy in each loop.
> So, mbufs recycle has its own benefits.
> 
> Testing status:
> (1) dpdk l3fwd test with multiple drivers:
>     port 0: 82599 NIC   port 1: XL710 NIC
> -------------------------------------------------------------
> 		Without fast free	With fast free
> Thunderx2:      +7.53%	                +13.54%
> -------------------------------------------------------------
> 
> (2) dpdk l3fwd test with same driver:
>     port 0 && 1: XL710 NIC
> -------------------------------------------------------------
> 		Without fast free	With fast free
> Ampere altra:   +12.61%		        +11.42%
> n1sdp:		+8.30%			+3.85%
> x86-sse:	+8.43%			+3.72%
> -------------------------------------------------------------
> 
> (3) Performance comparison with ZC_mempool used
>     port 0 && 1: XL710 NIC
>     with fast free
> -------------------------------------------------------------
> 		With recycle buffer	With zc_mempool
> Ampere altra:	11.42%			3.54%
> -------------------------------------------------------------
> 
> Furthermore, we add recycle_mbuf engine in testpmd. Due to XL710 NIC has
> I/O bottleneck in testpmd in ampere altra, we can not see throughput change
> compared with I/O fwd engine. However, using record cmd in testpmd:
> '$set record-burst-stats on'
> we can see the ratio of 'Rx/Tx burst size of 32' is reduced. This indicate mbufs
> recycle can save CPU cycles.
> 
> V2:
> 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
> 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
> 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
> 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
> 
> V3:
> 1. Separate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
> 2. Delete L3fwd change for direct rearm (Jerin)
> 3. enable direct rearm in ixgbe driver in Arm
> 
> v4:
> 1. Rename direct-rearm as buffer recycle. Based on this, function name and
> variable name are changed to let this mode more general for all drivers.
> (Konstantin, Morten)
> 2. Add ring wrapping check (Konstantin)
> 
> v5:
> 1. some change for ethdev API (Morten)
> 2. add support for avx2, sse, altivec path
> 
> v6:
> 1. fix ixgbe build issue in ppc
> 2. remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
>    API wrapper (Tech Board meeting)
> 3. add recycle_mbufs engine in testpmd (Tech Board meeting)
> 4. add namespace in the functions related to mbufs recycle(Ferruh)
> 
> v7:
> 1. move 'rxq/txq data' pointers to the beginning of eth_dev structure, in order
> to keep them in the same cache line as rx/tx_burst function pointers (Morten)
> 2. add the extra description for 'rte_eth_recycle_mbufs' to show it can support
> feeding 1 Rx queue from 2 Tx queues in the same thread
> (Konstantin)
> 3. For i40e/ixgbe driver, make the previous copied buffers as invalid if there are
> Tx buffers refcnt > 1 or from unexpected mempool (Konstantin)
> 4. add check for the return value of 'rte_eth_recycle_rx_queue_info_get'
> in testpmd fwd engine (Morten)
> 
> v8:
> 1. add arm/x86 build option to fix ixgbe build issue in ppc
> 
> v9:
> 1. delete duplicate file name for ixgbe
> 
> v10:
> 1. fix compile issue on windows
> 
> v11:
> 1. fix doc warning
> 
> Feifei Wang (4):
>   ethdev: add API for mbufs recycle mode
>   net/i40e: implement mbufs recycle mode
>   net/ixgbe: implement mbufs recycle mode
>   app/testpmd: add recycle mbufs engine
> 
>  app/test-pmd/meson.build                      |   1 +
>  app/test-pmd/recycle_mbufs.c                  |  58 ++++++
>  app/test-pmd/testpmd.c                        |   1 +
>  app/test-pmd/testpmd.h                        |   3 +
>  doc/guides/rel_notes/release_23_11.rst        |  15 ++
>  doc/guides/testpmd_app_ug/run_app.rst         |   1 +
>  doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
>  drivers/net/i40e/i40e_ethdev.c                |   1 +
>  drivers/net/i40e/i40e_ethdev.h                |   2 +
>  .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 147 ++++++++++++++
>  drivers/net/i40e/i40e_rxtx.c                  |  32 ++++
>  drivers/net/i40e/i40e_rxtx.h                  |   4 +
>  drivers/net/i40e/meson.build                  |   1 +
>  drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
>  drivers/net/ixgbe/ixgbe_ethdev.h              |   3 +
>  .../ixgbe/ixgbe_recycle_mbufs_vec_common.c    | 143 ++++++++++++++
>  drivers/net/ixgbe/ixgbe_rxtx.c                |  37 +++-
>  drivers/net/ixgbe/ixgbe_rxtx.h                |   4 +
>  drivers/net/ixgbe/meson.build                 |   2 +
>  lib/ethdev/ethdev_driver.h                    |  10 +
>  lib/ethdev/ethdev_private.c                   |   2 +
>  lib/ethdev/rte_ethdev.c                       |  31 +++
>  lib/ethdev/rte_ethdev.h                       | 181 ++++++++++++++++++
>  lib/ethdev/rte_ethdev_core.h                  |  23 ++-
>  lib/ethdev/version.map                        |   3 +
>  25 files changed, 702 insertions(+), 9 deletions(-)
>  create mode 100644 app/test-pmd/recycle_mbufs.c
>  create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
>  create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
> 
> --
> 2.25.1
  
Stephen Hemminger Aug. 22, 2023, 1:59 p.m. UTC | #2
On Tue, 22 Aug 2023 15:27:06 +0800
Feifei Wang <feifei.wang2@arm.com> wrote:

>   Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into its software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache. And we can call this mode as mbufs
> recycle mode.

Isn't the recycle ring just another cache? Why is the lcore cache slower?
Could we fix the general case there?
  
Feifei Wang Aug. 24, 2023, 3:11 a.m. UTC | #3
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, August 22, 2023 9:59 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>
> Subject: Re: [PATCH v11 0/4] Recycle mbufs from Tx queue into Rx queue
> 
> On Tue, 22 Aug 2023 15:27:06 +0800
> Feifei Wang <feifei.wang2@arm.com> wrote:
> 
> >   Currently, the transmit side frees the buffers into the lcore cache
> > and the receive side allocates buffers from the lcore cache. The
> > transmit side typically frees 32 buffers resulting in 32*8=256B of
> > stores to lcore cache. The receive side allocates 32 buffers and
> > stores them in the receive side software ring, resulting in 32*8=256B
> > of stores and 256B of load from the lcore cache.
> >
> > This patch proposes a mechanism to avoid freeing to/allocating from
> > the lcore cache. i.e. the receive side will free the buffers from
> > transmit side directly into its software ring. This will avoid the
> > 256B of loads and stores introduced by the lcore cache. It also frees
> > up the cache lines used by the lcore cache. And we can call this mode
> > as mbufs recycle mode.
> 
> Isn't the recycle ring just another cache? Why is the lcore cache slower?
> Could we fix the general case there?

Here, lcore cache means the mempool's per-lcore cache: mp->local_cache[lcore_id].
For each buffer allocated from or freed into the mempool, the thread first tries to
copy it from or into the lcore cache.

We are not saying the lcore cache is slower; we mean that the memory copies from and into the
lcore cache cost many CPU cycles, and mbufs recycle can bypass these memory copies.

For the generic case, we tried to use zero copy as an optimization, but the performance is worse than mbufs recycle:

For general path: 
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For ZC API used in mempool:
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
                Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For mbufs recycle:
                Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
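
To make the comparison above concrete, the per-loop pointer copies can
be tallied with a tiny C model. This is only a worked count of the
figures listed above, assuming a burst of 32; 'enum path' and
'ptr_copies_per_loop' are hypothetical names, not part of any DPDK API.

```c
#include <assert.h>

#define BURST 32	/* packets per burst, as in the comparison above */

enum path { PATH_GENERAL, PATH_ZC_MEMPOOL, PATH_RECYCLE };

/* Number of mbuf pointer copies per Rx+Tx loop for each path. */
static unsigned int
ptr_copies_per_loop(enum path p)
{
	switch (p) {
	case PATH_GENERAL:
		/* Rx: cache -> rx_sw_ring; Tx: tx_sw_ring -> temp -> cache */
		return BURST + BURST + BURST;
	case PATH_ZC_MEMPOOL:
		/* Rx: cache -> rx_sw_ring; Tx: tx_sw_ring -> zero-copy cache */
		return BURST + BURST;
	case PATH_RECYCLE:
		/* Rx/Tx: tx_sw_ring -> rx_sw_ring */
		return BURST;
	}
	return 0;
}
```

Under this count the general path does 96 copies, the ZC mempool path 64
and mbufs recycle 32, so recycle saves 64 copies per loop against the
general path and 32 against the ZC mempool path.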
  
Ferruh Yigit Sept. 20, 2023, 1:12 p.m. UTC | #4
On 8/24/2023 8:36 AM, Feifei Wang wrote:
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into its software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache. And we can call this mode as mbufs
> recycle mode.
> 
> In the latest version, mbufs recycle mode is packaged as a separate API. 
> This allows for the users to change rxq/txq pairing in real time in data plane,
> according to the analysis of the packet flow by the application, for example:
> -----------------------------------------------------------------------
> Step 1: upper application analyse the flow direction
> Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid)
> Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid, recycle_rxq_info);
> Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
> Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
> -----------------------------------------------------------------------
> Above can support user to change rxq/txq pairing  at run-time and user does not need to
> know the direction of flow in advance. This can effectively expand mbufs recycle mode's
> use scenarios.
> 
> Furthermore, mbufs recycle mode is no longer limited to the same pmd,
> it can support moving mbufs between different vendor pmds, even can put the mbufs
> anywhere into your Rx mbuf ring as long as the address of the mbuf ring can be provided.
> In the latest version, we enable mbufs recycle mode in i40e pmd and ixgbe pmd, and also try to
> use i40e driver in Rx, ixgbe driver in Tx, and then achieve 7-9% performance improvement
> by mbufs recycle mode.
> 
> Difference between mbuf recycle, ZC API used in mempool and general path
> For general path: 
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
> For ZC API used in mempool:
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
>                 Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
> For mbufs recycle:
>                 Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
> Thus we can see in the one loop, compared to general path, mbufs recycle mode reduces 32+32=64 pkts memcpy;
> Compared to ZC API used in mempool, we can see mbufs recycle mode reduce 32 pkts memcpy in each loop.
> So, mbufs recycle has its own benefits.
> 
> Testing status:
> (1) dpdk l3fwd test with multiple drivers:
>     port 0: 82599 NIC   port 1: XL710 NIC
> -------------------------------------------------------------
> 		Without fast free	With fast free
> Thunderx2:      +7.53%	                +13.54%
> -------------------------------------------------------------
> 
> (2) dpdk l3fwd test with same driver:
>     port 0 && 1: XL710 NIC
> -------------------------------------------------------------
> 		Without fast free	With fast free
> Ampere altra:   +12.61%		        +11.42%
> n1sdp:		+8.30%			+3.85%
> x86-sse:	+8.43%			+3.72%
> -------------------------------------------------------------
> 
> (3) Performance comparison with ZC_mempool used
>     port 0 && 1: XL710 NIC
>     with fast free
> -------------------------------------------------------------
> 		With recycle buffer	With zc_mempool
> Ampere altra:	11.42%			3.54%
> -------------------------------------------------------------
> 
> Furthermore, we add recycle_mbuf engine in testpmd. Due to XL710 NIC has
> I/O bottleneck in testpmd in ampere altra, we can not see throughput change
> compared with I/O fwd engine. However, using record cmd in testpmd:
> '$set record-burst-stats on'
> we can see the ratio of 'Rx/Tx burst size of 32' is reduced. This
> indicate mbufs recycle can save CPU cycles.
> 
> V2:
> 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
> 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
> 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
> 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
> 
> V3:
> 1. Separate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
> 2. Delete L3fwd change for direct rearm (Jerin)
> 3. enable direct rearm in ixgbe driver in Arm
> 
> v4:
> 1. Rename direct-rearm as buffer recycle. Based on this, function name
> and variable name are changed to let this mode more general for all
> drivers. (Konstantin, Morten)
> 2. Add ring wrapping check (Konstantin)
> 
> v5:
> 1. some change for ethdev API (Morten)
> 2. add support for avx2, sse, altivec path
> 
> v6:
> 1. fix ixgbe build issue in ppc
> 2. remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
>    API wrapper (Tech Board meeting)
> 3. add recycle_mbufs engine in testpmd (Tech Board meeting)
> 4. add namespace in the functions related to mbufs recycle(Ferruh)
> 
> v7:
> 1. move 'rxq/txq data' pointers to the beginning of eth_dev structure,
> in order to keep them in the same cache line as rx/tx_burst function
> pointers (Morten)
> 2. add the extra description for 'rte_eth_recycle_mbufs' to show it can
> support feeding 1 Rx queue from 2 Tx queues in the same thread
> (Konstantin)
> 3. For i40e/ixgbe driver, make the previous copied buffers as invalid if
> there are Tx buffers refcnt > 1 or from unexpected mempool (Konstantin)
> 4. add check for the return value of 'rte_eth_recycle_rx_queue_info_get'
> in testpmd fwd engine (Morten)
> 
> v8:
> 1. add arm/x86 build option to fix ixgbe build issue in ppc
> 
> v9:
> 1. delete duplicate file name for ixgbe
> 
> v10:
> 1. fix compile issue on windows
> 
> v11:
> 1. fix doc warning
> 
> v12:
> 1. replace rx queue check code with eth_dev_validate_rx_queue
> function (Stephen)
> 2. put port and queue check before function call (Konstantin)
> 
> Feifei Wang (4):
>   ethdev: add API for mbufs recycle mode
>   net/i40e: implement mbufs recycle mode
>   net/ixgbe: implement mbufs recycle mode
>   app/testpmd: add recycle mbufs engine
> 

Thanks for your dedication to improving the patchset and finding a better
solution; it is appreciated.

Series applied to dpdk-next-net/main, thanks.
  
Ferruh Yigit Sept. 22, 2023, 3:30 p.m. UTC | #5
On 9/20/2023 2:12 PM, Ferruh Yigit wrote:
> On 8/24/2023 8:36 AM, Feifei Wang wrote:
>> Currently, the transmit side frees the buffers into the lcore cache and
>> the receive side allocates buffers from the lcore cache. The transmit
>> side typically frees 32 buffers resulting in 32*8=256B of stores to
>> lcore cache. The receive side allocates 32 buffers and stores them in
>> the receive side software ring, resulting in 32*8=256B of stores and
>> 256B of load from the lcore cache.
>>
>> This patch proposes a mechanism to avoid freeing to/allocating from
>> the lcore cache. i.e. the receive side will free the buffers from
>> transmit side directly into its software ring. This will avoid the 256B
>> of loads and stores introduced by the lcore cache. It also frees up the
>> cache lines used by the lcore cache. And we can call this mode as mbufs
>> recycle mode.
>>
>> In the latest version, mbufs recycle mode is packaged as a separate API. 
>> This allows for the users to change rxq/txq pairing in real time in data plane,
>> according to the analysis of the packet flow by the application, for example:
>> -----------------------------------------------------------------------
>> Step 1: upper application analyse the flow direction
>> Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid)
>> Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid, recycle_rxq_info);
>> Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
>> Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
>> -----------------------------------------------------------------------
>> Above can support user to change rxq/txq pairing  at run-time and user does not need to
>> know the direction of flow in advance. This can effectively expand mbufs recycle mode's
>> use scenarios.
>>
>> Furthermore, mbufs recycle mode is no longer limited to the same pmd,
>> it can support moving mbufs between different vendor pmds, even can put the mbufs
>> anywhere into your Rx mbuf ring as long as the address of the mbuf ring can be provided.
>> In the latest version, we enable mbufs recycle mode in i40e pmd and ixgbe pmd, and also try to
>> use i40e driver in Rx, ixgbe driver in Tx, and then achieve 7-9% performance improvement
>> by mbufs recycle mode.
>>
>> Difference between mbuf recycle, ZC API used in mempool and general path
>> For general path: 
>>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>>                 Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
>> For ZC API used in mempool:
>>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>>                 Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
>>                 Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
>> For mbufs recycle:
>>                 Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
>> Thus we can see in the one loop, compared to general path, mbufs recycle mode reduces 32+32=64 pkts memcpy;
>> Compared to ZC API used in mempool, we can see mbufs recycle mode reduce 32 pkts memcpy in each loop.
>> So, mbufs recycle has its own benefits.
>>
>> Testing status:
>> (1) dpdk l3fwd test with multiple drivers:
>>     port 0: 82599 NIC   port 1: XL710 NIC
>> -------------------------------------------------------------
>> 		Without fast free	With fast free
>> Thunderx2:      +7.53%	                +13.54%
>> -------------------------------------------------------------
>>
>> (2) dpdk l3fwd test with same driver:
>>     port 0 && 1: XL710 NIC
>> -------------------------------------------------------------
>> 		Without fast free	With fast free
>> Ampere altra:   +12.61%		        +11.42%
>> n1sdp:		+8.30%			+3.85%
>> x86-sse:	+8.43%			+3.72%
>> -------------------------------------------------------------
>>
>> (3) Performance comparison with ZC_mempool used
>>     port 0 && 1: XL710 NIC
>>     with fast free
>> -------------------------------------------------------------
>> 		With recycle buffer	With zc_mempool
>> Ampere altra:	11.42%			3.54%
>> -------------------------------------------------------------
>>
>> Furthermore, we add recycle_mbuf engine in testpmd. Due to XL710 NIC has
>> I/O bottleneck in testpmd in ampere altra, we can not see throughput change
>> compared with I/O fwd engine. However, using record cmd in testpmd:
>> '$set record-burst-stats on'
>> we can see the ratio of 'Rx/Tx burst size of 32' is reduced. This
>> indicate mbufs recycle can save CPU cycles.
>>
>> V2:
>> 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
>> 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
>> 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
>> 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
>>
>> V3:
>> 1. Separate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
>> 2. Delete L3fwd change for direct rearm (Jerin)
>> 3. enable direct rearm in ixgbe driver in Arm
>>
>> v4:
>> 1. Rename direct-rearm as buffer recycle. Based on this, function name
>> and variable name are changed to let this mode more general for all
>> drivers. (Konstantin, Morten)
>> 2. Add ring wrapping check (Konstantin)
>>
>> v5:
>> 1. some change for ethdev API (Morten)
>> 2. add support for avx2, sse, altivec path
>>
>> v6:
>> 1. fix ixgbe build issue in ppc
>> 2. remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
>>    API wrapper (Tech Board meeting)
>> 3. add recycle_mbufs engine in testpmd (Tech Board meeting)
>> 4. add namespace in the functions related to mbufs recycle(Ferruh)
>>
>> v7:
>> 1. move 'rxq/txq data' pointers to the beginning of eth_dev structure,
>> in order to keep them in the same cache line as rx/tx_burst function
>> pointers (Morten)
>> 2. add the extra description for 'rte_eth_recycle_mbufs' to show it can
>> support feeding 1 Rx queue from 2 Tx queues in the same thread
>> (Konstantin)
>> 3. For i40e/ixgbe driver, make the previous copied buffers as invalid if
>> there are Tx buffers refcnt > 1 or from unexpected mempool (Konstantin)
>> 4. add check for the return value of 'rte_eth_recycle_rx_queue_info_get'
>> in testpmd fwd engine (Morten)
>>
>> v8:
>> 1. add arm/x86 build option to fix ixgbe build issue in ppc
>>
>> v9:
>> 1. delete duplicate file name for ixgbe
>>
>> v10:
>> 1. fix compile issue on windows
>>
>> v11:
>> 1. fix doc warning
>>
>> v12:
>> 1. replace rx queue check code with eth_dev_validate_rx_queue
>> function (Stephen)
>> 2. put port and queue check before function call (Konstantin)
>>
>> Feifei Wang (4):
>>   ethdev: add API for mbufs recycle mode
>>   net/i40e: implement mbufs recycle mode
>>   net/ixgbe: implement mbufs recycle mode
>>   app/testpmd: add recycle mbufs engine
>>
> 
> Thanks for dedication to improve the patchset and finding better
> solution, it is appreciated.
> 
> Series applied to dpdk-next-net/main, thanks.
> 

Konstantin highlighted that there is an outstanding discussion:
http://patchwork.dpdk.org/project/dpdk/patch/20230822072710.1945027-3-feifei.wang2@arm.com/

Dropping the patchset from next-net, and updating its status in the
patchwork as "Changes Requested".
  
Ferruh Yigit Sept. 27, 2023, 5:24 p.m. UTC | #6
On 9/25/2023 4:19 AM, Feifei Wang wrote:
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of loads from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into its software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache. We call this mode mbufs
> recycle mode.
> 
> In the latest version, mbufs recycle mode is packaged as a separate API.
> This allows users to change the rxq/txq pairing at run time in the data plane,
> according to the application's analysis of the packet flow, for example:
> -----------------------------------------------------------------------
> Step 1: the upper-layer application analyses the flow direction
> Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid);
> Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid, recycle_rxq_info);
> Step 4: rte_eth_rx_burst(rx_portid, rx_queueid);
> Step 5: rte_eth_tx_burst(tx_portid, tx_queueid);
> -----------------------------------------------------------------------
> The above lets the user change the rxq/txq pairing at run time, without needing to
> know the direction of the flow in advance. This effectively widens the scenarios in
> which mbufs recycle mode can be used.
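
As a rough illustration of what Step 3 does, here is a minimal, self-contained
sketch. It is a toy model, not the ethdev API or any PMD code: every name in it
(toy_mbuf, toy_txq, toy_rxq, toy_recycle_mbufs) is hypothetical, and the
all-or-nothing validation pass is an assumption loosely based on the v7
changelog item about refcnt > 1 and unexpected mempools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SIZE 64u                  /* power of two, so a mask handles wrap-around */
#define TOY_RING_MASK (TOY_RING_SIZE - 1u)

/* Stand-in for a packet buffer; only the fields this sketch needs. */
struct toy_mbuf {
    uint16_t refcnt;   /* 1 means no other owner holds the buffer */
    int pool_id;       /* which (hypothetical) mempool it came from */
};

/* Hypothetical Tx queue: buffers whose transmission has completed. */
struct toy_txq {
    struct toy_mbuf *sw_ring[TOY_RING_SIZE];
    uint16_t next_done;   /* oldest completed slot */
    uint16_t nb_done;     /* completed buffers not yet released */
};

/* Hypothetical Rx queue: slots that need a buffer before they can be reused. */
struct toy_rxq {
    struct toy_mbuf *sw_ring[TOY_RING_SIZE];
    uint16_t refill_head; /* next slot to refill */
    uint16_t nb_empty;    /* slots waiting for a buffer */
    int pool_id;          /* mempool this queue expects */
};

/*
 * Move up to 'burst' buffer pointers straight from the Tx ring into the
 * Rx ring. Returns the number moved, or 0 if any candidate is shared
 * (refcnt != 1) or comes from a different pool; the caller would then
 * fall back to the normal mempool path.
 */
static uint16_t
toy_recycle_mbufs(struct toy_rxq *rxq, struct toy_txq *txq, uint16_t burst)
{
    uint16_t n = burst, i;

    assert(rxq != NULL && txq != NULL);
    if (n > txq->nb_done)
        n = txq->nb_done;
    if (n > rxq->nb_empty)
        n = rxq->nb_empty;

    /* Pass 1: validate every candidate before touching the Rx ring. */
    for (i = 0; i < n; i++) {
        const struct toy_mbuf *m =
            txq->sw_ring[(txq->next_done + i) & TOY_RING_MASK];
        if (m == NULL || m->refcnt != 1 || m->pool_id != rxq->pool_id)
            return 0;
    }

    /* Pass 2: one pointer copy per buffer, Tx ring to Rx ring. */
    for (i = 0; i < n; i++) {
        uint16_t t = (txq->next_done + i) & TOY_RING_MASK;
        uint16_t r = (rxq->refill_head + i) & TOY_RING_MASK;
        rxq->sw_ring[r] = txq->sw_ring[t];
        txq->sw_ring[t] = NULL;
    }

    txq->next_done = (txq->next_done + n) & TOY_RING_MASK;
    txq->nb_done -= n;
    rxq->refill_head = (rxq->refill_head + n) & TOY_RING_MASK;
    rxq->nb_empty -= n;
    return n;
}
```

In the loop quoted above, a call like this would sit between Step 2 and
Step 4, so that the Rx ring is already refilled from the Tx ring when the
Rx burst runs and the lcore cache is not touched for those buffers.
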
> 
> Furthermore, mbufs recycle mode is no longer limited to a single PMD:
> it can move mbufs between PMDs from different vendors, and can even put the mbufs
> anywhere in your Rx mbuf ring as long as the address of the mbuf ring can be provided.
> In the latest version, we enable mbufs recycle mode in the i40e and ixgbe PMDs. We also
> tried the i40e driver for Rx with the ixgbe driver for Tx, and achieved a 7-9% performance
> improvement with mbufs recycle mode.
> 
> Difference between mbufs recycle, the ZC API used in mempool, and the general path:
> For general path: 
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
> For ZC API used in mempool:
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
>                 Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
> For mbufs recycle:
>                 Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
> Thus, in one loop, mbufs recycle mode saves 32+32=64 pkts memcpy compared to the general path,
> and 32 pkts memcpy compared to the ZC API used in mempool.
> So, mbufs recycle has its own benefits.
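
The counts above can be checked with a small worked example. This is only
arithmetic on the figures quoted in the cover letter (bursts of 32 buffers,
8 bytes per pointer); the enum and function names are hypothetical and exist
only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define BURST 32u      /* buffers per burst, as in the cover letter */
#define PTR_BYTES 8u   /* one mbuf pointer, so one pass is 32 * 8 = 256B */

enum free_path { PATH_GENERAL, PATH_ZC_MEMPOOL, PATH_MBUFS_RECYCLE };

/* Number of BURST-sized pointer-copy passes in one Rx + Tx loop. */
static unsigned int
copy_passes(enum free_path p)
{
    switch (p) {
    case PATH_GENERAL:       /* cache->rx ring, tx ring->temp, temp->cache */
        return 3;
    case PATH_ZC_MEMPOOL:    /* cache->rx ring, tx ring->zero-copy cache */
        return 2;
    case PATH_MBUFS_RECYCLE: /* tx ring->rx ring */
        return 1;
    }
    assert(0 && "unknown path");
    return 0;
}

/* Pointer copies per loop for the given path. */
static unsigned int
pointer_copies(enum free_path p)
{
    return copy_passes(p) * BURST;
}

/* Bytes moved per loop for the given path. */
static unsigned int
bytes_copied(enum free_path p)
{
    return pointer_copies(p) * PTR_BYTES;
}
```

With these figures, the general path makes 96 pointer copies per loop, the
ZC mempool path 64, and mbufs recycle 32, which gives the 64 and 32 copy
savings stated above.
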
> 
> Testing status:
> (1) dpdk l3fwd test with multiple drivers:
>     port 0: 82599 NIC   port 1: XL710 NIC
> -------------------------------------------------------------
>                 Without fast free       With fast free
> Thunderx2:      +7.53%                  +13.54%
> -------------------------------------------------------------
> 
> (2) dpdk l3fwd test with same driver:
>     port 0 && 1: XL710 NIC
> -------------------------------------------------------------
>                 Without fast free       With fast free
> Ampere altra:   +12.61%                 +11.42%
> n1sdp:          +8.30%                  +3.85%
> x86-sse:        +8.43%                  +3.72%
> -------------------------------------------------------------
> 
> (3) Performance comparison with ZC_mempool used
>     port 0 && 1: XL710 NIC
>     with fast free
> -------------------------------------------------------------
>                 With recycle buffer     With zc_mempool
> Ampere altra:   11.42%                  3.54%
> -------------------------------------------------------------
> 
> Furthermore, we add a recycle_mbufs engine in testpmd. Because the XL710 NIC has
> an I/O bottleneck in testpmd on Ampere Altra, we cannot see a throughput change
> compared with the I/O fwd engine. However, using the record command in testpmd:
> '$set record-burst-stats on'
> we can see that the ratio of 'Rx/Tx burst size of 32' is reduced. This
> indicates that mbufs recycle can save CPU cycles.
> 
> V2:
> 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
> 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
> 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
> 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
> 
> V3:
> 1. Separate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
> 2. Delete L3fwd change for direct rearm (Jerin)
> 3. enable direct rearm in ixgbe driver in Arm
> 
> v4:
> 1. Rename direct-rearm as buffer recycle. Based on this, function names
> and variable names are changed to make this mode more general for all
> drivers. (Konstantin, Morten)
> 2. Add ring wrapping check (Konstantin)
> 
> v5:
> 1. some change for ethdev API (Morten)
> 2. add support for avx2, sse, altivec path
> 
> v6:
> 1. fix ixgbe build issue in ppc
> 2. remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
>    API wrapper (Tech Board meeting)
> 3. add recycle_mbufs engine in testpmd (Tech Board meeting)
> 4. add namespace in the functions related to mbufs recycle (Ferruh)
> 
> v7:
> 1. move 'rxq/txq data' pointers to the beginning of eth_dev structure,
> in order to keep them in the same cache line as rx/tx_burst function
> pointers (Morten)
> 2. add the extra description for 'rte_eth_recycle_mbufs' to show it can
> support feeding 1 Rx queue from 2 Tx queues in the same thread
> (Konstantin)
> 3. For i40e/ixgbe driver, mark the previously copied buffers as invalid if
> any Tx buffer has refcnt > 1 or comes from an unexpected mempool (Konstantin)
> 4. add check for the return value of 'rte_eth_recycle_rx_queue_info_get'
> in testpmd fwd engine (Morten)
> 
> v8:
> 1. add arm/x86 build option to fix ixgbe build issue in ppc
> 
> v9:
> 1. delete duplicate file name for ixgbe
> 
> v10:
> 1. fix compile issue on windows
> 
> v11:
> 1. fix doc warning
> 
> v12:
> 1. replace rx queue check code with eth_dev_validate_rx_queue
> function (Stephen)
> 2. put port and queue check before function call (Konstantin)
> 
> v13:
> 1. for i40e and ixgbe drivers, reset nb_recycle_mbufs to zero
> when rxep[i] == NULL, no matter what value refill_requirement
> is (Konstantin)
> 
> Feifei Wang (4):
>   ethdev: add API for mbufs recycle mode
>   net/i40e: implement mbufs recycle mode
>   net/ixgbe: implement mbufs recycle mode
>   app/testpmd: add recycle mbufs engine
> 

Series applied to dpdk-next-net/main, thanks.