
[v5,0/3] Recycle buffers from Tx to Rx

Message ID: 20230330062939.1206267-1-feifei.wang2@arm.com

Message

Feifei Wang March 30, 2023, 6:29 a.m. UTC
Currently, the transmit side frees buffers into the lcore cache and the
receive side allocates buffers from the lcore cache. The transmit side
typically frees 32 buffers, resulting in 32*8=256B of stores to the
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of loads from the lcore cache.

This patch series proposes a mechanism to avoid freeing to/allocating
from the lcore cache: the buffers freed by the transmit side are placed
directly into the receive side's software ring. This avoids the 256B of
loads and stores introduced by the lcore cache and also frees up the
cache lines used by the lcore cache. We call this mode buffer recycle
mode.

In the latest version, buffer recycle mode is packaged as a separate API.
This allows users to change the rxq/txq pairing at runtime in the data plane,
based on the application's analysis of the packet flow, for example:
-----------------------------------------------------------------------
Step 1: the upper application analyses the flow direction
Step 2: rxq_buf_recycle_info = rte_eth_rx_buf_recycle_info_get(rx_portid, rx_queueid)
Step 3: rte_eth_dev_buf_recycle(rx_portid, rx_queueid, tx_portid, tx_queueid, rxq_buf_recycle_info);
Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
-----------------------------------------------------------------------
The above allows the user to change the rxq/txq pairing at runtime, without
needing to know the flow direction in advance. This effectively expands the
use scenarios of buffer recycle mode.
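
As an illustration only, the steps above could sit in the application's
polling loop roughly as follows. The struct type name, the return type of
rte_eth_rx_buf_recycle_info_get and the exact prototype of
rte_eth_dev_buf_recycle are assumptions inferred from the step list, and the
function recycle_fwd_loop() is a hypothetical wrapper; only
rte_eth_rx_burst/rte_eth_tx_burst are the existing ethdev API. Step 1 (flow
analysis) happens before this function is called and selects the port/queue
ids passed in:
-----------------------------------------------------------------------
/* Sketch only: buf_recycle types and prototypes assumed, not final. */
#include <rte_ethdev.h>

static void
recycle_fwd_loop(uint16_t rx_portid, uint16_t rx_queueid,
		 uint16_t tx_portid, uint16_t tx_queueid)
{
	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info;
	struct rte_mbuf *pkts[32];
	uint16_t nb_rx;

	/* Step 2: fetch the Rx queue's buffer ring information once. */
	rxq_buf_recycle_info =
		rte_eth_rx_buf_recycle_info_get(rx_portid, rx_queueid);

	for (;;) {
		/* Step 3: move buffers freed by Tx directly into the
		 * Rx queue's software ring. */
		rte_eth_dev_buf_recycle(rx_portid, rx_queueid,
					tx_portid, tx_queueid,
					rxq_buf_recycle_info);

		/* Steps 4-5: the usual Rx/Tx burst processing. */
		nb_rx = rte_eth_rx_burst(rx_portid, rx_queueid, pkts, 32);
		if (nb_rx == 0)
			continue;
		/* ... packet processing ... */
		rte_eth_tx_burst(tx_portid, tx_queueid, pkts, nb_rx);
	}
}
-----------------------------------------------------------------------
If the application later observes that the flow direction has changed, it can
simply repeat Steps 2-3 with a different rxq/txq pair.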

Furthermore, buffer recycle mode is no longer limited to a single PMD: it can
move buffers between PMDs from different vendors, and it can even place the
buffers anywhere in your Rx buffer ring as long as the address of the buffer
ring is provided. In the latest version, we enable buffer recycle in the i40e
and ixgbe PMDs; using the i40e driver on Rx and the ixgbe driver on Tx, we
achieve a 7-9% performance improvement with buffer recycle mode.

Difference between buffer recycle, the ZC API used in mempool, and the general path:
For the general path:
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For the ZC API used in mempool:
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
                Reference: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For buffer recycle:
                Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
Thus, in one loop, buffer recycle saves 32+32=64 pkts of memcpy compared to the
general path, and 32 pkts of memcpy compared to the ZC API used in mempool.
So buffer recycle has its own benefits.
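
To make the counts above concrete, the per-loop recycle step is conceptually
one mbuf-pointer copy per packet from the Tx software ring into the Rx
software ring. The snippet below is a conceptual sketch only, not the actual
driver code; the real implementation also handles ring wrapping, refcounts
and descriptor rearm:
-----------------------------------------------------------------------
/* Conceptual sketch only, not the actual driver code. */
struct rte_mbuf **tx_sw_ring;   /* mbufs the NIC has finished sending   */
struct rte_mbuf **rx_sw_ring;   /* slots waiting to be filled for Rx    */
uint16_t tx_next, rx_next;      /* current ring positions (no wrapping) */
uint16_t i, n = 32;             /* typical burst size                   */

for (i = 0; i < n; i++) {
	/* One pointer copy per packet: Tx's finished buffer becomes
	 * Rx's new receive buffer; the mempool cache is never touched. */
	rx_sw_ring[rx_next + i] = tx_sw_ring[tx_next + i];
	/* The Rx descriptor is then re-armed with this buffer's IOVA. */
}
-----------------------------------------------------------------------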

Testing status:
(1) dpdk l3fwd test with multiple drivers:
    port 0: 82599 NIC   port 1: XL710 NIC
-------------------------------------------------------------
		Without fast free	With fast free
Thunderx2:      +7.53%	                +13.54%
-------------------------------------------------------------

(2) dpdk l3fwd test with same driver:
    port 0 && 1: XL710 NIC
-------------------------------------------------------------
		Without fast free	With fast free
Ampere altra:   +12.61%		        +11.42%
n1sdp:		+8.30%			+3.85%
x86-sse:	+8.43%			+3.72%
-------------------------------------------------------------

(3) Performance comparison with ZC_mempool used
    port 0 && 1: XL710 NIC
    with fast free
-------------------------------------------------------------
		With recycle buffer	With zc_mempool
Ampere altra:	11.42%			3.54%
-------------------------------------------------------------

V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)

V3:
1. Separate Rx and Tx operations with two APIs in direct-rearm (Konstantin)
2. Delete l3fwd change for direct rearm (Jerin)
3. Enable direct rearm in the ixgbe driver on Arm

v4:
1. Rename direct-rearm as buffer recycle. Based on this, function names
and variable names are changed to make this mode more general for all
drivers. (Konstantin, Morten)
2. Add ring wrapping check (Konstantin)

v5:
1. Some changes to the ethdev API (Morten)
2. Add support for the avx2, sse and altivec paths

Feifei Wang (3):
  ethdev: add API for buffer recycle mode
  net/i40e: implement recycle buffer mode
  net/ixgbe: implement recycle buffer mode

 drivers/net/i40e/i40e_ethdev.c   |   1 +
 drivers/net/i40e/i40e_ethdev.h   |   2 +
 drivers/net/i40e/i40e_rxtx.c     | 159 +++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h     |   4 +
 drivers/net/ixgbe/ixgbe_ethdev.c |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h |   3 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 153 ++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |   4 +
 lib/ethdev/ethdev_driver.h       |  10 ++
 lib/ethdev/ethdev_private.c      |   2 +
 lib/ethdev/rte_ethdev.c          |  33 +++++
 lib/ethdev/rte_ethdev.h          | 230 +++++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h     |  15 +-
 lib/ethdev/version.map           |   6 +
 14 files changed, 621 insertions(+), 2 deletions(-)
  

Comments

Stephen Hemminger March 30, 2023, 3:04 p.m. UTC | #1
On Thu, 30 Mar 2023 14:29:36 +0800
Feifei Wang <feifei.wang2@arm.com> wrote:

> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into its software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache. And we can call this mode as buffer
> recycle mode.


My naive reading of this is that lcore cache update is slow on ARM
so you are introducing yet another cache. Perhaps a better solution
would be to figure out/optimize the lcore cache to work better.

Adding another layer of abstraction is not going to help everyone
and the implementation you chose requires modifications to drivers
to get it to work.

In current form, this is not acceptable.
  
Feifei Wang April 3, 2023, 2:48 a.m. UTC | #2
Thanks for the review.

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, March 30, 2023 11:05 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru;
> mb@smartsharesystems.com; nd <nd@arm.com>
> Subject: Re: [PATCH v5 0/3] Recycle buffers from Tx to Rx
> 
> On Thu, 30 Mar 2023 14:29:36 +0800
> Feifei Wang <feifei.wang2@arm.com> wrote:
> 
> > Currently, the transmit side frees the buffers into the lcore cache
> > and the receive side allocates buffers from the lcore cache. The
> > transmit side typically frees 32 buffers resulting in 32*8=256B of
> > stores to lcore cache. The receive side allocates 32 buffers and
> > stores them in the receive side software ring, resulting in 32*8=256B
> > of stores and 256B of load from the lcore cache.
> >
> > This patch proposes a mechanism to avoid freeing to/allocating from
> > the lcore cache. i.e. the receive side will free the buffers from
> > transmit side directly into its software ring. This will avoid the
> > 256B of loads and stores introduced by the lcore cache. It also frees
> > up the cache lines used by the lcore cache. And we can call this mode
> > as buffer recycle mode.
> 
> 
> My naive reading of this is that lcore cache update is slow on ARM so you are
> introducing yet another cache. Perhaps a better solution would be to figure
> out/optimize the lcore cache to work better.

From my point of view, 'recycle buffer' is a strategic optimization. It reduces the number
of operations per buffer. Not only Arm but also x86 and other architectures can benefit from
this. For example, the x86 SSE path shows a performance improvement in the cover letter test results.

> 
> Adding another layer of abstraction is not going to help everyone and the
> implementation you chose requires modifications to drivers to get it to work.
> 
We did not change the original driver mechanism. Buffer recycle can be seen as an optional
feature of a PMD: if the user needs higher performance, he/she can choose to call the API
in the application to enable it.

> In current form, this is not acceptable.
  
Ferruh Yigit April 19, 2023, 2:56 p.m. UTC | #3
On 3/30/2023 7:29 AM, Feifei Wang wrote:
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into its software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache. And we can call this mode as buffer
> recycle mode.
> 
> In the latest version, buffer recycle mode is packaged as a separate API. 
> This allows for the users to change rxq/txq pairing in real time in data plane,
> according to the analysis of the packet flow by the application, for example:
> -----------------------------------------------------------------------
> Step 1: upper application analyse the flow direction
> Step 2: rxq_buf_recycle_info = rte_eth_rx_buf_recycle_info_get(rx_portid, rx_queueid)
> Step 3: rte_eth_dev_buf_recycle(rx_portid, rx_queueid, tx_portid, tx_queueid, rxq_buf_recycle_info);
> Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
> Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
> -----------------------------------------------------------------------
> Above can support user to change rxq/txq pairing  at runtime and user does not need to
> know the direction of flow in advance. This can effectively expand buffer recycle mode's
> use scenarios.
> 
> Furthermore, buffer recycle mode is no longer limited to the same pmd,
> it can support moving buffers between different vendor pmds, even can put the buffer
> anywhere into your Rx buffer ring as long as the address of the buffer ring can be provided.
> In the latest version, we enable direct-rearm in i40e pmd and ixgbe pmd, and also try to
> use i40e driver in Rx, ixgbe driver in Tx, and then achieve 7-9% performance improvement
> by buffer recycle mode.
> 
> Difference between buffer recycle, ZC API used in mempool and general path
> For general path: 
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
> For ZC API used in mempool:
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
>                 Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
> For buffer recycle:
>                 Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
> Thus we can see in the one loop, compared to general path, buffer recycle reduce 32+32=64 pkts memcpy;
> Compared to ZC API used in mempool, we can see buffer recycle reduce 32 pkts memcpy in each loop.
> So, buffer recycle has its own benefits.
> 
> Testing status:
> (1) dpdk l3fwd test with multiple drivers:
>     port 0: 82599 NIC   port 1: XL710 NIC
> -------------------------------------------------------------
> 		Without fast free	With fast free
> Thunderx2:      +7.53%	                +13.54%
> -------------------------------------------------------------
> 
> (2) dpdk l3fwd test with same driver:
>     port 0 && 1: XL710 NIC
> -------------------------------------------------------------
> 		Without fast free	With fast free
> Ampere altra:   +12.61%		        +11.42%
> n1sdp:		+8.30%			+3.85%
> x86-sse:	+8.43%			+3.72%
> -------------------------------------------------------------
> 
> (3) Performance comparison with ZC_mempool used
>     port 0 && 1: XL710 NIC
>     with fast free
> -------------------------------------------------------------
> 		With recycle buffer	With zc_mempool
> Ampere altra:	11.42%			3.54%
> -------------------------------------------------------------
> 

Thanks for the perf test reports.

Since the tests were done on Intel NICs, it would be great to get some testing
and performance numbers from the Intel side too, if possible.

> V2:
> 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
> 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
> 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
> 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
> 
> V3:
> 1. Seperate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
> 2. Delete L3fwd change for direct rearm (Jerin)
> 3. enable direct rearm in ixgbe driver in Arm
> 
> v4:
> 1. Rename direct-rearm as buffer recycle. Based on this, function name
> and variable name are changed to let this mode more general for all
> drivers. (Konstantin, Morten)
> 2. Add ring wrapping check (Konstantin)
> 
> v5:
> 1. some change for ethdev API (Morten)
> 2. add support for avx2, sse, altivec path
> 
> Feifei Wang (3):
>   ethdev: add API for buffer recycle mode
>   net/i40e: implement recycle buffer mode
>   net/ixgbe: implement recycle buffer mode
> 
>  drivers/net/i40e/i40e_ethdev.c   |   1 +
>  drivers/net/i40e/i40e_ethdev.h   |   2 +
>  drivers/net/i40e/i40e_rxtx.c     | 159 +++++++++++++++++++++
>  drivers/net/i40e/i40e_rxtx.h     |   4 +
>  drivers/net/ixgbe/ixgbe_ethdev.c |   1 +
>  drivers/net/ixgbe/ixgbe_ethdev.h |   3 +
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 153 ++++++++++++++++++++
>  drivers/net/ixgbe/ixgbe_rxtx.h   |   4 +
>  lib/ethdev/ethdev_driver.h       |  10 ++
>  lib/ethdev/ethdev_private.c      |   2 +
>  lib/ethdev/rte_ethdev.c          |  33 +++++
>  lib/ethdev/rte_ethdev.h          | 230 +++++++++++++++++++++++++++++++
>  lib/ethdev/rte_ethdev_core.h     |  15 +-
>  lib/ethdev/version.map           |   6 +
>  14 files changed, 621 insertions(+), 2 deletions(-)
> 

Is a usage sample of these new APIs planned? Could it be a new forwarding
mode in testpmd?
  
Feifei Wang April 25, 2023, 7:57 a.m. UTC | #4
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@amd.com>
> Sent: Wednesday, April 19, 2023 10:56 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Qi Z Zhang
> <qi.z.zhang@intel.com>; Mcnamara, John <john.mcnamara@intel.com>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru;
> mb@smartsharesystems.com; nd <nd@arm.com>
> Subject: Re: [PATCH v5 0/3] Recycle buffers from Tx to Rx
> 
> On 3/30/2023 7:29 AM, Feifei Wang wrote:
> > Currently, the transmit side frees the buffers into the lcore cache
> > and the receive side allocates buffers from the lcore cache. The
> > transmit side typically frees 32 buffers resulting in 32*8=256B of
> > stores to lcore cache. The receive side allocates 32 buffers and
> > stores them in the receive side software ring, resulting in 32*8=256B
> > of stores and 256B of load from the lcore cache.
> >
> > This patch proposes a mechanism to avoid freeing to/allocating from
> > the lcore cache. i.e. the receive side will free the buffers from
> > transmit side directly into its software ring. This will avoid the
> > 256B of loads and stores introduced by the lcore cache. It also frees
> > up the cache lines used by the lcore cache. And we can call this mode
> > as buffer recycle mode.
> >
> > In the latest version, buffer recycle mode is packaged as a separate API.
> > This allows for the users to change rxq/txq pairing in real time in
> > data plane, according to the analysis of the packet flow by the application,
> for example:
> > ----------------------------------------------------------------------
> > - Step 1: upper application analyse the flow direction Step 2:
> > rxq_buf_recycle_info = rte_eth_rx_buf_recycle_info_get(rx_portid,
> > rx_queueid) Step 3: rte_eth_dev_buf_recycle(rx_portid, rx_queueid,
> > tx_portid, tx_queueid, rxq_buf_recycle_info); Step 4:
> > rte_eth_rx_burst(rx_portid,rx_queueid);
> > Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
> > ----------------------------------------------------------------------
> > - Above can support user to change rxq/txq pairing  at runtime and
> > user does not need to know the direction of flow in advance. This can
> > effectively expand buffer recycle mode's use scenarios.
> >
> > Furthermore, buffer recycle mode is no longer limited to the same pmd,
> > it can support moving buffers between different vendor pmds, even can
> > put the buffer anywhere into your Rx buffer ring as long as the address of the
> buffer ring can be provided.
> > In the latest version, we enable direct-rearm in i40e pmd and ixgbe
> > pmd, and also try to use i40e driver in Rx, ixgbe driver in Tx, and
> > then achieve 7-9% performance improvement by buffer recycle mode.
> >
> > Difference between buffer recycle, ZC API used in mempool and general
> > path For general path:
> >                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
> >                 Tx: 32 pkts memcpy from tx_sw_ring to temporary
> > variable + 32 pkts memcpy from temporary variable to mempool cache For
> ZC API used in mempool:
> >                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
> >                 Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
> >                 Refer link:
> > http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-
> kama
> > lakshitha.aligeri@arm.com/
> > For buffer recycle:
> >                 Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
> > Thus we can see in the one loop, compared to general path, buffer
> > recycle reduce 32+32=64 pkts memcpy; Compared to ZC API used in
> mempool, we can see buffer recycle reduce 32 pkts memcpy in each loop.
> > So, buffer recycle has its own benefits.
> >
> > Testing status:
> > (1) dpdk l3fwd test with multiple drivers:
> >     port 0: 82599 NIC   port 1: XL710 NIC
> > -------------------------------------------------------------
> > 		Without fast free	With fast free
> > Thunderx2:      +7.53%	                +13.54%
> > -------------------------------------------------------------
> >
> > (2) dpdk l3fwd test with same driver:
> >     port 0 && 1: XL710 NIC
> > -------------------------------------------------------------
> > 		Without fast free	With fast free
> > Ampere altra:   +12.61%		        +11.42%
> > n1sdp:		+8.30%			+3.85%
> > x86-sse:	+8.43%			+3.72%
> > -------------------------------------------------------------
> >
> > (3) Performance comparison with ZC_mempool used
> >     port 0 && 1: XL710 NIC
> >     with fast free
> > -------------------------------------------------------------
> > 		With recycle buffer	With zc_mempool
> > Ampere altra:	11.42%			3.54%
> > -------------------------------------------------------------
> >
> 
> Thanks for the perf test reports.
> 
> Since test is done on Intel NICs, it would be great to get some testing and
> performance numbers from Intel side too, if possible.

Thanks for the review.
Actually, we have run the test on x86. The performance numbers above show that
in the x86 SSE path, buffer recycle improves performance by 3.72% ~ 8.43%.

> 
> > V2:
> > 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa) 2.
> > Add 'txq_data_get' API to get txq info for Rx (Konstantin) 3. Use
> > input parameter to enable direct rearm in l3fwd (Konstantin) 4. Add
> > condition detection for direct rearm API (Morten, Andrew Rybchenko)
> >
> > V3:
> > 1. Seperate Rx and Tx operation with two APIs in direct-rearm
> > (Konstantin) 2. Delete L3fwd change for direct rearm (Jerin) 3. enable
> > direct rearm in ixgbe driver in Arm
> >
> > v4:
> > 1. Rename direct-rearm as buffer recycle. Based on this, function name
> > and variable name are changed to let this mode more general for all
> > drivers. (Konstantin, Morten) 2. Add ring wrapping check (Konstantin)
> >
> > v5:
> > 1. some change for ethdev API (Morten) 2. add support for avx2, sse,
> > altivec path
> >
> > Feifei Wang (3):
> >   ethdev: add API for buffer recycle mode
> >   net/i40e: implement recycle buffer mode
> >   net/ixgbe: implement recycle buffer mode
> >
> >  drivers/net/i40e/i40e_ethdev.c   |   1 +
> >  drivers/net/i40e/i40e_ethdev.h   |   2 +
> >  drivers/net/i40e/i40e_rxtx.c     | 159 +++++++++++++++++++++
> >  drivers/net/i40e/i40e_rxtx.h     |   4 +
> >  drivers/net/ixgbe/ixgbe_ethdev.c |   1 +
> >  drivers/net/ixgbe/ixgbe_ethdev.h |   3 +
> >  drivers/net/ixgbe/ixgbe_rxtx.c   | 153 ++++++++++++++++++++
> >  drivers/net/ixgbe/ixgbe_rxtx.h   |   4 +
> >  lib/ethdev/ethdev_driver.h       |  10 ++
> >  lib/ethdev/ethdev_private.c      |   2 +
> >  lib/ethdev/rte_ethdev.c          |  33 +++++
> >  lib/ethdev/rte_ethdev.h          | 230
> +++++++++++++++++++++++++++++++
> >  lib/ethdev/rte_ethdev_core.h     |  15 +-
> >  lib/ethdev/version.map           |   6 +
> >  14 files changed, 621 insertions(+), 2 deletions(-)
> >
> 
> Is usage sample of these new APIs planned? Can it be a new forwarding mode
> in testpmd?

Agreed. Following the discussion in the Tech Board meeting, we will add buffer recycle as a testpmd forwarding engine.