[RFC,v1,0/4] Direct re-arming of buffers on receive side

Message ID 20211224164613.32569-1-feifei.wang2@arm.com (mailing list archive)

Message

Feifei Wang Dec. 24, 2021, 4:46 p.m. UTC
  Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.

This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache, i.e. the receive side will free the buffers from the
transmit side directly into its software ring. This avoids the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache.
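
For illustration, here is a minimal sketch of the idea; the structures,
field names, and burst size below are simplified assumptions for this cover
letter, not the actual i40e code in this series, and ring wrap-around
handling is omitted.

#include <stdint.h>

#include <rte_mbuf.h>

#define SKETCH_RING_SIZE 1024
#define SKETCH_REARM_BURST 32           /* mirrors the 32-buffer bursts above */

struct sketch_txq {
	struct rte_mbuf *sw_ring[SKETCH_RING_SIZE];
	uint16_t tx_next_dd;            /* first completed Tx descriptor */
};

struct sketch_rxq {
	struct rte_mbuf *sw_ring[SKETCH_RING_SIZE];
	uint16_t rxrearm_start;         /* next Rx slot to re-arm */
	struct sketch_txq *mapped_txq;  /* Tx queue this Rx queue recycles from */
};

/* Conventional path: Tx free stores 32 mbuf pointers into the lcore cache,
 * then Rx re-arm loads them back and stores them into the Rx sw-ring.
 * Direct re-arm: completed Tx buffers go straight into the Rx sw-ring. */
static inline void
sketch_direct_rearm(struct sketch_rxq *rxq)
{
	struct sketch_txq *txq = rxq->mapped_txq;
	int i;

	for (i = 0; i < SKETCH_REARM_BURST; i++)
		rxq->sw_ring[rxq->rxrearm_start + i] =
			txq->sw_ring[txq->tx_next_dd + i];

	/* A real driver would also write each buffer's IOVA into the Rx
	 * descriptor ring and update the queue tail register here. */
	rxq->rxrearm_start += SKETCH_REARM_BURST;
	txq->tx_next_dd += SKETCH_REARM_BURST;
}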

However, this solution poses several constraints:

1)The receive queue needs to know which transmit queue it should take
the buffers from. The application logic decides which transmit port to
use to send out the packets. In many use cases the NIC might have a
single port ([1], [2], [3]), in which case a given transmit queue is
always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
is easy to configure.

If the NIC has 2 ports (there are several references), then we will have
a 1:2 (Rx queue: Tx queue) mapping, which is still easy to configure.
However, if this is generalized to 'N' ports, the configuration can be
long. Moreover, the PMD would have to scan a list of transmit queues to
pull the buffers from.

2)The other factor that needs to be considered is 'run-to-completion' vs
'pipeline' models. In the run-to-completion model, the receive side and
the transmit side run on the same lcore serially. In the pipeline model,
the receive side and the transmit side might run on different lcores in
parallel. This would require locking and is not supported at this point.

3)Tx and Rx buffers must be from the same mempool. We must also ensure
that the number of buffers freed on the Tx side equals the number
re-armed on the Rx side:
(txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
This allows 'tx_next_dd' to be updated correctly in direct-rearm mode,
since tx_next_dd is the variable used to compute the Tx sw-ring free
location; its value will be one round ahead of the position where the
next free starts.

Current status in this RFC:
1)An API is added to allow for mapping a TX queue to a RX queue.
  Currently it supports 1:1 mapping (a usage sketch follows this list).
2)The i40e driver is changed to do the direct re-arm of the receive
  side.
3)L3fwd application is hacked to do the mapping for the following command:
  one core two flows case:
  $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
  -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
  where:
  Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
  Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
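
As a usage sketch for the mapping above: the real API is added by patch
2/4 ("ethdev: add API for direct re-arm mode"); the function name and
signature below are assumptions for illustration only.

#include <rte_ethdev.h>

/* Hypothetical prototype standing in for the API added in patch 2/4. */
int rte_eth_direct_rxrearm_map(uint16_t rx_port, uint16_t rx_queue,
			       uint16_t tx_port, uint16_t tx_queue);

static int
setup_direct_rearm_mapping(void)
{
	int ret;

	/* Port 0 Rx queue 0 recycles buffers from Port 1 Tx queue 0. */
	ret = rte_eth_direct_rxrearm_map(0, 0, 1, 0);
	if (ret != 0)
		return ret;

	/* Port 1 Rx queue 0 recycles buffers from Port 0 Tx queue 0. */
	return rte_eth_direct_rxrearm_map(1, 0, 0, 0);
}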

Testing status:
1)Tested L3fwd with the above command:
The testing results for L3fwd are as follows:
-------------------------------------------------------------------
N1SDP:
Base performance (with this patch)   With direct re-arm mode enabled
      0%                                  +14.1%

Ampere Altra:
Base performance (with this patch)   With direct re-arm mode enabled
      0%                                  +17.1%
-------------------------------------------------------------------
This patch does not affect the performance of normal mode. With
direct-rearm mode enabled, performance improves by 14% - 17% on N1SDP
and Ampere Altra.

Feedback requested:
1) Has anyone done any similar experiments, any lessons learnt?
2) Feedback on API

Next steps:
1) Update the code to support 1:N (Rx : Tx) mapping
2) Automate the configuration in L3fwd sample application

Reference:
[1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
[2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
[3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g

Feifei Wang (4):
  net/i40e: enable direct re-arm mode
  ethdev: add API for direct re-arm mode
  net/i40e: add direct re-arm mode internal API
  examples/l3fwd: give an example for direct rearm mode

 drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
 drivers/net/i40e/i40e_rxtx.h          |   4 +
 drivers/net/i40e/i40e_rxtx_vec_neon.c | 149 +++++++++++++++++++++++++-
 examples/l3fwd/main.c                 |   3 +
 lib/ethdev/ethdev_driver.h            |  15 +++
 lib/ethdev/rte_ethdev.c               |  14 +++
 lib/ethdev/rte_ethdev.h               |  31 ++++++
 lib/ethdev/version.map                |   3 +
 8 files changed, 251 insertions(+), 2 deletions(-)
  

Comments

Morten Brørup Dec. 26, 2021, 10:25 a.m. UTC | #1
> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Friday, 24 December 2021 17.46
> 
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into it's software ring. This will avoid the
> 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache.
> 
> However, this solution poses several constraint:
> 
> 1)The receive queue needs to know which transmit queue it should take
> the buffers from. The application logic decides which transmit port to
> use to send out the packets. In many use cases the NIC might have a
> single port ([1], [2], [3]), in which case a given transmit queue is
> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> is easy to configure.
> 
> If the NIC has 2 ports (there are several references), then we will
> have
> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> However, if this is generalized to 'N' ports, the configuration can be
> long. More over the PMD would have to scan a list of transmit queues to
> pull the buffers from.

I disagree with the description of this constraint.

As I understand it, it doesn't matter how many ports or queues are in a NIC or system.

The constraint is more narrow:

This patch requires that all packets ingressing on some port/queue must egress on the specific port/queue that it has been configured to rearm its buffers from. I.e. an application cannot route packets between multiple ports with this patch.

> 
> 2)The other factor that needs to be considered is 'run-to-completion'
> vs
> 'pipeline' models. In the run-to-completion model, the receive side and
> the transmit side are running on the same lcore serially. In the
> pipeline
> model. The receive side and transmit side might be running on different
> lcores in parallel. This requires locking. This is not supported at
> this
> point.
> 
> 3)Tx and Rx buffers must be from the same mempool. And we also must
> ensure Tx buffer free number is equal to Rx buffer free number:
> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
> Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
> is due to tx_next_dd is a variable to compute tx sw-ring free location.
> Its value will be one more round than the position where next time free
> starts.
> 

You are missing the fourth constraint:

4) The application must transmit all received packets immediately, i.e. QoS queueing and similar is prohibited.

> Current status in this RFC:
> 1)An API is added to allow for mapping a TX queue to a RX queue.
>   Currently it supports 1:1 mapping.
> 2)The i40e driver is changed to do the direct re-arm of the receive
>   side.
> 3)L3fwd application is hacked to do the mapping for the following
> command:
>   one core two flows case:
>   $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
>   -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
>   where:
>   Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
>   Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
> 
> Testing status:
> 1)Tested L3fwd with the above command:
> The testing results for L3fwd are as follows:
> -------------------------------------------------------------------
> N1SDP:
> Base performance(with this patch)   with direct re-arm mode enabled
>       0%                                  +14.1%
> 
> Ampere Altra:
> Base performance(with this patch)   with direct re-arm mode enabled
>       0%                                  +17.1%
> -------------------------------------------------------------------
> This patch can not affect performance of normal mode, and if enable
> direct-rearm mode, performance can be improved by 14% - 17% in n1sdp
> and ampera-altra.
> 
> Feedback requested:
> 1) Has anyone done any similar experiments, any lessons learnt?
> 2) Feedback on API
> 
> Next steps:
> 1) Update the code for supporting 1:N(Rx : TX) mapping
> 2) Automate the configuration in L3fwd sample application
> 
> Reference:
> [1] https://store.nvidia.com/en-
> us/networking/store/product/MCX623105AN-
> CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
> [2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-
> ethernet-network-adapter-e810cqda1/specifications.html
> [3] https://www.broadcom.com/products/ethernet-connectivity/network-
> adapters/100gb-nic-ocp/n1100g
> 
> Feifei Wang (4):
>   net/i40e: enable direct re-arm mode
>   ethdev: add API for direct re-arm mode
>   net/i40e: add direct re-arm mode internal API
>   examples/l3fwd: give an example for direct rearm mode
> 
>  drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
>  drivers/net/i40e/i40e_rxtx.h          |   4 +
>  drivers/net/i40e/i40e_rxtx_vec_neon.c | 149 +++++++++++++++++++++++++-
>  examples/l3fwd/main.c                 |   3 +
>  lib/ethdev/ethdev_driver.h            |  15 +++
>  lib/ethdev/rte_ethdev.c               |  14 +++
>  lib/ethdev/rte_ethdev.h               |  31 ++++++
>  lib/ethdev/version.map                |   3 +
>  8 files changed, 251 insertions(+), 2 deletions(-)
> 
> --
> 2.25.1
> 

The patch provides a significant performance improvement, but I am wondering if any real world applications exist that would use this. Only a "router on a stick" (i.e. a single-port router) comes to my mind, and that is probably sufficient to call it useful in the real world. Do you have any other examples to support the usefulness of this patch?

Anyway, the patch doesn't do any harm if unused, and the only performance cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev driver. So I don't oppose it.
  
Feifei Wang Dec. 28, 2021, 6:55 a.m. UTC | #2
Thanks for your comments.

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Sunday, December 26, 2021 6:25 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>
> Subject: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Friday, 24 December 2021 17.46
> >
> > Currently, the transmit side frees the buffers into the lcore cache
> > and the receive side allocates buffers from the lcore cache. The
> > transmit side typically frees 32 buffers resulting in 32*8=256B of
> > stores to lcore cache. The receive side allocates 32 buffers and
> > stores them in the receive side software ring, resulting in 32*8=256B
> > of stores and 256B of load from the lcore cache.
> >
> > This patch proposes a mechanism to avoid freeing to/allocating from
> > the lcore cache. i.e. the receive side will free the buffers from
> > transmit side directly into it's software ring. This will avoid the
> > 256B of loads and stores introduced by the lcore cache. It also frees
> > up the cache lines used by the lcore cache.
> >
> > However, this solution poses several constraint:
> >
> > 1)The receive queue needs to know which transmit queue it should take
> > the buffers from. The application logic decides which transmit port to
> > use to send out the packets. In many use cases the NIC might have a
> > single port ([1], [2], [3]), in which case a given transmit queue is
> > always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> > is easy to configure.
> >
> > If the NIC has 2 ports (there are several references), then we will
> > have
> > 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> > However, if this is generalized to 'N' ports, the configuration can be
> > long. More over the PMD would have to scan a list of transmit queues
> > to pull the buffers from.
> 
> I disagree with the description of this constraint.
> 
> As I understand it, it doesn't matter now many ports or queues are in a NIC
> or system.
> 
> The constraint is more narrow:
> 
> This patch requires that all packets ingressing on some port/queue must
> egress on the specific port/queue that it has been configured to ream its
> buffers from. I.e. an application cannot route packets between multiple
> ports with this patch.

First, I agree that direct-rearm mode is suitable for the case where the
user knows the direction of the flow in advance and maps Rx/Tx queues to
each other. It is not suitable for the normal case of random packet routing.

Second, our two proposed cases, a one-port NIC and a two-port NIC, assume
that the direction of the flow is determined. Furthermore, for a two-port NIC
there may be two flow directions: from port 0 to port 1, or from port 0 to
port 0. Thus we need to have a 1:2 (Rx queue : Tx queue) mapping.

Finally, maybe we can change our description as follows:
"The first constraint is that user should know the direction of the flow in advance,
and based on this, user needs to map the Rx and Tx queues according to the flow direction:
For example, if the NIC just has one port
 ......
Or if the NIC have two ports 
......."
 
> 
> >
> > 2)The other factor that needs to be considered is 'run-to-completion'
> > vs
> > 'pipeline' models. In the run-to-completion model, the receive side
> > and the transmit side are running on the same lcore serially. In the
> > pipeline model. The receive side and transmit side might be running on
> > different lcores in parallel. This requires locking. This is not
> > supported at this point.
> >
> > 3)Tx and Rx buffers must be from the same mempool. And we also must
> > ensure Tx buffer free number is equal to Rx buffer free number:
> > (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH) Thus, 'tx_next_dd'
> > can be updated correctly in direct-rearm mode. This is due to
> > tx_next_dd is a variable to compute tx sw-ring free location.
> > Its value will be one more round than the position where next time
> > free starts.
> >
> 
> You are missing the fourth constraint:
> 
> 4) The application must transmit all received packets immediately, i.e. QoS
> queueing and similar is prohibited.
> 

You are right and this is indeed one of the limitations.

> > Current status in this RFC:
> > 1)An API is added to allow for mapping a TX queue to a RX queue.
> >   Currently it supports 1:1 mapping.
> > 2)The i40e driver is changed to do the direct re-arm of the receive
> >   side.
> > 3)L3fwd application is hacked to do the mapping for the following
> > command:
> >   one core two flows case:
> >   $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
> >   -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
> >   where:
> >   Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
> >   Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
> >
> > Testing status:
> > 1)Tested L3fwd with the above command:
> > The testing results for L3fwd are as follows:
> > -------------------------------------------------------------------
> > N1SDP:
> > Base performance(with this patch)   with direct re-arm mode enabled
> >       0%                                  +14.1%
> >
> > Ampere Altra:
> > Base performance(with this patch)   with direct re-arm mode enabled
> >       0%                                  +17.1%
> > -------------------------------------------------------------------
> > This patch can not affect performance of normal mode, and if enable
> > direct-rearm mode, performance can be improved by 14% - 17% in n1sdp
> > and ampera-altra.
> >
> > Feedback requested:
> > 1) Has anyone done any similar experiments, any lessons learnt?
> > 2) Feedback on API
> >
> > Next steps:
> > 1) Update the code for supporting 1:N(Rx : TX) mapping
> > 2) Automate the configuration in L3fwd sample application
> >
> > Reference:
> > [1] https://store.nvidia.com/en-
> > us/networking/store/product/MCX623105AN-
> >
> CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECrypt
> oDisabled
> > / [2]
> > https://www.intel.com/content/www/us/en/products/sku/192561/intel-
> > ethernet-network-adapter-e810cqda1/specifications.html
> > [3] https://www.broadcom.com/products/ethernet-
> connectivity/network-
> > adapters/100gb-nic-ocp/n1100g
> >
> > Feifei Wang (4):
> >   net/i40e: enable direct re-arm mode
> >   ethdev: add API for direct re-arm mode
> >   net/i40e: add direct re-arm mode internal API
> >   examples/l3fwd: give an example for direct rearm mode
> >
> >  drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
> >  drivers/net/i40e/i40e_rxtx.h          |   4 +
> >  drivers/net/i40e/i40e_rxtx_vec_neon.c | 149
> +++++++++++++++++++++++++-
> >  examples/l3fwd/main.c                 |   3 +
> >  lib/ethdev/ethdev_driver.h            |  15 +++
> >  lib/ethdev/rte_ethdev.c               |  14 +++
> >  lib/ethdev/rte_ethdev.h               |  31 ++++++
> >  lib/ethdev/version.map                |   3 +
> >  8 files changed, 251 insertions(+), 2 deletions(-)
> >
> > --
> > 2.25.1
> >
> 
> The patch provides a significant performance improvement, but I am
> wondering if any real world applications exist that would use this. Only a
> "router on a stick" (i.e. a single-port router) comes to my mind, and that is
> probably sufficient to call it useful in the real world. Do you have any other
> examples to support the usefulness of this patch?
> 
One case I have is network security. For a network firewall, all packets need
to ingress on a specified port and egress on a specified port to do packet filtering.
In this case, we can know the flow direction in advance.

> Anyway, the patch doesn't do any harm if unused, and the only performance
> cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev driver. So I
> don't oppose to it.
>
  
Ferruh Yigit Jan. 18, 2022, 3:51 p.m. UTC | #3
On 12/28/2021 6:55 AM, Feifei Wang wrote:
> Thanks for your comments.
> 
>> -----Original Message-----
>> From: Morten Brørup <mb@smartsharesystems.com>
>> Sent: Sunday, December 26, 2021 6:25 PM
>> To: Feifei Wang <Feifei.Wang2@arm.com>
>> Cc: dev@dpdk.org; nd <nd@arm.com>
>> Subject: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
>>
>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
>>> Sent: Friday, 24 December 2021 17.46
>>>
>>> Currently, the transmit side frees the buffers into the lcore cache
>>> and the receive side allocates buffers from the lcore cache. The
>>> transmit side typically frees 32 buffers resulting in 32*8=256B of
>>> stores to lcore cache. The receive side allocates 32 buffers and
>>> stores them in the receive side software ring, resulting in 32*8=256B
>>> of stores and 256B of load from the lcore cache.
>>>
>>> This patch proposes a mechanism to avoid freeing to/allocating from
>>> the lcore cache. i.e. the receive side will free the buffers from
>>> transmit side directly into it's software ring. This will avoid the
>>> 256B of loads and stores introduced by the lcore cache. It also frees
>>> up the cache lines used by the lcore cache.
>>>
>>> However, this solution poses several constraint:
>>>
>>> 1)The receive queue needs to know which transmit queue it should take
>>> the buffers from. The application logic decides which transmit port to
>>> use to send out the packets. In many use cases the NIC might have a
>>> single port ([1], [2], [3]), in which case a given transmit queue is
>>> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
>>> is easy to configure.
>>>
>>> If the NIC has 2 ports (there are several references), then we will
>>> have
>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
>>> However, if this is generalized to 'N' ports, the configuration can be
>>> long. More over the PMD would have to scan a list of transmit queues
>>> to pull the buffers from.
>>
>> I disagree with the description of this constraint.
>>
>> As I understand it, it doesn't matter now many ports or queues are in a NIC
>> or system.
>>
>> The constraint is more narrow:
>>
>> This patch requires that all packets ingressing on some port/queue must
>> egress on the specific port/queue that it has been configured to ream its
>> buffers from. I.e. an application cannot route packets between multiple
>> ports with this patch.
> 
> First, I agree with that direct-rearm mode is suitable for the case that
> user should know the direction of the flow in advance and map rx/tx with
> each other. It is not suitable for the normal packet random route case.
> 
> Second, our proposed two cases: one port NIC and two port NIC means the
> direction of flow is determined. Furthermore, for two port NIC, there maybe two
> flow directions: from port 0 to port 1, or from port 0 to port 0. Thus we need to have
> 1:2 (Rx queue :  Tx queue) mapping.
> 
> At last, maybe we can change our description as follows:
> "The first constraint is that user should know the direction of the flow in advance,
> and based on this, user needs to map the Rx and Tx queues according to the flow direction:
> For example, if the NIC just has one port
>   ......
> Or if the NIC have two ports
> ......."
>   
>>
>>>
>>> 2)The other factor that needs to be considered is 'run-to-completion'
>>> vs
>>> 'pipeline' models. In the run-to-completion model, the receive side
>>> and the transmit side are running on the same lcore serially. In the
>>> pipeline model. The receive side and transmit side might be running on
>>> different lcores in parallel. This requires locking. This is not
>>> supported at this point.
>>>
>>> 3)Tx and Rx buffers must be from the same mempool. And we also must
>>> ensure Tx buffer free number is equal to Rx buffer free number:
>>> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH) Thus, 'tx_next_dd'
>>> can be updated correctly in direct-rearm mode. This is due to
>>> tx_next_dd is a variable to compute tx sw-ring free location.
>>> Its value will be one more round than the position where next time
>>> free starts.
>>>
>>
>> You are missing the fourth constraint:
>>
>> 4) The application must transmit all received packets immediately, i.e. QoS
>> queueing and similar is prohibited.
>>
> 
> You are right and this is indeed one of the limitations.
> 
>>> Current status in this RFC:
>>> 1)An API is added to allow for mapping a TX queue to a RX queue.
>>>    Currently it supports 1:1 mapping.
>>> 2)The i40e driver is changed to do the direct re-arm of the receive
>>>    side.
>>> 3)L3fwd application is hacked to do the mapping for the following
>>> command:
>>>    one core two flows case:
>>>    $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
>>>    -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
>>>    where:
>>>    Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
>>>    Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
>>>
>>> Testing status:
>>> 1)Tested L3fwd with the above command:
>>> The testing results for L3fwd are as follows:
>>> -------------------------------------------------------------------
>>> N1SDP:
>>> Base performance(with this patch)   with direct re-arm mode enabled
>>>        0%                                  +14.1%
>>>
>>> Ampere Altra:
>>> Base performance(with this patch)   with direct re-arm mode enabled
>>>        0%                                  +17.1%
>>> -------------------------------------------------------------------
>>> This patch can not affect performance of normal mode, and if enable
>>> direct-rearm mode, performance can be improved by 14% - 17% in n1sdp
>>> and ampera-altra.
>>>
>>> Feedback requested:
>>> 1) Has anyone done any similar experiments, any lessons learnt?
>>> 2) Feedback on API
>>>
>>> Next steps:
>>> 1) Update the code for supporting 1:N(Rx : TX) mapping
>>> 2) Automate the configuration in L3fwd sample application
>>>
>>> Reference:
>>> [1] https://store.nvidia.com/en-
>>> us/networking/store/product/MCX623105AN-
>>>
>> CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECrypt
>> oDisabled
>>> / [2]
>>> https://www.intel.com/content/www/us/en/products/sku/192561/intel-
>>> ethernet-network-adapter-e810cqda1/specifications.html
>>> [3] https://www.broadcom.com/products/ethernet-
>> connectivity/network-
>>> adapters/100gb-nic-ocp/n1100g
>>>
>>> Feifei Wang (4):
>>>    net/i40e: enable direct re-arm mode
>>>    ethdev: add API for direct re-arm mode
>>>    net/i40e: add direct re-arm mode internal API
>>>    examples/l3fwd: give an example for direct rearm mode
>>>
>>>   drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
>>>   drivers/net/i40e/i40e_rxtx.h          |   4 +
>>>   drivers/net/i40e/i40e_rxtx_vec_neon.c | 149
>> +++++++++++++++++++++++++-
>>>   examples/l3fwd/main.c                 |   3 +
>>>   lib/ethdev/ethdev_driver.h            |  15 +++
>>>   lib/ethdev/rte_ethdev.c               |  14 +++
>>>   lib/ethdev/rte_ethdev.h               |  31 ++++++
>>>   lib/ethdev/version.map                |   3 +
>>>   8 files changed, 251 insertions(+), 2 deletions(-)
>>>
>>> --
>>> 2.25.1
>>>
>>
>> The patch provides a significant performance improvement, but I am
>> wondering if any real world applications exist that would use this. Only a
>> "router on a stick" (i.e. a single-port router) comes to my mind, and that is
>> probably sufficient to call it useful in the real world. Do you have any other
>> examples to support the usefulness of this patch?
>>
> One case I have is about network security. For network firewall, all packets need
> to ingress on the specified port and egress on the specified port to do packet filtering.
> In this case, we can know flow direction in advance.
> 

I also have some concerns about how useful this API will be in real life,
and whether the use case is worth the complexity it brings.
It also looks like too much low-level detail for the application.

cc'ed a few more folks for comment.

>> Anyway, the patch doesn't do any harm if unused, and the only performance
>> cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev driver. So I
>> don't oppose to it.
>>
>
  
Thomas Monjalon Jan. 18, 2022, 4:53 p.m. UTC | #4
[quick summary: ethdev API to bypass mempool]

18/01/2022 16:51, Ferruh Yigit:
> On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > Morten Brørup <mb@smartsharesystems.com>:
> >> The patch provides a significant performance improvement, but I am
> >> wondering if any real world applications exist that would use this. Only a
> >> "router on a stick" (i.e. a single-port router) comes to my mind, and that is
> >> probably sufficient to call it useful in the real world. Do you have any other
> >> examples to support the usefulness of this patch?
> >>
> > One case I have is about network security. For network firewall, all packets need
> > to ingress on the specified port and egress on the specified port to do packet filtering.
> > In this case, we can know flow direction in advance.
> 
> I also have some concerns on how useful this API will be in real life,
> and does the use case worth the complexity it brings.
> And it looks too much low level detail for the application.

That's difficult to judge.
The use case is limited and the API has some severe limitations.
The benefit is measured with l3fwd, which is not exactly a real app.
Do we want an API which improves performance in limited scenarios
at the cost of breaking some general design assumptions?

Can we achieve the same level of performance with a mempool trick?
  
Morten Brørup Jan. 18, 2022, 5:27 p.m. UTC | #5
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Tuesday, 18 January 2022 17.54
> 
> [quick summary: ethdev API to bypass mempool]
> 
> 18/01/2022 16:51, Ferruh Yigit:
> > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > Morten Brørup <mb@smartsharesystems.com>:
> > >> The patch provides a significant performance improvement, but I am
> > >> wondering if any real world applications exist that would use
> this. Only a
> > >> "router on a stick" (i.e. a single-port router) comes to my mind,
> and that is
> > >> probably sufficient to call it useful in the real world. Do you
> have any other
> > >> examples to support the usefulness of this patch?
> > >>
> > > One case I have is about network security. For network firewall,
> all packets need
> > > to ingress on the specified port and egress on the specified port
> to do packet filtering.
> > > In this case, we can know flow direction in advance.
> >
> > I also have some concerns on how useful this API will be in real
> life,
> > and does the use case worth the complexity it brings.
> > And it looks too much low level detail for the application.
> 
> That's difficult to judge.
> The use case is limited and the API has some severe limitations.
> The benefit is measured with l3fwd, which is not exactly a real app.
> Do we want an API which improves performance in limited scenarios
> at the cost of breaking some general design assumptions?
> 
> Can we achieve the same level of performance with a mempool trick?

Perhaps the mbuf library could offer bulk functions for alloc/free of raw mbufs - essentially a shortcut directly to the mempool library.
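
For illustration, a minimal sketch of such a shortcut, assuming it simply wraps the existing rte_mempool_get_bulk()/rte_mempool_put_bulk() calls; the wrapper names are made up here, and a real version would need the same mbuf sanity handling that rte_mbuf_raw_alloc() performs.

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Bulk "raw" mbuf alloc/free going straight to the mempool (and thus its
 * per-lcore cache), skipping the per-mbuf initialization done by
 * rte_pktmbuf_alloc_bulk(). */
static inline int
sketch_mbuf_raw_alloc_bulk(struct rte_mempool *mp, struct rte_mbuf **mbufs,
			   unsigned int n)
{
	return rte_mempool_get_bulk(mp, (void **)mbufs, n);
}

static inline void
sketch_mbuf_raw_free_bulk(struct rte_mempool *mp, struct rte_mbuf **mbufs,
			  unsigned int n)
{
	rte_mempool_put_bulk(mp, (void **)mbufs, n);
}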

There might be a few more details to micro-optimize in the mempool library, if approached with this use case in mind. E.g. the rte_mempool_default_cache() could do with a few unlikely() in its comparisons.

Also, for this use case, the mempool library adds tracing overhead, which this API bypasses. And considering how short the code path through the mempool cache is, the tracing overhead is relatively large. I.e.: memcpy(NIC->NIC) vs. trace() memcpy(NIC->cache) trace() memcpy(cache->NIC).

A key optimization point could be the number of mbufs being moved to/from the mempool cache. If that number was fixed at compile time, a faster memcpy() could be used. However, it seems that different PMDs use bursts of either 4, 8, or in this case 32 mbufs. If only they could agree on such a simple detail.

Overall, I strongly agree that it is preferable to optimize the core libraries, rather than bypass them. Bypassing will eventually lead to "spaghetti code".
  
Honnappa Nagarahalli Jan. 27, 2022, 4:06 a.m. UTC | #6
Thanks Morten, appreciate your comments. Few responses inline.

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Sunday, December 26, 2021 4:25 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>
> Subject: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Friday, 24 December 2021 17.46
> >
<snip>

> >
> > However, this solution poses several constraint:
> >
> > 1)The receive queue needs to know which transmit queue it should take
> > the buffers from. The application logic decides which transmit port to
> > use to send out the packets. In many use cases the NIC might have a
> > single port ([1], [2], [3]), in which case a given transmit queue is
> > always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> > is easy to configure.
> >
> > If the NIC has 2 ports (there are several references), then we will
> > have
> > 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> > However, if this is generalized to 'N' ports, the configuration can be
> > long. More over the PMD would have to scan a list of transmit queues
> > to pull the buffers from.
> 
> I disagree with the description of this constraint.
> 
> As I understand it, it doesn't matter now many ports or queues are in a NIC or
> system.
> 
> The constraint is more narrow:
> 
> This patch requires that all packets ingressing on some port/queue must
> egress on the specific port/queue that it has been configured to ream its
> buffers from. I.e. an application cannot route packets between multiple ports
> with this patch.
Agree, this patch as is has this constraint. It is not a constraint that would apply to NICs with a single port. The above text describes some of the issues associated with generalizing the solution to 'N' ports. If N is small, the configuration is small and scanning should not be bad.

> 
> >

<snip>

> >
> 
> You are missing the fourth constraint:
> 
> 4) The application must transmit all received packets immediately, i.e. QoS
> queueing and similar is prohibited.
I do not understand this, can you please elaborate? Even if there is QoS queuing, there would be a steady stream of packets being transmitted. These transmitted packets will fill the buffers on the RX side.

> 
<snip>

> >
> 
> The patch provides a significant performance improvement, but I am
> wondering if any real world applications exist that would use this. Only a
> "router on a stick" (i.e. a single-port router) comes to my mind, and that is
> probably sufficient to call it useful in the real world. Do you have any other
> examples to support the usefulness of this patch?
SmartNICs are a clear and dominant use case; typically they have a single port for data plane traffic (dual ports are mostly for redundancy).
This patch avoids a good amount of store operations. The smaller CPUs found in SmartNICs have smaller store buffers, which can become bottlenecks. Avoiding the lcore cache saves valuable HW cache space.

> 
> Anyway, the patch doesn't do any harm if unused, and the only performance
> cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev driver. So I
> don't oppose to it.
>
  
Honnappa Nagarahalli Jan. 27, 2022, 5:16 a.m. UTC | #7
<snip>

> 
> [quick summary: ethdev API to bypass mempool]
> 
> 18/01/2022 16:51, Ferruh Yigit:
> > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > Morten Brørup <mb@smartsharesystems.com>:
> > >> The patch provides a significant performance improvement, but I am
> > >> wondering if any real world applications exist that would use this.
> > >> Only a "router on a stick" (i.e. a single-port router) comes to my
> > >> mind, and that is probably sufficient to call it useful in the real
> > >> world. Do you have any other examples to support the usefulness of this
> patch?
> > >>
> > > One case I have is about network security. For network firewall, all
> > > packets need to ingress on the specified port and egress on the specified
> port to do packet filtering.
> > > In this case, we can know flow direction in advance.
> >
> > I also have some concerns on how useful this API will be in real life,
> > and does the use case worth the complexity it brings.
> > And it looks too much low level detail for the application.
I think the application writer already needs to know many low-level details to be able to extract performance out of PMDs, for example fast free.

> 
> That's difficult to judge.
> The use case is limited and the API has some severe limitations.
The use case applies to SmartNICs, which is a major use case. In terms of limitations, it depends on how one sees it. For example, the lcore cache is not applicable to the pipeline mode, but it is still accepted as it is helpful for something else.

> The benefit is measured with l3fwd, which is not exactly a real app.
It is funny how we treat l3fwd. When it shows a performance improvement, we treat it as 'not a real application'. When it shows (even a small) performance drop, the patches are not accepted. We need to make up our mind 😊

> Do we want an API which improves performance in limited scenarios at the
> cost of breaking some general design assumptions?
It is not breaking any existing design assumptions. It is a very well suited optimization for the SmartNIC use case. For this use case, it does not make sense for the same thread to copy data to a temporary location (the lcore cache), read it back immediately, and store it in another location. It is a waste of CPU cycles and memory bandwidth.

> 
> Can we achieve the same level of performance with a mempool trick?
We cannot, as this patch avoids the memory loads and stores caused by the temporary storage in the lcore cache, which reduces backend stalls.

>
  
Honnappa Nagarahalli Jan. 27, 2022, 5:24 a.m. UTC | #8
<snip>

> 
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Tuesday, 18 January 2022 17.54
> >
> > [quick summary: ethdev API to bypass mempool]
> >
> > 18/01/2022 16:51, Ferruh Yigit:
> > > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > > Morten Brørup <mb@smartsharesystems.com>:
> > > >> The patch provides a significant performance improvement, but I
> > > >> am wondering if any real world applications exist that would use
> > this. Only a
> > > >> "router on a stick" (i.e. a single-port router) comes to my mind,
> > and that is
> > > >> probably sufficient to call it useful in the real world. Do you
> > have any other
> > > >> examples to support the usefulness of this patch?
> > > >>
> > > > One case I have is about network security. For network firewall,
> > all packets need
> > > > to ingress on the specified port and egress on the specified port
> > to do packet filtering.
> > > > In this case, we can know flow direction in advance.
> > >
> > > I also have some concerns on how useful this API will be in real
> > life,
> > > and does the use case worth the complexity it brings.
> > > And it looks too much low level detail for the application.
> >
> > That's difficult to judge.
> > The use case is limited and the API has some severe limitations.
> > The benefit is measured with l3fwd, which is not exactly a real app.
> > Do we want an API which improves performance in limited scenarios at
> > the cost of breaking some general design assumptions?
> >
> > Can we achieve the same level of performance with a mempool trick?
> 
> Perhaps the mbuf library could offer bulk functions for alloc/free of raw
> mbufs - essentially a shortcut directly to the mempool library.
> 
> There might be a few more details to micro-optimize in the mempool library,
> if approached with this use case in mind. E.g. the
> rte_mempool_default_cache() could do with a few unlikely() in its
> comparisons.
> 
> Also, for this use case, the mempool library adds tracing overhead, which this
> API bypasses. And considering how short the code path through the mempool
> cache is, the tracing overhead is relatively much. I.e.: memcpy(NIC->NIC) vs.
> trace() memcpy(NIC->cache) trace() memcpy(cache->NIC).
> 
> A key optimization point could be the number of mbufs being moved to/from
> the mempool cache. If that number was fixed at compile time, a faster
> memcpy() could be used. However, it seems that different PMDs use bursts of
> either 4, 8, or in this case 32 mbufs. If only they could agree on such a simple
> detail.
This patch removes the stores and loads, which saves backend cycles. I do not think other optimizations can do the same.

> 
> Overall, I strongly agree that it is preferable to optimize the core libraries,
> rather than bypass them. Bypassing will eventually lead to "spaghetti code".
IMO, this is not "spaghetti code". There is no design rule in DPDK that says the RX side must allocate buffers from a mempool or the TX side must free buffers to a mempool. This patch does not break any modular boundaries, for example by accessing internal details of another library.
  
Ananyev, Konstantin Jan. 27, 2022, 4:45 p.m. UTC | #9
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Tuesday, 18 January 2022 17.54
> > >
> > > [quick summary: ethdev API to bypass mempool]
> > >
> > > 18/01/2022 16:51, Ferruh Yigit:
> > > > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > > > Morten Brørup <mb@smartsharesystems.com>:
> > > > >> The patch provides a significant performance improvement, but I
> > > > >> am wondering if any real world applications exist that would use
> > > this. Only a
> > > > >> "router on a stick" (i.e. a single-port router) comes to my mind,
> > > and that is
> > > > >> probably sufficient to call it useful in the real world. Do you
> > > have any other
> > > > >> examples to support the usefulness of this patch?
> > > > >>
> > > > > One case I have is about network security. For network firewall,
> > > all packets need
> > > > > to ingress on the specified port and egress on the specified port
> > > to do packet filtering.
> > > > > In this case, we can know flow direction in advance.
> > > >
> > > > I also have some concerns on how useful this API will be in real
> > > life,
> > > > and does the use case worth the complexity it brings.
> > > > And it looks too much low level detail for the application.
> > >
> > > That's difficult to judge.
> > > The use case is limited and the API has some severe limitations.
> > > The benefit is measured with l3fwd, which is not exactly a real app.
> > > Do we want an API which improves performance in limited scenarios at
> > > the cost of breaking some general design assumptions?
> > >
> > > Can we achieve the same level of performance with a mempool trick?
> >
> > Perhaps the mbuf library could offer bulk functions for alloc/free of raw
> > mbufs - essentially a shortcut directly to the mempool library.
> >
> > There might be a few more details to micro-optimize in the mempool library,
> > if approached with this use case in mind. E.g. the
> > rte_mempool_default_cache() could do with a few unlikely() in its
> > comparisons.
> >
> > Also, for this use case, the mempool library adds tracing overhead, which this
> > API bypasses. And considering how short the code path through the mempool
> > cache is, the tracing overhead is relatively much. I.e.: memcpy(NIC->NIC) vs.
> > trace() memcpy(NIC->cache) trace() memcpy(cache->NIC).
> >
> > A key optimization point could be the number of mbufs being moved to/from
> > the mempool cache. If that number was fixed at compile time, a faster
> > memcpy() could be used. However, it seems that different PMDs use bursts of
> > either 4, 8, or in this case 32 mbufs. If only they could agree on such a simple
> > detail.
> This patch removes the stores and loads which saves on the backend cycles. I do not think, other optimizations can do the same.

My thought here was that we could try to introduce a ZC (zero-copy) API for the mempool cache,
similar to the one we have for the ring.
Then on the TX free path we wouldn't need to copy the mbufs to be freed into a temporary array on the stack.
Instead we could put them straight from the TX SW ring into the mempool cache.
That should save an extra store/load per mbuf and might help to achieve some performance gain
without bypassing the mempool.
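
For illustration, a rough sketch of what such a zero-copy mempool-cache put could look like; the function name is hypothetical, and a complete version would also need to handle requests larger than the cache and update the mempool statistics.

#include <rte_mempool.h>

/* Return a pointer to 'n' free slots in the per-lcore mempool cache, so the
 * caller can copy mbuf pointers straight from the Tx SW ring into the cache
 * instead of staging them in a stack array first. */
static inline void **
sketch_mempool_cache_zc_put_bulk(struct rte_mempool_cache *cache,
				 struct rte_mempool *mp, unsigned int n)
{
	void **slots;

	if (cache->len + n > cache->flushthresh) {
		/* Not enough room: flush the cache to the backing ring first. */
		rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
		cache->len = 0;
	}

	slots = &cache->objs[cache->len];
	cache->len += n;
	return slots;
}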

> 
> >
> > Overall, I strongly agree that it is preferable to optimize the core libraries,
> > rather than bypass them. Bypassing will eventually lead to "spaghetti code".
> IMO, this is not "spaghetti code". There is no design rule in DPDK that says the RX side must allocate buffers from a mempool or TX side
> must free buffers to a mempool. This patch does not break any modular boundaries. For ex: access internal details of another library.

I also have a few concerns about that approach:
- the proposed implementation breaks the logical boundary between RX/TX code.
  Right now they co-exist independently, and the design of the TX path doesn't directly affect the RX path
  and vice versa. With the proposed approach the RX path needs to be aware of TX queue details and
  the mbuf freeing strategy. So if we decide to change the TX code, we probably wouldn't be able to do that
  without affecting the RX path.
  That probably can be fixed by formalizing things a bit more by introducing a new dev-ops API:
  eth_dev_tx_queue_free_mbufs(port id, queue id, mbufs_to_free[], ...)
  But that would probably eat up a significant portion of the gain you are seeing right now.

- very limited usage scenario - it will have a positive effect only when we have a fixed forwarding mapping:
  all (or nearly all) packets from the RX queue are forwarded into the same TX queue. 
  Even for l3fwd it doesn’t look like a generic scenario.

- we effectively link RX and TX queues - when this feature is enabled, the user can't stop a TX queue
  without stopping the RX queue first.
  
Morten Brørup Jan. 27, 2022, 5:13 p.m. UTC | #10
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, 27 January 2022 05.07
> 
> Thanks Morten, appreciate your comments. Few responses inline.
> 
> > -----Original Message-----
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Sunday, December 26, 2021 4:25 AM
> >
> > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > Sent: Friday, 24 December 2021 17.46
> > >
> <snip>
> 
> > >
> > > However, this solution poses several constraint:
> > >
> > > 1)The receive queue needs to know which transmit queue it should
> take
> > > the buffers from. The application logic decides which transmit port
> to
> > > use to send out the packets. In many use cases the NIC might have a
> > > single port ([1], [2], [3]), in which case a given transmit queue
> is
> > > always mapped to a single receive queue (1:1 Rx queue: Tx queue).
> This
> > > is easy to configure.
> > >
> > > If the NIC has 2 ports (there are several references), then we will
> > > have
> > > 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> > > However, if this is generalized to 'N' ports, the configuration can
> be
> > > long. More over the PMD would have to scan a list of transmit
> queues
> > > to pull the buffers from.
> >
> > I disagree with the description of this constraint.
> >
> > As I understand it, it doesn't matter now many ports or queues are in
> a NIC or
> > system.
> >
> > The constraint is more narrow:
> >
> > This patch requires that all packets ingressing on some port/queue
> must
> > egress on the specific port/queue that it has been configured to ream
> its
> > buffers from. I.e. an application cannot route packets between
> multiple ports
> > with this patch.
> Agree, this patch as is has this constraint. It is not a constraint
> that would apply for NICs with single port. The above text is
> describing some of the issues associated with generalizing the solution
> for N number of ports. If N is small, the configuration is small and
> scanning should not be bad.
> 

Perhaps we can live with the 1:1 limitation, if that is the primary use case.

Alternatively, the feature could fall back to using the mempool if unable to get/put buffers directly from/to a participating NIC. In this case, I envision a library serving as a shim layer between the NICs and the mempool. In other words: Take a step back from the implementation, and discuss the high level requirements and architecture of the proposed feature.

> >
> > >
> 
> <snip>
> 
> > >
> >
> > You are missing the fourth constraint:
> >
> > 4) The application must transmit all received packets immediately,
> i.e. QoS
> > queueing and similar is prohibited.
> I do not understand this, can you please elaborate?. Even if there is
> QoS queuing, there would be steady stream of packets being transmitted.
> These transmitted packets will fill the buffers on the RX side.

E.g. an appliance may receive packets on a 10 Gbps backbone port, and queue some of the packets up for a customer with a 20 Mbit/s subscription. When there is a large burst of packets towards that subscriber, they will queue up in the QoS queue dedicated to that subscriber. During that traffic burst, there is much more RX than TX. And after the traffic burst, there will be more TX than RX.

> 
> >
> <snip>
> 
> > >
> >
> > The patch provides a significant performance improvement, but I am
> > wondering if any real world applications exist that would use this.
> Only a
> > "router on a stick" (i.e. a single-port router) comes to my mind, and
> that is
> > probably sufficient to call it useful in the real world. Do you have
> any other
> > examples to support the usefulness of this patch?
> SmartNIC is a clear and dominant use case, typically they have a single
> port for data plane traffic (dual ports are mostly for redundancy)
> This patch avoids good amount of store operations. The smaller CPUs
> found in SmartNICs have smaller store buffers which can become
> bottlenecks. Avoiding the lcore cache saves valuable HW cache space.

OK. This is an important use case!

> 
> >
> > Anyway, the patch doesn't do any harm if unused, and the only
> performance
> > cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev
> driver. So I
> > don't oppose to it.
> >
>
  
Morten Brørup Jan. 28, 2022, 11:29 a.m. UTC | #11
> From: Morten Brørup
> Sent: Thursday, 27 January 2022 18.14
> 
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Thursday, 27 January 2022 05.07
> >
> > Thanks Morten, appreciate your comments. Few responses inline.
> >
> > > -----Original Message-----
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: Sunday, December 26, 2021 4:25 AM
> > >
> > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > Sent: Friday, 24 December 2021 17.46
> > > >
> > <snip>
> >
> > > >
> > > > However, this solution poses several constraint:
> > > >
> > > > 1)The receive queue needs to know which transmit queue it should
> > take
> > > > the buffers from. The application logic decides which transmit
> port
> > to
> > > > use to send out the packets. In many use cases the NIC might have
> a
> > > > single port ([1], [2], [3]), in which case a given transmit queue
> > is
> > > > always mapped to a single receive queue (1:1 Rx queue: Tx queue).
> > This
> > > > is easy to configure.
> > > >
> > > > If the NIC has 2 ports (there are several references), then we
> will
> > > > have
> > > > 1:2 (RX queue: TX queue) mapping which is still easy to
> configure.
> > > > However, if this is generalized to 'N' ports, the configuration
> can
> > be
> > > > long. More over the PMD would have to scan a list of transmit
> > queues
> > > > to pull the buffers from.
> > >
> > > I disagree with the description of this constraint.
> > >
> > > As I understand it, it doesn't matter now many ports or queues are
> in
> > a NIC or
> > > system.
> > >
> > > The constraint is more narrow:
> > >
> > > This patch requires that all packets ingressing on some port/queue
> > must
> > > egress on the specific port/queue that it has been configured to
> ream
> > its
> > > buffers from. I.e. an application cannot route packets between
> > multiple ports
> > > with this patch.
> > Agree, this patch as is has this constraint. It is not a constraint
> > that would apply for NICs with single port. The above text is
> > describing some of the issues associated with generalizing the
> solution
> > for N number of ports. If N is small, the configuration is small and
> > scanning should not be bad.

But I think N is the number of queues, not the number of ports.

> >
> 
> Perhaps we can live with the 1:1 limitation, if that is the primary use
> case.

Or some similar limitation for NICs with dual ports for redundancy.

> 
> Alternatively, the feature could fall back to using the mempool if
> unable to get/put buffers directly from/to a participating NIC. In this
> case, I envision a library serving as a shim layer between the NICs and
> the mempool. In other words: Take a step back from the implementation,
> and discuss the high level requirements and architecture of the
> proposed feature.

Please ignore my comment above. I had missed the fact that the direct re-arm feature only works inside a single NIC, and not across multiple NICs. And it is not going to work across multiple NICs, unless they are exactly the same type, because their internal descriptor structures may differ. Also, taking a deeper look at the i40e part of the patch, I notice that it already falls back to using the mempool.

> 
> > >
> > > >
> >
> > <snip>
> >
> > > >
> > >
> > > You are missing the fourth constraint:
> > >
> > > 4) The application must transmit all received packets immediately,
> > i.e. QoS
> > > queueing and similar is prohibited.
> > I do not understand this, can you please elaborate?. Even if there is
> > QoS queuing, there would be steady stream of packets being
> transmitted.
> > These transmitted packets will fill the buffers on the RX side.
> 
> E.g. an appliance may receive packets on a 10 Gbps backbone port, and
> queue some of the packets up for a customer with a 20 Mbit/s
> subscription. When there is a large burst of packets towards that
> subscriber, they will queue up in the QoS queue dedicated to that
> subscriber. During that traffic burst, there is much more RX than TX.
> And after the traffic burst, there will be more TX than RX.
> 
> >
> > >
> > <snip>
> >
> > > >
> > >
> > > The patch provides a significant performance improvement, but I am
> > > wondering if any real world applications exist that would use this.
> > Only a
> > > "router on a stick" (i.e. a single-port router) comes to my mind,
> and
> > that is
> > > probably sufficient to call it useful in the real world. Do you
> have
> > any other
> > > examples to support the usefulness of this patch?
> > SmartNIC is a clear and dominant use case, typically they have a
> single
> > port for data plane traffic (dual ports are mostly for redundancy)
> > This patch avoids good amount of store operations. The smaller CPUs
> > found in SmartNICs have smaller store buffers which can become
> > bottlenecks. Avoiding the lcore cache saves valuable HW cache space.
> 
> OK. This is an important use case!

Some NICs have many queues, so the number of RX/TX queue mappings is big. Aren't SmartNICs going to use many RX/TX queues?

> 
> >
> > >
> > > Anyway, the patch doesn't do any harm if unused, and the only
> > performance
> > > cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev
> > driver. So I
> > > don't oppose to it.

If a PMD maintainer agrees to maintaining such a feature, I don't oppose either.

The PMDs are full of cruft already, so why bother complaining about more, if the performance impact is negligible. :-)
  
Honnappa Nagarahalli Feb. 2, 2022, 7:46 p.m. UTC | #12
<snip>

> 
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > Sent: Tuesday, 18 January 2022 17.54
> > > >
> > > > [quick summary: ethdev API to bypass mempool]
> > > >
> > > > 18/01/2022 16:51, Ferruh Yigit:
> > > > > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > > > > Morten Brørup <mb@smartsharesystems.com>:
> > > > > >> The patch provides a significant performance improvement, but
> > > > > >> I am wondering if any real world applications exist that
> > > > > >> would use
> > > > this. Only a
> > > > > >> "router on a stick" (i.e. a single-port router) comes to my
> > > > > >> mind,
> > > > and that is
> > > > > >> probably sufficient to call it useful in the real world. Do
> > > > > >> you
> > > > have any other
> > > > > >> examples to support the usefulness of this patch?
> > > > > >>
> > > > > > One case I have is about network security. For network
> > > > > > firewall,
> > > > all packets need
> > > > > > to ingress on the specified port and egress on the specified
> > > > > > port
> > > > to do packet filtering.
> > > > > > In this case, we can know flow direction in advance.
> > > > >
> > > > > I also have some concerns on how useful this API will be in real
> > > > life,
> > > > > and does the use case worth the complexity it brings.
> > > > > And it looks too much low level detail for the application.
> > > >
> > > > That's difficult to judge.
> > > > The use case is limited and the API has some severe limitations.
> > > > The benefit is measured with l3fwd, which is not exactly a real app.
> > > > Do we want an API which improves performance in limited scenarios
> > > > at the cost of breaking some general design assumptions?
> > > >
> > > > Can we achieve the same level of performance with a mempool trick?
> > >
> > > Perhaps the mbuf library could offer bulk functions for alloc/free
> > > of raw mbufs - essentially a shortcut directly to the mempool library.
> > >
> > > There might be a few more details to micro-optimize in the mempool
> > > library, if approached with this use case in mind. E.g. the
> > > rte_mempool_default_cache() could do with a few unlikely() in its
> > > comparisons.
> > >
> > > Also, for this use case, the mempool library adds tracing overhead,
> > > which this API bypasses. And considering how short the code path
> > > through the mempool cache is, the tracing overhead is relatively
> > > large. I.e.: memcpy(NIC->NIC) vs. trace() memcpy(NIC->cache)
> > > trace() memcpy(cache->NIC).
> > >
> > > A key optimization point could be the number of mbufs being moved
> > > to/from the mempool cache. If that number was fixed at compile time,
> > > a faster
> > > memcpy() could be used. However, it seems that different PMDs use
> > > bursts of either 4, 8, or in this case 32 mbufs. If only they could
> > > agree on such a simple detail.
> > This patch removes the stores and loads, which saves backend cycles.
> > I do not think other optimizations can do the same.
> 
> My thought here was that we can try to introduce a ZC API for the
> mempool cache, similar to the one we have for the ring.
> Then on the TX free path we wouldn't need to copy the mbufs to be
> freed into a temporary array on the stack.
> Instead we can put them straight from the TX SW ring into the mempool
> cache. That should save an extra store/load per mbuf and might help to
> achieve some performance gain without bypassing the mempool.
Agree, it will remove one set of loads and stores, but not all of them. I am not sure if it can solve the performance problems. We will give it a try.
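
For illustration only, a minimal sketch of that zero-copy put idea on the Tx
free path. The rte_mempool_cache_zc_put_bulk() name follows the mempool ZC
proposal referenced later in this thread and was not an available DPDK API at
the time of this discussion; mbuf refcount/prefree handling is omitted, so
this is not actual driver code:

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Sketch: free a burst of Tx-completed mbufs straight into the lcore's
 * mempool cache, skipping the temporary array on the stack. */
static inline void
tx_free_bufs_zc(struct rte_mbuf **txep, uint16_t n, struct rte_mempool *mp)
{
	struct rte_mempool_cache *cache;
	void **cache_objs;
	uint16_t i;

	cache = rte_mempool_default_cache(mp, rte_lcore_id());
	cache_objs = rte_mempool_cache_zc_put_bulk(cache, mp, n);
	if (cache_objs == NULL)
		return;	/* a real driver would fall back to rte_mempool_put_bulk() */

	/* One set of stores: Tx SW ring -> mempool cache. */
	for (i = 0; i < n; i++)
		cache_objs[i] = txep[i];
}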

> 
> >
> > >
> > > Overall, I strongly agree that it is preferable to optimize the core
> > > libraries, rather than bypass them. Bypassing will eventually lead
> > > to "spaghetti code".
> > IMO, this is not "spaghetti code". There is no design rule in DPDK
> > that says the RX side must allocate buffers from a mempool or the TX
> > side must free buffers to a mempool. This patch does not break any
> > modular boundaries; for example, it does not access internal details
> > of another library.
> 
> I also have a few concerns about that approach:
> - The proposed implementation breaks the logical boundary between the
>   RX and TX code. Right now they co-exist independently, and the
>   design of the TX path doesn't directly affect the RX path and vice
>   versa. With the proposed approach the RX path needs to be aware of
>   the TX queue details and the mbuf freeing strategy. So if we decide
>   to change the TX code, we probably would not be able to do that
>   without affecting the RX path.
Agree that now both paths will be coupled in the areas you have mentioned. This is happening within the driver code. From the application perspective, they still remain separated. Also, I do not see that the TX free strategy has changed much.

>   That probably can be fixed by formalizing things a bit more by
>   introducing a new dev-ops API:
>   eth_dev_tx_queue_free_mbufs(port id, queue id, mbufs_to_free[], ...)
>   But that would probably eat up a significant portion of the gain you
>   are seeing right now.
> 
> - Very limited usage scenario - it will have a positive effect only
>   when we have a fixed forwarding mapping: all (or nearly all) packets
>   from the RX queue are forwarded into the same TX queue.
>   Even for l3fwd it doesn't look like a generic scenario.
Agree, it is limited to a few scenarios. But the scenario itself is a major one.
I think it is possible to have some logic (based on the port mask and the routes involved) to enable this feature. We will try to add that in the next version.

> 
> - We effectively link RX and TX queues - when this feature is enabled,
>   the user can't stop a TX queue without stopping the RX queue first.
Agree. How much of an issue is this? I would think when the application is shutting down, one would stop the RX side first. Are there any other scenarios we need to be aware of?

> 
>
  
Feifei Wang Feb. 28, 2023, 6:43 a.m. UTC | #13
Hi, Ferruh

This email summarizes our latest improvement work on direct-rearm and we
hope it addresses some of the concerns about direct-rearm.

Best Regards
Feifei

> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Tuesday, January 18, 2022 11:52 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Morten Brørup
> <mb@smartsharesystems.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; thomas@monjalon.net; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Qi Zhang
> <qi.z.zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>
> Subject: Re: Re: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive
> side
> 
> On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > Thanks for your comments.
> >
> >> -----Original Message-----
> >> From: Morten Brørup <mb@smartsharesystems.com>
> >> Sent: Sunday, December 26, 2021 6:25 PM
> >> To: Feifei Wang <Feifei.Wang2@arm.com>
> >> Cc: dev@dpdk.org; nd <nd@arm.com>
> >> Subject: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive
> >> side
> >>
> >>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> >>> Sent: Friday, 24 December 2021 17.46
> >>>
> >>> Currently, the transmit side frees the buffers into the lcore cache
> >>> and the receive side allocates buffers from the lcore cache. The
> >>> transmit side typically frees 32 buffers resulting in 32*8=256B of
> >>> stores to lcore cache. The receive side allocates 32 buffers and
> >>> stores them in the receive side software ring, resulting in
> >>> 32*8=256B of stores and 256B of load from the lcore cache.
> >>>
> >>> This patch proposes a mechanism to avoid freeing to/allocating from
> >>> the lcore cache. i.e. the receive side will free the buffers from
> >>> transmit side directly into it's software ring. This will avoid the
> >>> 256B of loads and stores introduced by the lcore cache. It also
> >>> frees up the cache lines used by the lcore cache.
> >>>
> >>> However, this solution poses several constraint:
> >>>
> >>> 1)The receive queue needs to know which transmit queue it should
> >>> take the buffers from. The application logic decides which transmit
> >>> port to use to send out the packets. In many use cases the NIC might
> >>> have a single port ([1], [2], [3]), in which case a given transmit
> >>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
> >>> queue). This is easy to configure.
> >>>
> >>> If the NIC has 2 ports (there are several references), then we will
> >>> have
> >>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>> However, if this is generalized to 'N' ports, the configuration can
> >>> be long. More over the PMD would have to scan a list of transmit
> >>> queues to pull the buffers from.
> >>
> >> I disagree with the description of this constraint.
> >>
> >> As I understand it, it doesn't matter how many ports or queues are in
> >> a NIC or system.
> >>
> >> The constraint is more narrow:
> >>
> >> This patch requires that all packets ingressing on some port/queue
> >> must egress on the specific port/queue that it has been configured to
> >> rearm its buffers from. I.e. an application cannot route packets
> >> between multiple ports with this patch.
> >
> > First, I agree that direct-rearm mode is suitable for the case where
> > the user knows the direction of the flow in advance and maps rx/tx
> > queues to each other. It is not suitable for the normal random packet
> > routing case.
> >
> > Second, our two proposed cases, a one-port NIC and a two-port NIC,
> > mean the direction of the flow is determined. Furthermore, for a
> > two-port NIC, there may be two flow directions: from port 0 to port 1,
> > or from port 0 to port 0. Thus we need to have a
> > 1:2 (Rx queue : Tx queue) mapping.
> >
> > Finally, maybe we can change our description as follows:
> > "The first constraint is that the user should know the direction of
> > the flow in advance, and based on this, the user needs to map the Rx
> > and Tx queues according to the flow direction:
> > For example, if the NIC just has one port
> >   ......
> > Or if the NIC has two ports
> > ......."
> >
> >>
> >>>
> >>> 2)The other factor that needs to be considered is 'run-to-completion'
> >>> vs
> >>> 'pipeline' models. In the run-to-completion model, the receive side
> >>> and the transmit side are running on the same lcore serially. In the
> >>> pipeline model. The receive side and transmit side might be running
> >>> on different lcores in parallel. This requires locking. This is not
> >>> supported at this point.
> >>>
> >>> 3)Tx and Rx buffers must be from the same mempool. And we also must
> >>> ensure Tx buffer free number is equal to Rx buffer free number:
> >>> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH) Thus,
> 'tx_next_dd'
> >>> can be updated correctly in direct-rearm mode. This is due to
> >>> tx_next_dd is a variable to compute tx sw-ring free location.
> >>> Its value will be one more round than the position where next time
> >>> free starts.
> >>>
> >>
> >> You are missing the fourth constraint:
> >>
> >> 4) The application must transmit all received packets immediately,
> >> i.e. QoS queueing and similar is prohibited.
> >>
> >
> > You are right and this is indeed one of the limitations.
> >
> >>> Current status in this RFC:
> >>> 1)An API is added to allow for mapping a TX queue to a RX queue.
> >>>    Currently it supports 1:1 mapping.
> >>> 2)The i40e driver is changed to do the direct re-arm of the receive
> >>>    side.
> >>> 3)L3fwd application is hacked to do the mapping for the following
> >>> command:
> >>>    one core two flows case:
> >>>    $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
> >>>    -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
> >>>    where:
> >>>    Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
> >>>    Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
> >>>
> >>> Testing status:
> >>> 1)Tested L3fwd with the above command:
> >>> The testing results for L3fwd are as follows:
> >>> -------------------------------------------------------------------
> >>> N1SDP:
> >>> Base performance(with this patch)   with direct re-arm mode enabled
> >>>        0%                                  +14.1%
> >>>
> >>> Ampere Altra:
> >>> Base performance(with this patch)   with direct re-arm mode enabled
> >>>        0%                                  +17.1%
> >>> -------------------------------------------------------------------
> >>> This patch can not affect performance of normal mode, and if enable
> >>> direct-rearm mode, performance can be improved by 14% - 17% in n1sdp
> >>> and ampera-altra.
> >>>
> >>> Feedback requested:
> >>> 1) Has anyone done any similar experiments, any lessons learnt?
> >>> 2) Feedback on API
> >>>
> >>> Next steps:
> >>> 1) Update the code for supporting 1:N(Rx : TX) mapping
> >>> 2) Automate the configuration in L3fwd sample application
> >>>
> >>> Reference:
> >>> [1] https://store.nvidia.com/en-
> >>> us/networking/store/product/MCX623105AN-
> >>>
> >>
> CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECrypt
> >> oDisabled
> >>> / [2]
> >>>
> https://www.intel.com/content/www/us/en/products/sku/192561/intel-
> >>> ethernet-network-adapter-e810cqda1/specifications.html
> >>> [3] https://www.broadcom.com/products/ethernet-
> >> connectivity/network-
> >>> adapters/100gb-nic-ocp/n1100g
> >>>
> >>> Feifei Wang (4):
> >>>    net/i40e: enable direct re-arm mode
> >>>    ethdev: add API for direct re-arm mode
> >>>    net/i40e: add direct re-arm mode internal API
> >>>    examples/l3fwd: give an example for direct rearm mode
> >>>
> >>>   drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
> >>>   drivers/net/i40e/i40e_rxtx.h          |   4 +
> >>>   drivers/net/i40e/i40e_rxtx_vec_neon.c | 149
> >> +++++++++++++++++++++++++-
> >>>   examples/l3fwd/main.c                 |   3 +
> >>>   lib/ethdev/ethdev_driver.h            |  15 +++
> >>>   lib/ethdev/rte_ethdev.c               |  14 +++
> >>>   lib/ethdev/rte_ethdev.h               |  31 ++++++
> >>>   lib/ethdev/version.map                |   3 +
> >>>   8 files changed, 251 insertions(+), 2 deletions(-)
> >>>
> >>> --
> >>> 2.25.1
> >>>
> >>
> >> The patch provides a significant performance improvement, but I am
> >> wondering if any real world applications exist that would use this.
> >> Only a "router on a stick" (i.e. a single-port router) comes to my
> >> mind, and that is probably sufficient to call it useful in the real
> >> world. Do you have any other examples to support the usefulness of
> >> this patch?
> >>
> > One case I have is about network security. For a network firewall,
> > all packets need to ingress on the specified port and egress on the
> > specified port to do packet filtering.
> > In this case, we can know the flow direction in advance.
> >
> 
> I also have some concerns on how useful this API will be in real life,
> and whether the use case is worth the complexity it brings.
> And it looks like too much low-level detail for the application.

Concerns about direct rearm:
1. The earlier version of the design required the rxq/txq pairing to be done before
starting the data plane threads. This required the user to know the direction
of the packet flow in advance, which limited the use cases.

In the latest version, direct-rearm mode is packaged as a separate API.
This allows users to change the rxq/txq pairing at run time in the data plane,
according to the application's analysis of the packet flow, for example:
------------------------------------------------------------------------------------------------------------
Step 1: the upper application analyses the flow direction
Step 2: rxq_rearm_data = rte_eth_rx_get_rearm_data(rx_portid, rx_queueid)
Step 3: rte_eth_dev_direct_rearm(rx_portid, rx_queueid, tx_portid, tx_queueid, rxq_rearm_data);
Step 4: rte_eth_rx_burst(rx_portid, rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid, tx_queueid);
------------------------------------------------------------------------------------------------------------
The above lets the user change the rxq/txq pairing at runtime; the user does not need to
know the direction of the flow in advance. This effectively expands the direct-rearm
use scenarios.
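
For illustration only, a rough sketch of such a data-plane loop.
rte_eth_rx_get_rearm_data() and rte_eth_dev_direct_rearm() are the proposed
API names from the steps above (not part of a released DPDK), the rearm-data
structure is the one shown in point 2 below, and the burst size and
port/queue ids are placeholders:
------------------------------------------------------------------------------------------------------------
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Sketch: forwarding loop with runtime rxq/txq pairing. */
static void
fwd_loop_direct_rearm(uint16_t rx_portid, uint16_t rx_queueid,
		      uint16_t tx_portid, uint16_t tx_queueid)
{
	struct rte_eth_rxq_rearm_data rxq_rearm_data;
	struct rte_mbuf *pkts[32];
	uint16_t nb_rx, nb_tx;

	/* Pairing can be (re)established at run time, once the application
	 * knows which Tx queue this Rx queue's traffic egresses on. */
	rxq_rearm_data = rte_eth_rx_get_rearm_data(rx_portid, rx_queueid);

	for (;;) {
		/* Refill the Rx ring straight from the paired Tx queue's
		 * completed buffers. */
		rte_eth_dev_direct_rearm(rx_portid, rx_queueid,
					 tx_portid, tx_queueid, rxq_rearm_data);

		nb_rx = rte_eth_rx_burst(rx_portid, rx_queueid, pkts, 32);
		if (nb_rx == 0)
			continue;

		nb_tx = rte_eth_tx_burst(tx_portid, tx_queueid, pkts, nb_rx);
		/* Packets not accepted by Tx must still be freed normally. */
		while (nb_tx < nb_rx)
			rte_pktmbuf_free(pkts[nb_tx++]);
	}
}
------------------------------------------------------------------------------------------------------------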

2. The earlier version of direct rearm broke the independence between the RX and TX paths.
In the latest version, we use a structure to let Rx and Tx interact, for example:
-----------------------------------------------------------------------------------------------------------------------------------
struct rte_eth_rxq_rearm_data {
       struct rte_mbuf **buf_ring; /**< Buffer ring of Rx queue. */
       uint16_t *refill_head;            /**< Head of buffer ring refilling descriptors. */
       uint16_t *receive_tail;          /**< Tail of buffer ring receiving pkts. */
       uint16_t nb_buf;                    /**< configured number of buffer ring. */
}  rxq_rearm_data;

data path:
	/* Get direct-rearm info for a receive queue of an Ethernet device. */
	rxq_rearm_data = rte_eth_rx_get_rearm_data(rx_portid, rx_queueid);
	rte_eth_dev_direct_rearm(rx_portid, rx_queueid, tx_portid, tx_queueid, rxq_rearm_data) {

		/*  Using Tx used buffer to refill Rx buffer ring in direct rearm mode */
		nb_rearm = rte_eth_tx_fill_sw_ring(tx_portid, tx_queueid, rxq_rearm_data );

		/* Flush Rx descriptor in direct rearm mode */
		rte_eth_rx_flush_descs(rx_portid, rx_queueid, nb_rearm);
	}
	rte_eth_rx_burst(rx_portid,rx_queueid);
	rte_eth_tx_burst(tx_portid,tx_queueid);
-----------------------------------------------------------------------------------------------------------------------------------
Furthermore, this means direct-rearm usage is no longer limited to a single PMD:
it can move buffers between different vendors' PMDs, and the buffers can even be placed
anywhere into the Rx buffer ring as long as the address of the buffer ring can be provided.
In the latest version, we enable direct-rearm in the i40e and ixgbe PMDs, and we also tried
using the i40e driver on Rx and the ixgbe driver on Tx, achieving a 7-9% performance improvement
with direct-rearm.
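
As a rough illustration of that interaction (a sketch only, not actual driver
code: it assumes the Rx ring size is a power of two and omits descriptor
flushing, wrap-around corner cases and error handling), the Tx side could hand
its completed buffers to the Rx software ring through the structure above
roughly like this:
------------------------------------------------------------------------------------------------------------
#include <rte_mbuf.h>

/* Sketch: hand Tx-completed mbufs straight to the paired Rx queue's
 * software ring via the proposed rearm-data handle defined above. */
static inline uint16_t
tx_fill_rx_sw_ring(struct rte_mbuf **tx_done, uint16_t nb_done,
		   struct rte_eth_rxq_rearm_data *rd)
{
	uint16_t head = *rd->refill_head;
	uint16_t i;

	for (i = 0; i < nb_done; i++) {
		/* Place the Tx-completed buffer directly into the Rx ring;
		 * nb_buf is assumed to be a power of two here. */
		rd->buf_ring[head] = tx_done[i];
		head = (head + 1) & (rd->nb_buf - 1);
	}
	*rd->refill_head = head;

	return nb_done;
}
------------------------------------------------------------------------------------------------------------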

3. Difference between direct rearm, ZC API used in mempool  and general path
For general path: 
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For ZC API used in mempool:
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
                Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For direct_rearm:
                Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
Thus we can see that in one loop, compared to the general path, direct rearm removes the copying of 32+32=64 pkts;
compared to the ZC API used in the mempool, direct rearm removes the copying of 32 pkts in each loop.
So direct_rearm has its own benefits.

4. Performance test and real cases
For the performance test, in l3fwd, we achieved a performance improvement of up to 15% on an Arm server.
For real cases, we have enabled direct-rearm in VPP and achieved a performance improvement there as well.

> 
> cc'ed a few more folks for comment.
> 
> >> Anyway, the patch doesn't do any harm if unused, and the only
> >> performance cost is the "if (rxq->direct_rxrearm_enable)" branch in
> >> the Ethdev driver. So I don't oppose it.
> >>
> >
  
Feifei Wang Feb. 28, 2023, 6:52 a.m. UTC | #14
CC to the right e-mail address.
