[RFC,v1] net/i40e: put mempool cache out of API

Message ID 20220613055136.1949784-1-feifei.wang2@arm.com (mailing list archive)
State RFC, archived
Delegated to: Qi Zhang
Headers
Series [RFC,v1] net/i40e: put mempool cache out of API |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation fail Compilation issues

Commit Message

Feifei Wang June 13, 2022, 5:51 a.m. UTC
  Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache
out of API to free buffers directly. There are two changes different
with previous version:
1. change txep from "i40e_entry" to "i40e_vec_entry"
2. put cache out of "mempool_bulk" API to copy buffers into it directly

Performance Test with l3fwd neon path:
		with this patch
n1sdp:		no perforamnce change
amper-altra:	+4.0%

Suggested-by: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
---
 drivers/net/i40e/i40e_rxtx_vec_common.h | 36 ++++++++++++++++++++-----
 drivers/net/i40e/i40e_rxtx_vec_neon.c   | 10 ++++---
 2 files changed, 36 insertions(+), 10 deletions(-)
  

Comments

Morten Brørup June 13, 2022, 9:58 a.m. UTC | #1
> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Monday, 13 June 2022 07.52
> 
> Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache
> out of API to free buffers directly. There are two changes different
> with previous version:
> 1. change txep from "i40e_entry" to "i40e_vec_entry"
> 2. put cache out of "mempool_bulk" API to copy buffers into it directly
> 
> Performance Test with l3fwd neon path:
> 		with this patch
> n1sdp:		no perforamnce change
> amper-altra:	+4.0%
> 
> Suggested-by: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> ---

Once again bypassing the mempool API and copy-pasting internal code from the mempool library to a PMD for performance optimization.

Certainly not happy about it, but it's already done elsewhere, and thus we should not deny it here...

Acked-by: Morten Brørup <mb@smartsharesystems.com>
  
Honnappa Nagarahalli June 22, 2022, 7:07 p.m. UTC | #2
<snip>

> Subject: RE: [RFC PATCH v1] net/i40e: put mempool cache out of API
> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Monday, 13 June 2022 07.52
> >
> > Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache out
> > of API to free buffers directly. There are two changes different with
> > previous version:
> > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > 2. put cache out of "mempool_bulk" API to copy buffers into it
> > directly
> >
> > Performance Test with l3fwd neon path:
> > 		with this patch
> > n1sdp:		no perforamnce change
> > amper-altra:	+4.0%
> >
> > Suggested-by: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > ---
> 
> Once again bypassing the mempool API and copy-pasting internal code from
> the mempool library to a PMD for performance optimization.
> 
> Certainly not happy about it, but it's already done elsewhere, and thus we
> should not deny it here...
> 
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
This was done with the intention to understand the performance of the zero-copy mempool APIs that Konstantin suggested. Even though this patch does not have the zero-copy mempool APIs, it reflects the code/instructions of creating such as API. We are trying to show here that zero-copy mempool APIs will not provide equivalent performance of direct-rearm method.
  
Konstantin Ananyev July 3, 2022, 12:20 p.m. UTC | #3
> Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache
> out of API to free buffers directly. There are two changes different
> with previous version:
> 1. change txep from "i40e_entry" to "i40e_vec_entry"
> 2. put cache out of "mempool_bulk" API to copy buffers into it directly
> 
> Performance Test with l3fwd neon path:
> 		with this patch
> n1sdp:		no perforamnce change
> amper-altra:	+4.0%
> 


Thanks for RFC, appreciate your effort.
So, as I understand - bypassing mempool put/get itself
gives about 7-10% speedup for RX/TX on ARM platforms,
correct?

About direct-rearm RX approach you propose:
After another thought, probably it is possible to
re-arrange it in a way that would help avoid related negatives.
The basic idea as follows:

1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
    Also make sw_ring de-coupled from RXQ itself, i.e:
    when RXQ is stopped or even destroyed, related sw_ring may still
    exist (probably ref-counter or RCU would be sufficient here).
    All that means we need a common layout/api for rxq_sw_ring
    and PMDs that would like to support direct-rearming will have to
    use/obey it.

2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
    at txq_free_bufs() try to store released mbufs inside attached
    sw_ring directly. If there is no attached sw_ring, or not enough
    free space in it - continue with mempool_put() as usual.
    Note that actual arming of HW RXDs still remains responsibility
    of RX code-path:
    rxq_rearm(rxq) {
      ...
      - check are there are N already filled entries inside rxq_sw_ring.
        if not, populate them from mempool (usual mempool_get()).
      - arm related RXDs and mark these sw_ring entries as managed by HW.
      ...
    }


So rxq_sw_ring will serve two purposes:
- track mbufs that are managed by HW (that what it does now)
- private (per RXQ) mbuf cache

Now, if TXQ is stopped while RXQ is running -
no extra synchronization is required, RXQ would just use
mempool_get() to rearm its sw_ring itself.

If RXQ is stopped while TXQ is still running -
TXQ can still continue to populate related sw_ring till it gets full.
Then it will continue with mempool_put() as usual.
Of-course it means that user who wants to use this feature should 
probably account some extra mbufs for such case, or might be
rxq_sw_ring can have enable/disable flag to mitigate such situation.

As another benefit here - such approach makes possible to use
several TXQs (even from different devices) to rearm same RXQ.

Have to say, that I am still not sure that 10% RX/TX improvement is 
worth bypassing mempool completely and introducing all this extra 
complexity in RX/TX path.
But, if we'll still decide to go ahead with direct-rearming, this
re-arrangement, I think, should help to keep things clear and
avoid introducing new limitations in existing functionality.

WDYT?

Konstantin
  
Feifei Wang July 6, 2022, 8:52 a.m. UTC | #4
> -----邮件原件-----
> 发件人: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> 发送时间: Sunday, July 3, 2022 8:20 PM
> 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Yuying Zhang
> <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>; Ruifeng
> Wang <Ruifeng.Wang@arm.com>
> 抄送: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> 主题: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> 
> 
> > Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache out
> > of API to free buffers directly. There are two changes different with
> > previous version:
> > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > 2. put cache out of "mempool_bulk" API to copy buffers into it
> > directly
> >
> > Performance Test with l3fwd neon path:
> > 		with this patch
> > n1sdp:		no perforamnce change
> > amper-altra:	+4.0%
> >
> 
> 
Thanks for your detailed comments.

> Thanks for RFC, appreciate your effort.
> So, as I understand - bypassing mempool put/get itself gives about 7-10%
> speedup for RX/TX on ARM platforms, correct?
[Feifei] Yes.

> 
> About direct-rearm RX approach you propose:
> After another thought, probably it is possible to re-arrange it in a way that
> would help avoid related negatives.
> The basic idea as follows:
> 
> 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
>     Also make sw_ring de-coupled from RXQ itself, i.e:
>     when RXQ is stopped or even destroyed, related sw_ring may still
>     exist (probably ref-counter or RCU would be sufficient here).
>     All that means we need a common layout/api for rxq_sw_ring
>     and PMDs that would like to support direct-rearming will have to
>     use/obey it.
[Feifei] de-coupled sw-ring and RXQ may cause dangerous case due to
RXQ is stopped but elements of it (sw-ring) is still kept and we may forget
to free this sw-ring in the end.
Furthermore,  if we apply this, we need to separate operation when closing
RXQ and add Rx sw-ring free operation when closing TXQ. This will be complex
and it is not conducive to subsequent maintenance if maintainer does not
understand direct-rearm mode very well.

> 
> 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
>     at txq_free_bufs() try to store released mbufs inside attached
>     sw_ring directly. If there is no attached sw_ring, or not enough
>     free space in it - continue with mempool_put() as usual.
>     Note that actual arming of HW RXDs still remains responsibility
>     of RX code-path:
>     rxq_rearm(rxq) {
>       ...
>       - check are there are N already filled entries inside rxq_sw_ring.
>         if not, populate them from mempool (usual mempool_get()).
>       - arm related RXDs and mark these sw_ring entries as managed by HW.
>       ...
>     }
> 
[Feifei] We try to create two modes, one is direct-rearm and the other is direct-free like above.
And by performance comparison, we select direct-rearm which improve performance by
7% - 14% compared with direct-free by 3.6% - 7% in n1sdp. 
Furthermore, I think put direct mode in Tx or Rx is equivalent. For direct-rearm, if there is no
Tx sw-ring, Rx will get mbufs from mempool. For direct-fee, if there is no Rx sw-ring, Tx will put
mbufs into mempool. At last, what affects our decision-making is the improvement of performance.

> 
> So rxq_sw_ring will serve two purposes:
> - track mbufs that are managed by HW (that what it does now)
> - private (per RXQ) mbuf cache
> 
> Now, if TXQ is stopped while RXQ is running - no extra synchronization is
> required, RXQ would just use
> mempool_get() to rearm its sw_ring itself.
> 
> If RXQ is stopped while TXQ is still running - TXQ can still continue to populate
> related sw_ring till it gets full.
> Then it will continue with mempool_put() as usual.
> Of-course it means that user who wants to use this feature should probably
> account some extra mbufs for such case, or might be rxq_sw_ring can have
> enable/disable flag to mitigate such situation.
> 
[Feifei] For direct-rearm, the key point should be the communication between TXQ
and RXQ when TXQ is stopped. De-coupled sw-ring is complex, maybe we can simplify
this and assign this to the application. My thought is that if direct-rearm is enabled, when
users want to close TX port, they must firstly close mapped RX port and disable direct-rearm
feature. Then they can restart RX port.

> As another benefit here - such approach makes possible to use several TXQs
> (even from different devices) to rearm same RXQ.
[Feifei] Actually, for direct-rearm, it can use several RXQs to rearm same TXQ, so this
is equivalent for direct-rearm and direct-free. Furthermore, If use multiple cores,
I think we need to consider synchronization of variables, and lock is necessary.

> 
> Have to say, that I am still not sure that 10% RX/TX improvement is worth
> bypassing mempool completely and introducing all this extra complexity in
> RX/TX path.
[Feifei] Thus maybe we can avoid this complexity as much as possible.
We should not increase the complexity of the bottom layer for convenience, 
but leave it to the user to decide.  If user wants performance, he needs to consider
and operate more.

> But, if we'll still decide to go ahead with direct-rearming, this re-arrangement,
> I think, should help to keep things clear and avoid introducing new limitations
> in existing functionality.
> 
> WDYT?
> 
> Konstantin
> 
> 
> 
> 
> 
> 
>
  
Feifei Wang July 6, 2022, 11:35 a.m. UTC | #5
> -----邮件原件-----
> 发件人: Feifei Wang
> 发送时间: Wednesday, July 6, 2022 4:53 PM
> 收件人: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>; Yuying
> Zhang <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>
> 抄送: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> 主题: 回复: [RFC PATCH v1] net/i40e: put mempool cache out of API
> 
> 
> 
> > -----邮件原件-----
> > 发件人: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> > 发送时间: Sunday, July 3, 2022 8:20 PM
> > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Yuying Zhang
> > <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>; Ruifeng
> > Wang <Ruifeng.Wang@arm.com>
> > 抄送: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>
> > 主题: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> >
> >
> > > Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache
> > > out of API to free buffers directly. There are two changes different
> > > with previous version:
> > > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > > 2. put cache out of "mempool_bulk" API to copy buffers into it
> > > directly
> > >
> > > Performance Test with l3fwd neon path:
> > > 		with this patch
> > > n1sdp:		no perforamnce change
> > > amper-altra:	+4.0%
> > >
> >
> >
> Thanks for your detailed comments.
> 
> > Thanks for RFC, appreciate your effort.
> > So, as I understand - bypassing mempool put/get itself gives about
> > 7-10% speedup for RX/TX on ARM platforms, correct?
> [Feifei] Yes.
> 
> >
> > About direct-rearm RX approach you propose:
> > After another thought, probably it is possible to re-arrange it in a
> > way that would help avoid related negatives.
> > The basic idea as follows:
> >
> > 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> >     Also make sw_ring de-coupled from RXQ itself, i.e:
> >     when RXQ is stopped or even destroyed, related sw_ring may still
> >     exist (probably ref-counter or RCU would be sufficient here).
> >     All that means we need a common layout/api for rxq_sw_ring
> >     and PMDs that would like to support direct-rearming will have to
> >     use/obey it.
> [Feifei] de-coupled sw-ring and RXQ may cause dangerous case due to RXQ is
> stopped but elements of it (sw-ring) is still kept and we may forget to free
> this sw-ring in the end.
> Furthermore,  if we apply this, we need to separate operation when closing
> RXQ and add Rx sw-ring free operation when closing TXQ. This will be
> complex and it is not conducive to subsequent maintenance if maintainer
> does not understand direct-rearm mode very well.
> 
> >
> > 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> >     at txq_free_bufs() try to store released mbufs inside attached
> >     sw_ring directly. If there is no attached sw_ring, or not enough
> >     free space in it - continue with mempool_put() as usual.
> >     Note that actual arming of HW RXDs still remains responsibility
> >     of RX code-path:
> >     rxq_rearm(rxq) {
> >       ...
> >       - check are there are N already filled entries inside rxq_sw_ring.
> >         if not, populate them from mempool (usual mempool_get()).
> >       - arm related RXDs and mark these sw_ring entries as managed by HW.
> >       ...
> >     }
> >
> [Feifei] We try to create two modes, one is direct-rearm and the other is
> direct-free like above.
> And by performance comparison, we select direct-rearm which improve
> performance by 7% - 14% compared with direct-free by 3.6% - 7% in n1sdp.
> Furthermore, I think put direct mode in Tx or Rx is equivalent. For direct-
> rearm, if there is no Tx sw-ring, Rx will get mbufs from mempool. For direct-
> fee, if there is no Rx sw-ring, Tx will put mbufs into mempool. At last, what
> affects our decision-making is the improvement of performance.
> 
> >
> > So rxq_sw_ring will serve two purposes:
> > - track mbufs that are managed by HW (that what it does now)
> > - private (per RXQ) mbuf cache
> >
> > Now, if TXQ is stopped while RXQ is running - no extra synchronization
> > is required, RXQ would just use
> > mempool_get() to rearm its sw_ring itself.
> >
> > If RXQ is stopped while TXQ is still running - TXQ can still continue
> > to populate related sw_ring till it gets full.
> > Then it will continue with mempool_put() as usual.
> > Of-course it means that user who wants to use this feature should
> > probably account some extra mbufs for such case, or might be
> > rxq_sw_ring can have enable/disable flag to mitigate such situation.
> >
> [Feifei] For direct-rearm, the key point should be the communication
> between TXQ and RXQ when TXQ is stopped. De-coupled sw-ring is complex,
> maybe we can simplify this and assign this to the application. My thought is
> that if direct-rearm is enabled, when users want to close TX port, they must
> firstly close mapped RX port and disable direct-rearm feature. Then they can
> restart RX port.
[Feifei] I think here we can refer to "hairpin queue" in ' rte_eth_dev_tx_queue_stop ':
''
if (rte_eth_dev_is_tx_hairpin_queue(dev, tx_queue_id)) {
		RTE_ETHDEV_LOG(INFO,
			"Can't stop Tx hairpin queue %"PRIu16" of device with port_id=%"PRIu16"\n",
			tx_queue_id, port_id);
		return -EINVAL;
	}
"
> 
> > As another benefit here - such approach makes possible to use several
> > TXQs (even from different devices) to rearm same RXQ.
> [Feifei] Actually, for direct-rearm, it can use several RXQs to rearm same TXQ,
> so this is equivalent for direct-rearm and direct-free. Furthermore, If use
> multiple cores, I think we need to consider synchronization of variables, and
> lock is necessary.
> 
> >
> > Have to say, that I am still not sure that 10% RX/TX improvement is
> > worth bypassing mempool completely and introducing all this extra
> > complexity in RX/TX path.
> [Feifei] Thus maybe we can avoid this complexity as much as possible.
> We should not increase the complexity of the bottom layer for convenience,
> but leave it to the user to decide.  If user wants performance, he needs to
> consider and operate more.
> 
> > But, if we'll still decide to go ahead with direct-rearming, this
> > re-arrangement, I think, should help to keep things clear and avoid
> > introducing new limitations in existing functionality.
> >
> > WDYT?
> >
> > Konstantin
> >
> >
> >
> >
> >
> >
> >
  
Feifei Wang July 11, 2022, 3:08 a.m. UTC | #6
> -----邮件原件-----
> 发件人: Feifei Wang
> 发送时间: Wednesday, July 6, 2022 7:36 PM
> 收件人: 'Konstantin Ananyev' <konstantin.v.ananyev@yandex.ru>; 'Yuying
> Zhang' <Yuying.Zhang@intel.com>; 'Beilei Xing' <beilei.xing@intel.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>
> 抄送: 'dev@dpdk.org' <dev@dpdk.org>; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; nd
> <nd@arm.com>
> 主题: 回复: [RFC PATCH v1] net/i40e: put mempool cache out of API
> 
> 
> 
> > -----邮件原件-----
> > 发件人: Feifei Wang
> > 发送时间: Wednesday, July 6, 2022 4:53 PM
> > 收件人: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>; Yuying
> Zhang
> > <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>; Ruifeng
> > Wang <Ruifeng.Wang@arm.com>
> > 抄送: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > 主题: 回复: [RFC PATCH v1] net/i40e: put mempool cache out of API
> >
> >
> >
> > > -----邮件原件-----
> > > 发件人: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> > > 发送时间: Sunday, July 3, 2022 8:20 PM
> > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Yuying Zhang
> > > <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>;
> > > Ruifeng Wang <Ruifeng.Wang@arm.com>
> > > 抄送: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>
> > > 主题: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> > >
> > >
> > > > Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache
> > > > out of API to free buffers directly. There are two changes
> > > > different with previous version:
> > > > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > > > 2. put cache out of "mempool_bulk" API to copy buffers into it
> > > > directly
> > > >
> > > > Performance Test with l3fwd neon path:
> > > > 		with this patch
> > > > n1sdp:		no perforamnce change
> > > > amper-altra:	+4.0%
> > > >
> > >
> > >
> > Thanks for your detailed comments.
> >
> > > Thanks for RFC, appreciate your effort.
> > > So, as I understand - bypassing mempool put/get itself gives about
> > > 7-10% speedup for RX/TX on ARM platforms, correct?
> > [Feifei] Yes.
[Feifei] Sorry I need to correct this. Actually according to our test in direct-rearm
cover letter,  the improvement is 7% to 14% on N1SDP and 14% to 17% on Ampere
Altra.
> >
> > >
> > > About direct-rearm RX approach you propose:
> > > After another thought, probably it is possible to re-arrange it in a
> > > way that would help avoid related negatives.
> > > The basic idea as follows:
> > >
> > > 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> > >     Also make sw_ring de-coupled from RXQ itself, i.e:
> > >     when RXQ is stopped or even destroyed, related sw_ring may still
> > >     exist (probably ref-counter or RCU would be sufficient here).
> > >     All that means we need a common layout/api for rxq_sw_ring
> > >     and PMDs that would like to support direct-rearming will have to
> > >     use/obey it.
> > [Feifei] de-coupled sw-ring and RXQ may cause dangerous case due to
> > RXQ is stopped but elements of it (sw-ring) is still kept and we may
> > forget to free this sw-ring in the end.
> > Furthermore,  if we apply this, we need to separate operation when
> > closing RXQ and add Rx sw-ring free operation when closing TXQ. This
> > will be complex and it is not conducive to subsequent maintenance if
> > maintainer does not understand direct-rearm mode very well.
> >
> > >
> > > 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> > >     at txq_free_bufs() try to store released mbufs inside attached
> > >     sw_ring directly. If there is no attached sw_ring, or not enough
> > >     free space in it - continue with mempool_put() as usual.
> > >     Note that actual arming of HW RXDs still remains responsibility
> > >     of RX code-path:
> > >     rxq_rearm(rxq) {
> > >       ...
> > >       - check are there are N already filled entries inside rxq_sw_ring.
> > >         if not, populate them from mempool (usual mempool_get()).
> > >       - arm related RXDs and mark these sw_ring entries as managed by HW.
> > >       ...
> > >     }
> > >
> > [Feifei] We try to create two modes, one is direct-rearm and the other
> > is direct-free like above.
> > And by performance comparison, we select direct-rearm which improve
> > performance by 7% - 14% compared with direct-free by 3.6% - 7% in n1sdp.
> > Furthermore, I think put direct mode in Tx or Rx is equivalent. For
> > direct- rearm, if there is no Tx sw-ring, Rx will get mbufs from
> > mempool. For direct- fee, if there is no Rx sw-ring, Tx will put mbufs
> > into mempool. At last, what affects our decision-making is the
> improvement of performance.
> >
> > >
> > > So rxq_sw_ring will serve two purposes:
> > > - track mbufs that are managed by HW (that what it does now)
> > > - private (per RXQ) mbuf cache
> > >
> > > Now, if TXQ is stopped while RXQ is running - no extra
> > > synchronization is required, RXQ would just use
> > > mempool_get() to rearm its sw_ring itself.
> > >
> > > If RXQ is stopped while TXQ is still running - TXQ can still
> > > continue to populate related sw_ring till it gets full.
> > > Then it will continue with mempool_put() as usual.
> > > Of-course it means that user who wants to use this feature should
> > > probably account some extra mbufs for such case, or might be
> > > rxq_sw_ring can have enable/disable flag to mitigate such situation.
> > >
> > [Feifei] For direct-rearm, the key point should be the communication
> > between TXQ and RXQ when TXQ is stopped. De-coupled sw-ring is
> > complex, maybe we can simplify this and assign this to the
> > application. My thought is that if direct-rearm is enabled, when users
> > want to close TX port, they must firstly close mapped RX port and
> > disable direct-rearm feature. Then they can restart RX port.
> [Feifei] I think here we can refer to "hairpin queue" in '
> rte_eth_dev_tx_queue_stop ':
> ''
> if (rte_eth_dev_is_tx_hairpin_queue(dev, tx_queue_id)) {
> 		RTE_ETHDEV_LOG(INFO,
> 			"Can't stop Tx hairpin queue %"PRIu16" of device
> with port_id=%"PRIu16"\n",
> 			tx_queue_id, port_id);
> 		return -EINVAL;
> 	}
> "
> >
> > > As another benefit here - such approach makes possible to use
> > > several TXQs (even from different devices) to rearm same RXQ.
> > [Feifei] Actually, for direct-rearm, it can use several RXQs to rearm
> > same TXQ, so this is equivalent for direct-rearm and direct-free.
> > Furthermore, If use multiple cores, I think we need to consider
> > synchronization of variables, and lock is necessary.
> >
> > >
> > > Have to say, that I am still not sure that 10% RX/TX improvement is
> > > worth bypassing mempool completely and introducing all this extra
> > > complexity in RX/TX path.
> > [Feifei] Thus maybe we can avoid this complexity as much as possible.
> > We should not increase the complexity of the bottom layer for
> > convenience, but leave it to the user to decide.  If user wants
> > performance, he needs to consider and operate more.
> >
> > > But, if we'll still decide to go ahead with direct-rearming, this
> > > re-arrangement, I think, should help to keep things clear and avoid
> > > introducing new limitations in existing functionality.
> > >
> > > WDYT?
> > >
> > > Konstantin
> > >
> > >
> > >
> > >
> > >
> > >
> > >
  
Honnappa Nagarahalli July 11, 2022, 6:19 a.m. UTC | #7
<snip>
> 
> 
> > Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache out
> > of API to free buffers directly. There are two changes different with
> > previous version:
> > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > 2. put cache out of "mempool_bulk" API to copy buffers into it
> > directly
> >
> > Performance Test with l3fwd neon path:
> > 		with this patch
> > n1sdp:		no perforamnce change
> > amper-altra:	+4.0%
> >
> 
> 
> Thanks for RFC, appreciate your effort.
> So, as I understand - bypassing mempool put/get itself gives about 7-10%
> speedup for RX/TX on ARM platforms, correct?
It is 7% to 14% on N1SDP, 14% to 17% on Ampere machines
The current RFC needs to be converted to a patch (albeit with zero-copy mempool APIs) as it applies for a more generic use case.

> 
> About direct-rearm RX approach you propose:
> After another thought, probably it is possible to re-arrange it in a way that
> would help avoid related negatives.
Thanks for the idea.

> The basic idea as follows:
> 
> 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
>     Also make sw_ring de-coupled from RXQ itself, i.e:
>     when RXQ is stopped or even destroyed, related sw_ring may still
>     exist (probably ref-counter or RCU would be sufficient here).
>     All that means we need a common layout/api for rxq_sw_ring
>     and PMDs that would like to support direct-rearming will have to
>     use/obey it.
This would mean, we will have additional for loop to fill the descriptors on the RX side.
May be we could keep the queue when the RXQ is stopped, and free it only when RXQ is destroyed. Don't have to allocate one more if the RXQ is started back again? This means, the mbufs will be in the sw_ring if the RXQ is stopped but not destroyed (same as what you have mentioned below).

> 
> 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
>     at txq_free_bufs() try to store released mbufs inside attached
>     sw_ring directly. If there is no attached sw_ring, or not enough
>     free space in it - continue with mempool_put() as usual.
>     Note that actual arming of HW RXDs still remains responsibility
What problems do you see if we arm the HW RXDs (assuming that we do not have to support this across multiple devices)?

>     of RX code-path:
>     rxq_rearm(rxq) {
>       ...
>       - check are there are N already filled entries inside rxq_sw_ring.
>         if not, populate them from mempool (usual mempool_get()).
>       - arm related RXDs and mark these sw_ring entries as managed by HW.
>       ...
>     }
> 
> 
> So rxq_sw_ring will serve two purposes:
> - track mbufs that are managed by HW (that what it does now)
> - private (per RXQ) mbuf cache
> 
> Now, if TXQ is stopped while RXQ is running - no extra synchronization is
> required, RXQ would just use
> mempool_get() to rearm its sw_ring itself.
There would be some synchronization required which tells the data plane threads not to access the TXQ before the TXQ is stopped. Other than this, no extra synchronization is required.

For the current patch, we could use a similar approach. i.e. when TXQ is stopped, it does not free the mbufs from sw_ring, we just let the RXQ consume all the mbufs from TX side sw_ring till it is empty.

> 
> If RXQ is stopped while TXQ is still running - TXQ can still continue to populate
> related sw_ring till it gets full.
> Then it will continue with mempool_put() as usual.
> Of-course it means that user who wants to use this feature should probably
> account some extra mbufs for such case, or might be rxq_sw_ring can have
> enable/disable flag to mitigate such situation.
> 
> As another benefit here - such approach makes possible to use several TXQs
> (even from different devices) to rearm same RXQ.
Being able to use several TXQs is a practical use case. But, I am not sure if different devices in the same server is a practical scenario that we need to address.

> 
> Have to say, that I am still not sure that 10% RX/TX improvement is worth
> bypassing mempool completely and introducing all this extra complexity in
> RX/TX path.
It is more, clarified above.

> But, if we'll still decide to go ahead with direct-rearming, this re-arrangement, I
> think, should help to keep things clear and avoid introducing new limitations in
> existing functionality.
> 
> WDYT?
I had another idea. This requires the application to change, which might be ok given that it is new feature/optimization.

We could expose a new API to the application which allows RX side to take buffers from TX side. The application would call this additional API just before calling the eth_rx_burst API.

The advantages I see with this approach is:
1) The static mapping of the ports is not required. The mapping is dynamic. This allows for the application to make decisions based on current conditions in the data plane.
2) Code is simple, because it does not have to check for many conditions. The application can make better decisions as it knows the scenario well. Not all applications have to take the hit of conditionals.
2) The existing synchronization used by the applications (to stop RXQ/TXQ) is sufficient. No new synchronization required.

The disadvantage is that it is another function pointer call which will reduce some performance.

> 
> Konstantin
> 
> 
> 
> 
> 
> 
>
  
Konstantin Ananyev July 13, 2022, 6:48 p.m. UTC | #8
11/07/2022 07:19, Honnappa Nagarahalli пишет:
> <snip>
>>
>>
>>> Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache out
>>> of API to free buffers directly. There are two changes different with
>>> previous version:
>>> 1. change txep from "i40e_entry" to "i40e_vec_entry"
>>> 2. put cache out of "mempool_bulk" API to copy buffers into it
>>> directly
>>>
>>> Performance Test with l3fwd neon path:
>>> 		with this patch
>>> n1sdp:		no perforamnce change
>>> amper-altra:	+4.0%
>>>
>>
>>
>> Thanks for RFC, appreciate your effort.
>> So, as I understand - bypassing mempool put/get itself gives about 7-10%
>> speedup for RX/TX on ARM platforms, correct?
> It is 7% to 14% on N1SDP, 14% to 17% on Ampere machines
> The current RFC needs to be converted to a patch (albeit with zero-copy mempool APIs) as it applies for a more generic use case.
> 
>>
>> About direct-rearm RX approach you propose:
>> After another thought, probably it is possible to re-arrange it in a way that
>> would help avoid related negatives.
> Thanks for the idea.
> 
>> The basic idea as follows:
>>
>> 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
>>      Also make sw_ring de-coupled from RXQ itself, i.e:
>>      when RXQ is stopped or even destroyed, related sw_ring may still
>>      exist (probably ref-counter or RCU would be sufficient here).
>>      All that means we need a common layout/api for rxq_sw_ring
>>      and PMDs that would like to support direct-rearming will have to
>>      use/obey it.
> This would mean, we will have additional for loop to fill the descriptors on the RX side.

Yes, rx will go though avaiable entries in sw_ring and fill actual HW
descriptors.

> May be we could keep the queue when the RXQ is stopped, and free it only when RXQ is destroyed. Don't have to allocate one more if the RXQ is started back again? This means, the mbufs will be in the sw_ring if the RXQ is stopped but not destroyed (same as what you have mentioned below).

My thought here was - yes sw_ring can stay at queue stop
(even probably at destroy) to avoid related TXQ operation abortion.
Though actual HW desc ring will be de-init/destroyed as usual.
That's why I think we need to decouple sw_ring from queue and
it's HW ring.

> 
>>
>> 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
>>      at txq_free_bufs() try to store released mbufs inside attached
>>      sw_ring directly. If there is no attached sw_ring, or not enough
>>      free space in it - continue with mempool_put() as usual.
>>      Note that actual arming of HW RXDs still remains responsibility
> What problems do you see if we arm the HW RXDs (assuming that we do not have to support this across multiple devices)?

Same as visa-versa scenario:
- RXQ might be already stopped, while TXQ still running.
- Also it doesn't sound right from design point of view to allow
   TX code to manipulate RX HW queue directly and visa-versa.

> 
>>      of RX code-path:
>>      rxq_rearm(rxq) {
>>        ...
>>        - check are there are N already filled entries inside rxq_sw_ring.
>>          if not, populate them from mempool (usual mempool_get()).
>>        - arm related RXDs and mark these sw_ring entries as managed by HW.
>>        ...
>>      }
>>
>>
>> So rxq_sw_ring will serve two purposes:
>> - track mbufs that are managed by HW (that what it does now)
>> - private (per RXQ) mbuf cache
>>
>> Now, if TXQ is stopped while RXQ is running - no extra synchronization is
>> required, RXQ would just use
>> mempool_get() to rearm its sw_ring itself.
> There would be some synchronization required which tells the data plane threads not to access the TXQ before the TXQ is stopped. Other than this, no extra synchronization is required.
> 
> For the current patch, we could use a similar approach. i.e. when TXQ is stopped, it does not free the mbufs from sw_ring, we just let the RXQ consume all the mbufs from TX side sw_ring till it is empty.
> 
>>
>> If RXQ is stopped while TXQ is still running - TXQ can still continue to populate
>> related sw_ring till it gets full.
>> Then it will continue with mempool_put() as usual.
>> Of-course it means that user who wants to use this feature should probably
>> account some extra mbufs for such case, or might be rxq_sw_ring can have
>> enable/disable flag to mitigate such situation.
>>
>> As another benefit here - such approach makes possible to use several TXQs
>> (even from different devices) to rearm same RXQ.
> Being able to use several TXQs is a practical use case. But, I am not sure if different devices in the same server is a practical scenario that we need to address.
> 
>>
>> Have to say, that I am still not sure that 10% RX/TX improvement is worth
>> bypassing mempool completely and introducing all this extra complexity in
>> RX/TX path.
> It is more, clarified above.
> 
>> But, if we'll still decide to go ahead with direct-rearming, this re-arrangement, I
>> think, should help to keep things clear and avoid introducing new limitations in
>> existing functionality.
>>
>> WDYT?
> I had another idea. This requires the application to change, which might be ok given that it is new feature/optimization.
> 
> We could expose a new API to the application which allows RX side to take buffers from TX side. The application would call this additional API just before calling the eth_rx_burst API.
> 
> The advantages I see with this approach is:
> 1) The static mapping of the ports is not required. The mapping is dynamic. This allows for the application to make decisions based on current conditions in the data plane.
> 2) Code is simple, because it does not have to check for many conditions. The application can make better decisions as it knows the scenario well. Not all applications have to take the hit of conditionals.
> 2) The existing synchronization used by the applications (to stop RXQ/TXQ) is sufficient. No new synchronization required.


That sounds like a very good idea to me.
I think it will allow to avoid all short-comings that are
present in original approach.
Again, if we like we can even allow user to feed RXQ from some other 
sources (dropped packets).
Just to confirm that we are talking about the same thing:

1. rte_ethdev will have an API for the user to get sw_ring handle
from the RXQ, something like:
struct rxq_sw_ring *
rte_eth_rxq_get_sw_ring(uint16_t port_id, uint16_t rxq_id);

We probbly can even keep 'struct rxq_sw_ring' as internal one,
and use it as a handle.

2. rte_ethdev will have an API that will ask TXQ to free mbufs
and fill given rxq_sw_ring:
int
rte_eth_txq_fill_sw_ring(uint16_t port_id, uint16_t txq_id, struct * 
rxq_sw_ring);


So the app code will look something like:

rxq_sw_ring = rxq_get_sw_ring(rx_port, rxq);
...
while (...) {

	n = eth_tx_burst(tx_port, txq, pkts, N);
	...
	eth_txq_fill_sw_ring(tx_port, txq, rxq_sw_ring);
	n = eth_rx_burst(rx_port, rxq, pkts, N);
	...
}


Did I get your idea right?

> 
> The disadvantage is that it is another function pointer call which will reduce some performance.
> 
>>
>> Konstantin
>>
>>
>>
>>
>>
>>
>>
>
  
Honnappa Nagarahalli July 15, 2022, 9:52 p.m. UTC | #9
<snip>
> >>
> >>
> >>> Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache
> >>> out of API to free buffers directly. There are two changes different
> >>> with previous version:
> >>> 1. change txep from "i40e_entry" to "i40e_vec_entry"
> >>> 2. put cache out of "mempool_bulk" API to copy buffers into it
> >>> directly
> >>>
> >>> Performance Test with l3fwd neon path:
> >>> 		with this patch
> >>> n1sdp:		no perforamnce change
> >>> amper-altra:	+4.0%
> >>>
> >>
> >>
> >> Thanks for RFC, appreciate your effort.
> >> So, as I understand - bypassing mempool put/get itself gives about
> >> 7-10% speedup for RX/TX on ARM platforms, correct?
> > It is 7% to 14% on N1SDP, 14% to 17% on Ampere machines The current
> > RFC needs to be converted to a patch (albeit with zero-copy mempool APIs) as
> it applies for a more generic use case.
> >
> >>
> >> About direct-rearm RX approach you propose:
> >> After another thought, probably it is possible to re-arrange it in a
> >> way that would help avoid related negatives.
> > Thanks for the idea.
> >
> >> The basic idea as follows:
> >>
> >> 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> >>      Also make sw_ring de-coupled from RXQ itself, i.e:
> >>      when RXQ is stopped or even destroyed, related sw_ring may still
> >>      exist (probably ref-counter or RCU would be sufficient here).
> >>      All that means we need a common layout/api for rxq_sw_ring
> >>      and PMDs that would like to support direct-rearming will have to
> >>      use/obey it.
> > This would mean, we will have additional for loop to fill the descriptors on the
> RX side.
> 
> Yes, rx will go though avaiable entries in sw_ring and fill actual HW descriptors.
> 
> > May be we could keep the queue when the RXQ is stopped, and free it only
> when RXQ is destroyed. Don't have to allocate one more if the RXQ is started
> back again? This means, the mbufs will be in the sw_ring if the RXQ is stopped
> but not destroyed (same as what you have mentioned below).
> 
> My thought here was - yes sw_ring can stay at queue stop (even probably at
> destroy) to avoid related TXQ operation abortion.
Understood

> Though actual HW desc ring will be de-init/destroyed as usual.
I do not understand what you mean by this.
I looked at the function ' i40e_dev_tx_queue_stop'. I see that it disables the queue by calling ' i40e_switch_tx_queue'. The rest of the functionality, frees the mbufs in the sw_ring and resets the descriptors (and the associated indices) in the host memory. I guess this is what you are referring to?

> That's why I think we need to decouple sw_ring from queue and it's HW ring.
I have not looked at other PMDs. But, based on what I see in i40e PMD, this is a synchronization problem between control plane and data plane. As long as the data plane is not accessing the TX queue while tx_queue_stop is called, we should be fine to access the descriptors as well. The synchronization requirement applies for the tx_queue_stop API already and we can add the direct-rearm requirement under the same synchronization. i.e. when the data plane says it is not accessing the TX port, it has stopped using the direct-rearm feature as well.

BTW, the rte_eth_dev_tx_queue_stop API does not call out the synchronization requirement and does not mention the need for the data plane to stop accessing the queue. I think it should be documented.

> 
> >
> >>
> >> 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> >>      at txq_free_bufs() try to store released mbufs inside attached
> >>      sw_ring directly. If there is no attached sw_ring, or not enough
> >>      free space in it - continue with mempool_put() as usual.
> >>      Note that actual arming of HW RXDs still remains responsibility
> > What problems do you see if we arm the HW RXDs (assuming that we do not
> have to support this across multiple devices)?
> 
> Same as visa-versa scenario:
> - RXQ might be already stopped, while TXQ still running.
This is again a control plane - data plane synchronization problem. i.e. if RXQ is to be stopped (which requires synchronization), the data plane should stop using the direct-rearm.

> - Also it doesn't sound right from design point of view to allow
>    TX code to manipulate RX HW queue directly and visa-versa.
Ok, there is no hard technical issue that stops us from doing this, correct?

> 
> >
> >>      of RX code-path:
> >>      rxq_rearm(rxq) {
> >>        ...
> >>        - check are there are N already filled entries inside rxq_sw_ring.
> >>          if not, populate them from mempool (usual mempool_get()).
> >>        - arm related RXDs and mark these sw_ring entries as managed by HW.
> >>        ...
> >>      }
> >>
> >>
> >> So rxq_sw_ring will serve two purposes:
> >> - track mbufs that are managed by HW (that what it does now)
> >> - private (per RXQ) mbuf cache
> >>
> >> Now, if TXQ is stopped while RXQ is running - no extra
> >> synchronization is required, RXQ would just use
> >> mempool_get() to rearm its sw_ring itself.
> > There would be some synchronization required which tells the data plane
> threads not to access the TXQ before the TXQ is stopped. Other than this, no
> extra synchronization is required.
> >
> > For the current patch, we could use a similar approach. i.e. when TXQ is
> stopped, it does not free the mbufs from sw_ring, we just let the RXQ consume
> all the mbufs from TX side sw_ring till it is empty.
> >
> >>
> >> If RXQ is stopped while TXQ is still running - TXQ can still continue
> >> to populate related sw_ring till it gets full.
> >> Then it will continue with mempool_put() as usual.
> >> Of-course it means that user who wants to use this feature should
> >> probably account some extra mbufs for such case, or might be
> >> rxq_sw_ring can have enable/disable flag to mitigate such situation.
> >>
> >> As another benefit here - such approach makes possible to use several
> >> TXQs (even from different devices) to rearm same RXQ.
> > Being able to use several TXQs is a practical use case. But, I am not sure if
> different devices in the same server is a practical scenario that we need to
> address.
> >
> >>
> >> Have to say, that I am still not sure that 10% RX/TX improvement is
> >> worth bypassing mempool completely and introducing all this extra
> >> complexity in RX/TX path.
> > It is more, clarified above.
> >
> >> But, if we'll still decide to go ahead with direct-rearming, this
> >> re-arrangement, I think, should help to keep things clear and avoid
> >> introducing new limitations in existing functionality.
> >>
> >> WDYT?
> > I had another idea. This requires the application to change, which might be ok
> given that it is new feature/optimization.
> >
> > We could expose a new API to the application which allows RX side to take
> buffers from TX side. The application would call this additional API just before
> calling the eth_rx_burst API.
> >
> > The advantages I see with this approach is:
> > 1) The static mapping of the ports is not required. The mapping is dynamic.
> This allows for the application to make decisions based on current conditions in
> the data plane.
> > 2) Code is simple, because it does not have to check for many conditions. The
> application can make better decisions as it knows the scenario well. Not all
> applications have to take the hit of conditionals.
> > 2) The existing synchronization used by the applications (to stop RXQ/TXQ) is
> sufficient. No new synchronization required.
> 
> 
> That sounds like a very good idea to me.
> I think it will allow to avoid all short-comings that are present in original
> approach.
> Again, if we like we can even allow user to feed RXQ from some other sources
> (dropped packets).
Good point

> Just to confirm that we are talking about the same thing:
> 
> 1. rte_ethdev will have an API for the user to get sw_ring handle from the RXQ,
> something like:
> struct rxq_sw_ring *
> rte_eth_rxq_get_sw_ring(uint16_t port_id, uint16_t rxq_id);
> 
> We probbly can even keep 'struct rxq_sw_ring' as internal one, and use it as a
> handle.
Do we need to expose the SW ring to the application level? IMO, it would be good for the application to refer to port ID and queue ID.

> 
> 2. rte_ethdev will have an API that will ask TXQ to free mbufs and fill given
> rxq_sw_ring:
> int
> rte_eth_txq_fill_sw_ring(uint16_t port_id, uint16_t txq_id, struct *
> rxq_sw_ring);
It is good to fill the descriptors (but not update the head and tail pointers) which keeps the code light for better performance.

> 
> 
> So the app code will look something like:
> 
> rxq_sw_ring = rxq_get_sw_ring(rx_port, rxq);
> ...
> while (...) {
> 
> 	n = eth_tx_burst(tx_port, txq, pkts, N);
> 	...
> 	eth_txq_fill_sw_ring(tx_port, txq, rxq_sw_ring);
> 	n = eth_rx_burst(rx_port, rxq, pkts, N);
> 	...
> }
> 
> 
> Did I get your idea right?
Overall your understanding is correct, but the idea I had differs slightly (mainly keeping performance in mind and sticking to the experiments we have done).
1) eth_rxq_fill(rx_port, rxq, tx_port, txq);
This will take the buffers from TXQ and fill the RXQ sw_ring as well as the hardware buffers. 

2) The loop would be as follows:
while (...) {
 
	eth_rxq_fill((rx_port, rxq, tx_port, txq);
	n = eth_rx_burst(rx_port, rxq, pkts, N);
 	...
	n = eth_tx_burst(tx_port, txq, pkts, N);
}


> 
> >
> > The disadvantage is that it is another function pointer call which will reduce
> some performance.
> >
> >>
> >> Konstantin
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
  
Honnappa Nagarahalli July 15, 2022, 10:06 p.m. UTC | #10
<snip>
> >>
> >>
> >>> Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache
> >>> out of API to free buffers directly. There are two changes different
> >>> with previous version:
> >>> 1. change txep from "i40e_entry" to "i40e_vec_entry"
> >>> 2. put cache out of "mempool_bulk" API to copy buffers into it
> >>> directly
> >>>
> >>> Performance Test with l3fwd neon path:
> >>> 		with this patch
> >>> n1sdp:		no perforamnce change
> >>> amper-altra:	+4.0%
> >>>
> >>
> >>
> >> Thanks for RFC, appreciate your effort.
> >> So, as I understand - bypassing mempool put/get itself gives about
> >> 7-10% speedup for RX/TX on ARM platforms, correct?
> > It is 7% to 14% on N1SDP, 14% to 17% on Ampere machines The current
> > RFC needs to be converted to a patch (albeit with zero-copy mempool APIs) as
> it applies for a more generic use case.
> >
> >>
> >> About direct-rearm RX approach you propose:
> >> After another thought, probably it is possible to re-arrange it in a
> >> way that would help avoid related negatives.
> > Thanks for the idea.
> >
> >> The basic idea as follows:
> >>
> >> 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> >>      Also make sw_ring de-coupled from RXQ itself, i.e:
> >>      when RXQ is stopped or even destroyed, related sw_ring may still
> >>      exist (probably ref-counter or RCU would be sufficient here).
> >>      All that means we need a common layout/api for rxq_sw_ring
> >>      and PMDs that would like to support direct-rearming will have to
> >>      use/obey it.
> > This would mean, we will have additional for loop to fill the descriptors on the
> RX side.
> 
> Yes, rx will go though avaiable entries in sw_ring and fill actual HW descriptors.
> 
> > May be we could keep the queue when the RXQ is stopped, and free it only
> when RXQ is destroyed. Don't have to allocate one more if the RXQ is started
> back again? This means, the mbufs will be in the sw_ring if the RXQ is stopped
> but not destroyed (same as what you have mentioned below).
> 
> My thought here was - yes sw_ring can stay at queue stop (even probably at
> destroy) to avoid related TXQ operation abortion.
> Though actual HW desc ring will be de-init/destroyed as usual.
> That's why I think we need to decouple sw_ring from queue and it's HW ring.
> 
> >
> >>
> >> 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> >>      at txq_free_bufs() try to store released mbufs inside attached
> >>      sw_ring directly. If there is no attached sw_ring, or not enough
> >>      free space in it - continue with mempool_put() as usual.
> >>      Note that actual arming of HW RXDs still remains responsibility
> > What problems do you see if we arm the HW RXDs (assuming that we do not
> have to support this across multiple devices)?
> 
> Same as visa-versa scenario:
> - RXQ might be already stopped, while TXQ still running.
> - Also it doesn't sound right from design point of view to allow
>    TX code to manipulate RX HW queue directly and visa-versa.
> 
> >
> >>      of RX code-path:
> >>      rxq_rearm(rxq) {
> >>        ...
> >>        - check are there are N already filled entries inside rxq_sw_ring.
> >>          if not, populate them from mempool (usual mempool_get()).
> >>        - arm related RXDs and mark these sw_ring entries as managed by HW.
> >>        ...
> >>      }
> >>
> >>
> >> So rxq_sw_ring will serve two purposes:
> >> - track mbufs that are managed by HW (that what it does now)
> >> - private (per RXQ) mbuf cache
> >>
> >> Now, if TXQ is stopped while RXQ is running - no extra
> >> synchronization is required, RXQ would just use
> >> mempool_get() to rearm its sw_ring itself.
> > There would be some synchronization required which tells the data plane
> threads not to access the TXQ before the TXQ is stopped. Other than this, no
> extra synchronization is required.
> >
> > For the current patch, we could use a similar approach. i.e. when TXQ is
> stopped, it does not free the mbufs from sw_ring, we just let the RXQ consume
> all the mbufs from TX side sw_ring till it is empty.
> >
> >>
> >> If RXQ is stopped while TXQ is still running - TXQ can still continue
> >> to populate related sw_ring till it gets full.
> >> Then it will continue with mempool_put() as usual.
> >> Of-course it means that user who wants to use this feature should
> >> probably account some extra mbufs for such case, or might be
> >> rxq_sw_ring can have enable/disable flag to mitigate such situation.
> >>
> >> As another benefit here - such approach makes possible to use several
> >> TXQs (even from different devices) to rearm same RXQ.
> > Being able to use several TXQs is a practical use case. But, I am not sure if
> different devices in the same server is a practical scenario that we need to
> address.
> >
> >>
> >> Have to say, that I am still not sure that 10% RX/TX improvement is
> >> worth bypassing mempool completely and introducing all this extra
> >> complexity in RX/TX path.
> > It is more, clarified above.
> >
> >> But, if we'll still decide to go ahead with direct-rearming, this
> >> re-arrangement, I think, should help to keep things clear and avoid
> >> introducing new limitations in existing functionality.
> >>
> >> WDYT?
> > I had another idea. This requires the application to change, which might be ok
> given that it is new feature/optimization.
> >
> > We could expose a new API to the application which allows RX side to take
> buffers from TX side. The application would call this additional API just before
> calling the eth_rx_burst API.
> >
> > The advantages I see with this approach is:
> > 1) The static mapping of the ports is not required. The mapping is dynamic.
> This allows for the application to make decisions based on current conditions in
> the data plane.
> > 2) Code is simple, because it does not have to check for many conditions. The
> application can make better decisions as it knows the scenario well. Not all
> applications have to take the hit of conditionals.
> > 2) The existing synchronization used by the applications (to stop RXQ/TXQ) is
> sufficient. No new synchronization required.
> 
> 
> That sounds like a very good idea to me.
> I think it will allow to avoid all short-comings that are present in original
> approach.
> Again, if we like we can even allow user to feed RXQ from some other sources
> (dropped packets).
> Just to confirm that we are talking about the same thing:
> 
> 1. rte_ethdev will have an API for the user to get sw_ring handle from the RXQ,
> something like:
> struct rxq_sw_ring *
> rte_eth_rxq_get_sw_ring(uint16_t port_id, uint16_t rxq_id);
> 
> We probbly can even keep 'struct rxq_sw_ring' as internal one, and use it as a
> handle.
> 
> 2. rte_ethdev will have an API that will ask TXQ to free mbufs and fill given
> rxq_sw_ring:
> int
> rte_eth_txq_fill_sw_ring(uint16_t port_id, uint16_t txq_id, struct *
> rxq_sw_ring);
> 
> 
> So the app code will look something like:
> 
> rxq_sw_ring = rxq_get_sw_ring(rx_port, rxq);
> ...
> while (...) {
> 
> 	n = eth_tx_burst(tx_port, txq, pkts, N);
> 	...
> 	eth_txq_fill_sw_ring(tx_port, txq, rxq_sw_ring);
> 	n = eth_rx_burst(rx_port, rxq, pkts, N);
> 	...
> }
I think the above loop can be re-written as follows:

while (...) {
	eth_txq_fill_sw_ring(tx_port, txq, rxq_sw_ring);
	n = eth_rx_burst(rx_port, rxq, pkts, N);
	...
	n = eth_tx_burst(tx_port, txq, pkts, N);
	...
}

> 
> 
> Did I get your idea right?
> 
> >
> > The disadvantage is that it is another function pointer call which will reduce
> some performance.
> >
> >>
> >> Konstantin
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
  
Konstantin Ananyev July 21, 2022, 10:02 a.m. UTC | #11
>>>>
>>>>> Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache
>>>>> out of API to free buffers directly. There are two changes different
>>>>> with previous version:
>>>>> 1. change txep from "i40e_entry" to "i40e_vec_entry"
>>>>> 2. put cache out of "mempool_bulk" API to copy buffers into it
>>>>> directly
>>>>>
>>>>> Performance Test with l3fwd neon path:
>>>>> 		with this patch
>>>>> n1sdp:		no perforamnce change
>>>>> amper-altra:	+4.0%
>>>>>
>>>>
>>>>
>>>> Thanks for RFC, appreciate your effort.
>>>> So, as I understand - bypassing mempool put/get itself gives about
>>>> 7-10% speedup for RX/TX on ARM platforms, correct?
>>> It is 7% to 14% on N1SDP, 14% to 17% on Ampere machines The current
>>> RFC needs to be converted to a patch (albeit with zero-copy mempool APIs) as
>> it applies for a more generic use case.
>>>
>>>>
>>>> About direct-rearm RX approach you propose:
>>>> After another thought, probably it is possible to re-arrange it in a
>>>> way that would help avoid related negatives.
>>> Thanks for the idea.
>>>
>>>> The basic idea as follows:
>>>>
>>>> 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
>>>>       Also make sw_ring de-coupled from RXQ itself, i.e:
>>>>       when RXQ is stopped or even destroyed, related sw_ring may still
>>>>       exist (probably ref-counter or RCU would be sufficient here).
>>>>       All that means we need a common layout/api for rxq_sw_ring
>>>>       and PMDs that would like to support direct-rearming will have to
>>>>       use/obey it.
>>> This would mean, we will have additional for loop to fill the descriptors on the
>> RX side.
>>
>> Yes, rx will go though avaiable entries in sw_ring and fill actual HW descriptors.
>>
>>> May be we could keep the queue when the RXQ is stopped, and free it only
>> when RXQ is destroyed. Don't have to allocate one more if the RXQ is started
>> back again? This means, the mbufs will be in the sw_ring if the RXQ is stopped
>> but not destroyed (same as what you have mentioned below).
>>
>> My thought here was - yes sw_ring can stay at queue stop (even probably at
>> destroy) to avoid related TXQ operation abortion.
> Understood
> 
>> Though actual HW desc ring will be de-init/destroyed as usual.
> I do not understand what you mean by this.
> I looked at the function ' i40e_dev_tx_queue_stop'. I see that it disables the queue by calling ' i40e_switch_tx_queue'. The rest of the functionality, frees the mbufs in the sw_ring and resets the descriptors (and the associated indices) in the host memory. I guess this is what you are referring to?

 From what I remember for most Intel NICs:
queue_stop(): frees mbufs in sw_ring and resets indexes in the queue 
control strcuture
queue_release: frees mbufs, frees HW Desc ring,
sw_ring and related queue structure.
Though I was talking about situation when RXQ is stopped
(and probably destroyed) while TXQ is still active.
Anyway, I think that problem will go away if we'll have explicit
eth_txq_fill_..(), as you suggested below.
So if we all agree that explicit _fill_() will be better,
we can concentrate on it.

> 
>> That's why I think we need to decouple sw_ring from queue and it's HW ring.
> I have not looked at other PMDs. But, based on what I see in i40e PMD, this is a synchronization problem between control plane and data plane. As long as the data plane is not accessing the TX queue while tx_queue_stop is called, we should be fine to access the descriptors as well. The synchronization requirement applies for the tx_queue_stop API already and we can add the direct-rearm requirement under the same synchronization. i.e. when the data plane says it is not accessing the TX port, it has stopped using the direct-rearm feature as well.
> 
> BTW, the rte_eth_dev_tx_queue_stop API does not call out the synchronization requirement and does not mention the need for the data plane to stop accessing the queue. I think it should be documented.

Agree, it should.

> 
>>
>>>
>>>>
>>>> 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
>>>>       at txq_free_bufs() try to store released mbufs inside attached
>>>>       sw_ring directly. If there is no attached sw_ring, or not enough
>>>>       free space in it - continue with mempool_put() as usual.
>>>>       Note that actual arming of HW RXDs still remains responsibility
>>> What problems do you see if we arm the HW RXDs (assuming that we do not
>> have to support this across multiple devices)?
>>
>> Same as visa-versa scenario:
>> - RXQ might be already stopped, while TXQ still running.
> This is again a control plane - data plane synchronization problem. i.e. if RXQ is to be stopped (which requires synchronization), the data plane should stop using the direct-rearm.

Yes, see above.

> 
>> - Also it doesn't sound right from design point of view to allow
>>     TX code to manipulate RX HW queue directly and visa-versa.
> Ok, there is no hard technical issue that stops us from doing this, correct?


As I understand what you suggesting is just:
eth_rxq_fill((rx_port, rxq, tx_port, txq)
without introducing any intermediate common structure (sw_ring)
correct?
The main problem with such approach for me:
we would need either to introduce new dev-op for rxq_fill()
that would be called inside eth_rxq_fill() implementation.
That would work, but I don't see how to make it effective.
As in that case you'll need to put your freed mbufs to somewhere
(array on the stack) and then do function call to pass them to RXQ.
Another alternative - make TX code to directly access  and
manipulate RXQ structure and RXDs
(similar to what your original patch was doing).
 From my perspective it is undesirable - better to keep RX and TX path
clean and independent as they are right now.
It would ease maintenance, support and future development.
Second thing: with that approach what you propose will work only when
rx_port and tx_port belong to the same driver and device type.
With filling just sw_ring from TXQ (without accessing rest of RXQ
structure and RXDs) we can avoid these problems.

> 
>>
>>>
>>>>       of RX code-path:
>>>>       rxq_rearm(rxq) {
>>>>         ...
>>>>         - check are there are N already filled entries inside rxq_sw_ring.
>>>>           if not, populate them from mempool (usual mempool_get()).
>>>>         - arm related RXDs and mark these sw_ring entries as managed by HW.
>>>>         ...
>>>>       }
>>>>
>>>>
>>>> So rxq_sw_ring will serve two purposes:
>>>> - track mbufs that are managed by HW (that what it does now)
>>>> - private (per RXQ) mbuf cache
>>>>
>>>> Now, if TXQ is stopped while RXQ is running - no extra
>>>> synchronization is required, RXQ would just use
>>>> mempool_get() to rearm its sw_ring itself.
>>> There would be some synchronization required which tells the data plane
>> threads not to access the TXQ before the TXQ is stopped. Other than this, no
>> extra synchronization is required.
>>>
>>> For the current patch, we could use a similar approach. i.e. when TXQ is
>> stopped, it does not free the mbufs from sw_ring, we just let the RXQ consume
>> all the mbufs from TX side sw_ring till it is empty.
>>>
>>>>
>>>> If RXQ is stopped while TXQ is still running - TXQ can still continue
>>>> to populate related sw_ring till it gets full.
>>>> Then it will continue with mempool_put() as usual.
>>>> Of-course it means that user who wants to use this feature should
>>>> probably account some extra mbufs for such case, or might be
>>>> rxq_sw_ring can have enable/disable flag to mitigate such situation.
>>>>
>>>> As another benefit here - such approach makes possible to use several
>>>> TXQs (even from different devices) to rearm same RXQ.
>>> Being able to use several TXQs is a practical use case. But, I am not sure if
>> different devices in the same server is a practical scenario that we need to
>> address.
>>>
>>>>
>>>> Have to say, that I am still not sure that 10% RX/TX improvement is
>>>> worth bypassing mempool completely and introducing all this extra
>>>> complexity in RX/TX path.
>>> It is more, clarified above.
>>>
>>>> But, if we'll still decide to go ahead with direct-rearming, this
>>>> re-arrangement, I think, should help to keep things clear and avoid
>>>> introducing new limitations in existing functionality.
>>>>
>>>> WDYT?
>>> I had another idea. This requires the application to change, which might be ok
>> given that it is new feature/optimization.
>>>
>>> We could expose a new API to the application which allows RX side to take
>> buffers from TX side. The application would call this additional API just before
>> calling the eth_rx_burst API.
>>>
>>> The advantages I see with this approach is:
>>> 1) The static mapping of the ports is not required. The mapping is dynamic.
>> This allows for the application to make decisions based on current conditions in
>> the data plane.
>>> 2) Code is simple, because it does not have to check for many conditions. The
>> application can make better decisions as it knows the scenario well. Not all
>> applications have to take the hit of conditionals.
>>> 2) The existing synchronization used by the applications (to stop RXQ/TXQ) is
>> sufficient. No new synchronization required.
>>
>>
>> That sounds like a very good idea to me.
>> I think it will allow to avoid all short-comings that are present in original
>> approach.
>> Again, if we like we can even allow user to feed RXQ from some other sources
>> (dropped packets).
> Good point
> 
>> Just to confirm that we are talking about the same thing:
>>
>> 1. rte_ethdev will have an API for the user to get sw_ring handle from the RXQ,
>> something like:
>> struct rxq_sw_ring *
>> rte_eth_rxq_get_sw_ring(uint16_t port_id, uint16_t rxq_id);
>>
>> We probbly can even keep 'struct rxq_sw_ring' as internal one, and use it as a
>> handle.
> Do we need to expose the SW ring to the application level? IMO, it would be good for the application to refer to port ID and queue ID.
> 
>>
>> 2. rte_ethdev will have an API that will ask TXQ to free mbufs and fill given
>> rxq_sw_ring:
>> int
>> rte_eth_txq_fill_sw_ring(uint16_t port_id, uint16_t txq_id, struct *
>> rxq_sw_ring);
> It is good to fill the descriptors (but not update the head and tail pointers) which keeps the code light for better performance.
> 
>>
>>
>> So the app code will look something like:
>>
>> rxq_sw_ring = rxq_get_sw_ring(rx_port, rxq);
>> ...
>> while (...) {
>>
>> 	n = eth_tx_burst(tx_port, txq, pkts, N);
>> 	...
>> 	eth_txq_fill_sw_ring(tx_port, txq, rxq_sw_ring);
>> 	n = eth_rx_burst(rx_port, rxq, pkts, N);
>> 	...
>> }
>>
>>
>> Did I get your idea right?
> Overall your understanding is correct, but the idea I had differs slightly (mainly keeping performance in mind and sticking to the experiments we have done).
> 1) eth_rxq_fill(rx_port, rxq, tx_port, txq);
> This will take the buffers from TXQ and fill the RXQ sw_ring as well as the hardware buffers.
> 
> 2) The loop would be as follows:
> while (...) {
>   
> 	eth_rxq_fill((rx_port, rxq, tx_port, txq);
> 	n = eth_rx_burst(rx_port, rxq, pkts, N);
>   	...
> 	n = eth_tx_burst(tx_port, txq, pkts, N);
> }
> 
> 
>>
>>>
>>> The disadvantage is that it is another function pointer call which will reduce
>> some performance.
>>>
>>>>
>>>> Konstantin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>
  

Patch

diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index 959832ed6a..e418225b4e 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -81,7 +81,7 @@  reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 static __rte_always_inline int
 i40e_tx_free_bufs(struct i40e_tx_queue *txq)
 {
-	struct i40e_tx_entry *txep;
+	struct i40e_vec_tx_entry *txep;
 	uint32_t n;
 	uint32_t i;
 	int nb_free = 0;
@@ -98,17 +98,39 @@  i40e_tx_free_bufs(struct i40e_tx_queue *txq)
 	 /* first buffer to free from S/W ring is at index
 	  * tx_next_dd - (tx_rs_thresh-1)
 	  */
-	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	txep = (void *)txq->sw_ring;
+	txep += txq->tx_next_dd - (n - 1);
 
 	if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
-		for (i = 0; i < n; i++) {
-			free[i] = txep[i].mbuf;
-			/* no need to reset txep[i].mbuf in vector path */
+		struct rte_mempool *mp = txep[0].mbuf->pool;
+		void **cache_objs;
+		struct rte_mempool_cache *cache = rte_mempool_default_cache(mp,
+				rte_lcore_id());
+
+		if (!cache || cache->len == 0)
+			goto normal;
+
+		cache_objs = &cache->objs[cache->len];
+
+		if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) {
+			rte_mempool_ops_enqueue_bulk(mp, (void *)txep, n);
+			goto done;
+		}
+
+		rte_memcpy(cache_objs, txep, sizeof(void *) * n);
+		/* no need to reset txep[i].mbuf in vector path */
+		cache->len += n;
+
+		if (cache->len >= cache->flushthresh) {
+			rte_mempool_ops_enqueue_bulk
+				(mp, &cache->objs[cache->size],
+				cache->len - cache->size);
+			cache->len = cache->size;
 		}
-		rte_mempool_put_bulk(free[0]->pool, (void **)free, n);
 		goto done;
 	}
 
+normal:
 	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
 	if (likely(m != NULL)) {
 		free[0] = m;
@@ -147,7 +169,7 @@  i40e_tx_free_bufs(struct i40e_tx_queue *txq)
 }
 
 static __rte_always_inline void
-tx_backlog_entry(struct i40e_tx_entry *txep,
+tx_backlog_entry(struct i40e_vec_tx_entry *txep,
 		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 {
 	int i;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..d2d61e8ef4 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -680,12 +680,15 @@  i40e_xmit_fixed_burst_vec(void *__rte_restrict tx_queue,
 {
 	struct i40e_tx_queue *txq = (struct i40e_tx_queue *)tx_queue;
 	volatile struct i40e_tx_desc *txdp;
-	struct i40e_tx_entry *txep;
+	struct i40e_vec_tx_entry *txep;
 	uint16_t n, nb_commit, tx_id;
 	uint64_t flags = I40E_TD_CMD;
 	uint64_t rs = I40E_TX_DESC_CMD_RS | I40E_TD_CMD;
 	int i;
 
+	/* cross rx_thresh boundary is not allowed */
+	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
 	if (txq->nb_tx_free < txq->tx_free_thresh)
 		i40e_tx_free_bufs(txq);
 
@@ -695,7 +698,8 @@  i40e_xmit_fixed_burst_vec(void *__rte_restrict tx_queue,
 
 	tx_id = txq->tx_tail;
 	txdp = &txq->tx_ring[tx_id];
-	txep = &txq->sw_ring[tx_id];
+	txep = (void *)txq->sw_ring;
+	txep += tx_id;
 
 	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
 
@@ -715,7 +719,7 @@  i40e_xmit_fixed_burst_vec(void *__rte_restrict tx_queue,
 
 		/* avoid reach the end of ring */
 		txdp = &txq->tx_ring[tx_id];
-		txep = &txq->sw_ring[tx_id];
+		txep = (void *)txq->sw_ring;
 	}
 
 	tx_backlog_entry(txep, tx_pkts, nb_commit);