[v1,0/4] maximize vector rx burst for PMDs

Message ID	20200827075452.1751-1-jia.guo@intel.com (mailing list archive)
Headers	IronPort-SDR: VEghY5kCu4wIEIk8uWiNahF49X3yH52OoThoUEq7Hh4ptDN57QWsROnXh8kh02cwDq7Mxln5g1 wvydEWQ8zKRQ== IronPort-SDR: 0SJOZX8KMFWgApK+BbH0S2ZPpHdlcgpDMxLeaaAgplyk03/5ApAHMf2jhE/ZKTMqI1wZ1spkaS CHTyLcpO1v+w== From: Jeff Guo <jia.guo@intel.com> To: qiming.yang@intel.com, beilei.xing@intel.com, wei.zhao1@intel.com, qi.z.zhang@intel.com, jingjing.wu@intel.com Cc: bruce.richardson@intel.com, dev@dpdk.org, jia.guo@intel.com, helin.zhang@intel.com, mb@smartsharesystems.com, ferruh.yigit@intel.com, barbette@kth.se Date: Thu, 27 Aug 2020 15:54:48 +0800 Message-Id: <20200827075452.1751-1-jia.guo@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: [dpdk-dev] [PATCH v1 0/4] maximize vector rx burst for PMDs Precedence: list Errors-To: dev-bounces@dpdk.org Sender: "dev" <dev-bounces@dpdk.org>
Series	maximize vector rx burst for PMDs \| [v1,0/4] maximize vector rx burst for PMDs [v1,1/4] net/ixgbe: maximize vector rx burst for ixgbe [v1,2/4] net/i40e: maximize vector rx burst for i40e [v1,3/4] net/ice: maximize vector rx burst for ice [v1,4/4] net/iavf: maximize vector rx burst for iavf

Message ID

20200827075452.1751-1-jia.guo@intel.com (mailing list archive)

Headers

IronPort-SDR: 
 VEghY5kCu4wIEIk8uWiNahF49X3yH52OoThoUEq7Hh4ptDN57QWsROnXh8kh02cwDq7Mxln5g1
 wvydEWQ8zKRQ==
IronPort-SDR: 
 0SJOZX8KMFWgApK+BbH0S2ZPpHdlcgpDMxLeaaAgplyk03/5ApAHMf2jhE/ZKTMqI1wZ1spkaS
 CHTyLcpO1v+w==
From: Jeff Guo <jia.guo@intel.com>
To: qiming.yang@intel.com, beilei.xing@intel.com, wei.zhao1@intel.com,
 qi.z.zhang@intel.com, jingjing.wu@intel.com
Cc: bruce.richardson@intel.com, dev@dpdk.org, jia.guo@intel.com,
 helin.zhang@intel.com, mb@smartsharesystems.com, ferruh.yigit@intel.com,
 barbette@kth.se
Date: Thu, 27 Aug 2020 15:54:48 +0800
Message-Id: <20200827075452.1751-1-jia.guo@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [dpdk-dev] [PATCH v1 0/4] maximize vector rx burst for PMDs
Precedence: list
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

Series

maximize vector rx burst for PMDs |

Message

Guo, Jia Aug. 27, 2020, 7:54 a.m. UTC

  The limitation of burst size in vector rx was removed, since it should
retrieve as much received packets as possible. And also the scattered
receive path should use a wrapper function to achieve the goal of
burst maximizing.

This patch set aims to maximize vector rx burst for for
ixgbe/i40e/ice/iavf PMDs.

Jeff Guo (4):
  net/ixgbe: maximize vector rx burst for ixgbe
  net/i40e: maximize vector rx burst for i40e
  net/ice: maximize vector rx burst for ice
  net/iavf: maximize vector rx burst for iavf

 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  60 ++++++++-----
 drivers/net/i40e/i40e_rxtx_vec_avx2.c    |   6 +-
 drivers/net/i40e/i40e_rxtx_vec_neon.c    |  49 +++++++----
 drivers/net/i40e/i40e_rxtx_vec_sse.c     |  51 ++++++++----
 drivers/net/iavf/iavf_rxtx_vec_sse.c     | 102 +++++++++++++++++------
 drivers/net/ice/ice_rxtx_vec_sse.c       |  47 ++++++++---
 drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c  |  51 +++++++++---
 drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c   |  48 ++++++++---
 8 files changed, 301 insertions(+), 113 deletions(-)

Comments

Morten Brørup Aug. 27, 2020, 8:40 a.m. UTC | #1

Jeff and Ethernet API maintainers Thomas, Ferruh and Andrew,

I'm hijacking this patch thread to propose a small API modification that prevents unnecessarily performance degradations.

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jeff Guo
> Sent: Thursday, August 27, 2020 9:55 AM
> 
> The limitation of burst size in vector rx was removed, since it should
> retrieve as much received packets as possible. And also the scattered
> receive path should use a wrapper function to achieve the goal of
> burst maximizing.
> 
> This patch set aims to maximize vector rx burst for for
> ixgbe/i40e/ice/iavf PMDs.
> 

Now I'm going to be pedantic and say that it still doesn't conform to the rte_eth_rx_burst() API, because the API does not specify any minimum requirement for nb_pkts.

In theory, that could also be fixed in the driver by calling the non-vector function from the vector functions if nb_pkts is too small for the vector implementation.

However, I think that calling rte_eth_rx_burst() with a small nb_pkts is silly and not in the spirit of DPDK, and introducing an additional comparison for a small nb_pkts in the driver vector functions would degrade their performance (only slightly, but anyway).

Instead, I propose that the rte_eth_rx_burst() API is updated with a minimum requirement for nb_pkts. This minimum requirement should be supported by all Ethernet drivers, instead of having minimum requirements for nb_pkts depending on driver and vector function.


I also have a small comment about the description for all the main rx functions:

+/**
+ * vPMD receive routine that reassembles scattered packets.
+ * Main receive routine that can handle arbitrary burst sizes
+ * Notice:
+ * - nb_pkts < RTE_I40E_DESCS_PER_LOOP, just return no packet
+ */

It says "can handle arbitrary burst sizes", but bears a notice that it really cannot. So please remove that line from all these functions.


Med venlig hilsen / kind regards
- Morten Brørup

Bruce Richardson Aug. 27, 2020, 9:09 a.m. UTC | #2

On Thu, Aug 27, 2020 at 10:40:11AM +0200, Morten Brørup wrote:
> Jeff and Ethernet API maintainers Thomas, Ferruh and Andrew,
> 
> I'm hijacking this patch thread to propose a small API modification that prevents unnecessarily performance degradations.
> 
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jeff Guo
> > Sent: Thursday, August 27, 2020 9:55 AM
> > 
> > The limitation of burst size in vector rx was removed, since it should
> > retrieve as much received packets as possible. And also the scattered
> > receive path should use a wrapper function to achieve the goal of
> > burst maximizing.
> > 
> > This patch set aims to maximize vector rx burst for for
> > ixgbe/i40e/ice/iavf PMDs.
> > 
> 
> Now I'm going to be pedantic and say that it still doesn't conform to the rte_eth_rx_burst() API, because the API does not specify any minimum requirement for nb_pkts.
> 
> In theory, that could also be fixed in the driver by calling the non-vector function from the vector functions if nb_pkts is too small for the vector implementation.
> 
> However, I think that calling rte_eth_rx_burst() with a small nb_pkts is silly and not in the spirit of DPDK, and introducing an additional comparison for a small nb_pkts in the driver vector functions would degrade their performance (only slightly, but anyway).
> 

Actually, I'd like to see a confirmed measurement showing a slowdown before
we discard such an option. :-) While I agree that using small bursts is not
keeping with the design approach of DPDK of using large bursts to amortize
costs and allow prefetching, there are cases where a user/app may want a
small burst size, e.g. 4, for latency reasons, and we need a way to support
that.

Since the path selection is dynamic, we need to either:
a) provide a way for the user to specify that they will use smaller bursts
and so that vector functions should not be used
b) have the vector functions transparently fallback to the scalar ones if
used with smaller bursts

Of these, option b) is simpler, and should be low cost since any check is
just once per burst, and - assuming an app is written using the same
request-size each time - should be entirely predictable after the first
call.

/Bruce

Morten Brørup Aug. 27, 2020, 9:31 a.m. UTC | #3

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, August 27, 2020 11:10 AM
> 
> On Thu, Aug 27, 2020 at 10:40:11AM +0200, Morten Brørup wrote:
> > Jeff and Ethernet API maintainers Thomas, Ferruh and Andrew,
> >
> > I'm hijacking this patch thread to propose a small API modification
> that prevents unnecessarily performance degradations.
> >
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jeff Guo
> > > Sent: Thursday, August 27, 2020 9:55 AM
> > >
> > > The limitation of burst size in vector rx was removed, since it
> should
> > > retrieve as much received packets as possible. And also the
> scattered
> > > receive path should use a wrapper function to achieve the goal of
> > > burst maximizing.
> > >
> > > This patch set aims to maximize vector rx burst for for
> > > ixgbe/i40e/ice/iavf PMDs.
> > >
> >
> > Now I'm going to be pedantic and say that it still doesn't conform to
> the rte_eth_rx_burst() API, because the API does not specify any
> minimum requirement for nb_pkts.
> >
> > In theory, that could also be fixed in the driver by calling the non-
> vector function from the vector functions if nb_pkts is too small for
> the vector implementation.
> >
> > However, I think that calling rte_eth_rx_burst() with a small nb_pkts
> is silly and not in the spirit of DPDK, and introducing an additional
> comparison for a small nb_pkts in the driver vector functions would
> degrade their performance (only slightly, but anyway).
> >
> 
> Actually, I'd like to see a confirmed measurement showing a slowdown
> before
> we discard such an option. :-)

Good point!

> While I agree that using small bursts is
> not
> keeping with the design approach of DPDK of using large bursts to
> amortize
> costs and allow prefetching, there are cases where a user/app may want
> a
> small burst size, e.g. 4, for latency reasons, and we need a way to
> support
> that.
> 
I assume that calling rte_eth_rx_burst() with nb_pkts=32 returns 4 packets if only 4 packets are available, so you would need to be extremely latency sensitive to call it with a smaller nb_pkts. I guess that high frequency trading is the only real life scenario here.

> Since the path selection is dynamic, we need to either:
> a) provide a way for the user to specify that they will use smaller
> bursts
> and so that vector functions should not be used
> b) have the vector functions transparently fallback to the scalar ones
> if
> used with smaller bursts
> 
> Of these, option b) is simpler, and should be low cost since any check
> is
> just once per burst, and - assuming an app is written using the same
> request-size each time - should be entirely predictable after the first
> call.
> 
Why does everyone assume that DPDK applications are so simple that the branch predictor will cover the entire data path? I hear this argument over and over again, and by principle I disagree with it!

How about c): add rte_eth_rx() and rte_eth_tx() functions for receiving/transmitting a single packet. The ring library has such functions.

Optimized single-packet functions might even perform better than calling the burst functions with nb_pkts=1. Great for latency focused applications. :-)

> /Bruce
>

Bruce Richardson Aug. 27, 2020, 9:43 a.m. UTC | #4

On Thu, Aug 27, 2020 at 11:31:15AM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Thursday, August 27, 2020 11:10 AM
> > 
> > On Thu, Aug 27, 2020 at 10:40:11AM +0200, Morten Brørup wrote:
> > > Jeff and Ethernet API maintainers Thomas, Ferruh and Andrew,
> > >
> > > I'm hijacking this patch thread to propose a small API modification
> > that prevents unnecessarily performance degradations.
> > >
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jeff Guo
> > > > Sent: Thursday, August 27, 2020 9:55 AM
> > > >
> > > > The limitation of burst size in vector rx was removed, since it
> > should
> > > > retrieve as much received packets as possible. And also the
> > scattered
> > > > receive path should use a wrapper function to achieve the goal of
> > > > burst maximizing.
> > > >
> > > > This patch set aims to maximize vector rx burst for for
> > > > ixgbe/i40e/ice/iavf PMDs.
> > > >
> > >
> > > Now I'm going to be pedantic and say that it still doesn't conform to
> > the rte_eth_rx_burst() API, because the API does not specify any
> > minimum requirement for nb_pkts.
> > >
> > > In theory, that could also be fixed in the driver by calling the non-
> > vector function from the vector functions if nb_pkts is too small for
> > the vector implementation.
> > >
> > > However, I think that calling rte_eth_rx_burst() with a small nb_pkts
> > is silly and not in the spirit of DPDK, and introducing an additional
> > comparison for a small nb_pkts in the driver vector functions would
> > degrade their performance (only slightly, but anyway).
> > >
> > 
> > Actually, I'd like to see a confirmed measurement showing a slowdown
> > before
> > we discard such an option. :-)
> 
> Good point!
> 
> > While I agree that using small bursts is
> > not
> > keeping with the design approach of DPDK of using large bursts to
> > amortize
> > costs and allow prefetching, there are cases where a user/app may want
> > a
> > small burst size, e.g. 4, for latency reasons, and we need a way to
> > support
> > that.
> > 
> I assume that calling rte_eth_rx_burst() with nb_pkts=32 returns 4 packets if only 4 packets are available, so you would need to be extremely latency sensitive to call it with a smaller nb_pkts. I guess that high frequency trading is the only real life scenario here.
>
Yes, it really boils down to whether you are prepared to accept lower
max throughput or dropped packets in order to gain lower latency.
 
> > Since the path selection is dynamic, we need to either:
> > a) provide a way for the user to specify that they will use smaller
> > bursts
> > and so that vector functions should not be used
> > b) have the vector functions transparently fallback to the scalar ones
> > if
> > used with smaller bursts
> > 
> > Of these, option b) is simpler, and should be low cost since any check
> > is
> > just once per burst, and - assuming an app is written using the same
> > request-size each time - should be entirely predictable after the first
> > call.
> > 
> Why does everyone assume that DPDK applications are so simple that the branch predictor will cover the entire data path? I hear this argument over and over again, and by principle I disagree with it!
> 

Fair enough, that was an assumption on my part. Do you see in your apps
many cases where branches are getting mispredicted despite going the same
way each time though the code?

> How about c): add rte_eth_rx() and rte_eth_tx() functions for receiving/transmitting a single packet. The ring library has such functions.
> 
> Optimized single-packet functions might even perform better than calling the burst functions with nb_pkts=1. Great for latency focused applications. :-)
>
That is another option, yes.
A further option is to add to the vector code a one-off switch to check first
time it's called that the request size is not lower than the min supported
(again basing on the assumption that one is not going to be varying the
burst size asked - which may not be true in call cases but won't leave us
any worse off than we are now!).

/Bruce

Morten Brørup Aug. 27, 2020, 10:13 a.m. UTC | #5

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Thursday, August 27, 2020 11:44 AM
> 
> On Thu, Aug 27, 2020 at 11:31:15AM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Thursday, August 27, 2020 11:10 AM
> > >
> > > On Thu, Aug 27, 2020 at 10:40:11AM +0200, Morten Brørup wrote:
> > > > Jeff and Ethernet API maintainers Thomas, Ferruh and Andrew,
> > > >
> > > > I'm hijacking this patch thread to propose a small API
> modification
> > > that prevents unnecessarily performance degradations.
> > > >
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jeff Guo
> > > > > Sent: Thursday, August 27, 2020 9:55 AM
> > > > >
> > > > > The limitation of burst size in vector rx was removed, since it
> > > should
> > > > > retrieve as much received packets as possible. And also the
> > > scattered
> > > > > receive path should use a wrapper function to achieve the goal
> of
> > > > > burst maximizing.
> > > > >
> > > > > This patch set aims to maximize vector rx burst for for
> > > > > ixgbe/i40e/ice/iavf PMDs.
> > > > >
> > > >
> > > > Now I'm going to be pedantic and say that it still doesn't
> conform to
> > > the rte_eth_rx_burst() API, because the API does not specify any
> > > minimum requirement for nb_pkts.
> > > >
> > > > In theory, that could also be fixed in the driver by calling the
> non-
> > > vector function from the vector functions if nb_pkts is too small
> for
> > > the vector implementation.
> > > >
> > > > However, I think that calling rte_eth_rx_burst() with a small
> nb_pkts
> > > is silly and not in the spirit of DPDK, and introducing an
> additional
> > > comparison for a small nb_pkts in the driver vector functions would
> > > degrade their performance (only slightly, but anyway).
> > > >
> > >
> > > Actually, I'd like to see a confirmed measurement showing a
> slowdown
> > > before
> > > we discard such an option. :-)
> >
> > Good point!
> >
> > > While I agree that using small bursts is
> > > not
> > > keeping with the design approach of DPDK of using large bursts to
> > > amortize
> > > costs and allow prefetching, there are cases where a user/app may
> want
> > > a
> > > small burst size, e.g. 4, for latency reasons, and we need a way to
> > > support
> > > that.
> > >
> > I assume that calling rte_eth_rx_burst() with nb_pkts=32 returns 4
> packets if only 4 packets are available, so you would need to be
> extremely latency sensitive to call it with a smaller nb_pkts. I guess
> that high frequency trading is the only real life scenario here.
> >
> Yes, it really boils down to whether you are prepared to accept lower
> max throughput or dropped packets in order to gain lower latency.
> 
> > > Since the path selection is dynamic, we need to either:
> > > a) provide a way for the user to specify that they will use smaller
> > > bursts
> > > and so that vector functions should not be used
> > > b) have the vector functions transparently fallback to the scalar
> ones
> > > if
> > > used with smaller bursts
> > >
> > > Of these, option b) is simpler, and should be low cost since any
> check
> > > is
> > > just once per burst, and - assuming an app is written using the
> same
> > > request-size each time - should be entirely predictable after the
> first
> > > call.
> > >
> > Why does everyone assume that DPDK applications are so simple that
> the branch predictor will cover the entire data path? I hear this
> argument over and over again, and by principle I disagree with it!
> >
> 
> Fair enough, that was an assumption on my part. Do you see in your apps
> many cases where branches are getting mispredicted despite going the
> same
> way each time though the code?
> 
We haven't looked deeply into this, but I don't think so.

My objection is of a more general nature. As a library, DPDK cannot assume that applications using it are simple, and - based on that assumption - take away resources that could have been available for the application.

The Intel general optimization guidelines specifies that code should be arranged to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.

It also says: Conditional branches that are never taken do not consume BTB resources.

Somehow this last detail is completely ignored by DPDK developers.

We put a lot of effort into conserving resources in most areas in DPDK, but when it comes to the branch prediction target buffer (BTB), we gladly organize code with branches turning the wrong way, thus unnecessarily consuming BTB entries. And the argument goes: The branch predictor will catch it after the first time.

> > How about c): add rte_eth_rx() and rte_eth_tx() functions for
> receiving/transmitting a single packet. The ring library has such
> functions.
> >
> > Optimized single-packet functions might even perform better than
> calling the burst functions with nb_pkts=1. Great for latency focused
> applications. :-)
> >
> That is another option, yes.
> A further option is to add to the vector code a one-off switch to check
> first
> time it's called that the request size is not lower than the min
> supported
> (again basing on the assumption that one is not going to be varying the
> burst size asked - which may not be true in call cases but won't leave
> us
> any worse off than we are now!).

I certainly don't support this option. But it was worth mentioning.

Bruce Richardson Aug. 27, 2020, 11:41 a.m. UTC | #6

On Thu, Aug 27, 2020 at 12:13:51PM +0200, Morten Brørup wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Thursday, August 27, 2020 11:44 AM
> > 
> > On Thu, Aug 27, 2020 at 11:31:15AM +0200, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Thursday, August 27, 2020 11:10 AM
> > > >
> > > > On Thu, Aug 27, 2020 at 10:40:11AM +0200, Morten Brørup wrote:
> > > > > Jeff and Ethernet API maintainers Thomas, Ferruh and Andrew,
> > > > >
> > > > > I'm hijacking this patch thread to propose a small API
> > modification
> > > > that prevents unnecessarily performance degradations.
> > > > >
> > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jeff Guo
> > > > > > Sent: Thursday, August 27, 2020 9:55 AM
> > > > > >
> > > > > > The limitation of burst size in vector rx was removed, since it
> > > > should
> > > > > > retrieve as much received packets as possible. And also the
> > > > scattered
> > > > > > receive path should use a wrapper function to achieve the goal
> > of
> > > > > > burst maximizing.
> > > > > >
> > > > > > This patch set aims to maximize vector rx burst for for
> > > > > > ixgbe/i40e/ice/iavf PMDs.
> > > > > >
> > > > >
> > > > > Now I'm going to be pedantic and say that it still doesn't
> > conform to
> > > > the rte_eth_rx_burst() API, because the API does not specify any
> > > > minimum requirement for nb_pkts.
> > > > >
> > > > > In theory, that could also be fixed in the driver by calling the
> > non-
> > > > vector function from the vector functions if nb_pkts is too small
> > for
> > > > the vector implementation.
> > > > >
> > > > > However, I think that calling rte_eth_rx_burst() with a small
> > nb_pkts
> > > > is silly and not in the spirit of DPDK, and introducing an
> > additional
> > > > comparison for a small nb_pkts in the driver vector functions would
> > > > degrade their performance (only slightly, but anyway).
> > > > >
> > > >
> > > > Actually, I'd like to see a confirmed measurement showing a
> > slowdown
> > > > before
> > > > we discard such an option. :-)
> > >
> > > Good point!
> > >
> > > > While I agree that using small bursts is
> > > > not
> > > > keeping with the design approach of DPDK of using large bursts to
> > > > amortize
> > > > costs and allow prefetching, there are cases where a user/app may
> > want
> > > > a
> > > > small burst size, e.g. 4, for latency reasons, and we need a way to
> > > > support
> > > > that.
> > > >
> > > I assume that calling rte_eth_rx_burst() with nb_pkts=32 returns 4
> > packets if only 4 packets are available, so you would need to be
> > extremely latency sensitive to call it with a smaller nb_pkts. I guess
> > that high frequency trading is the only real life scenario here.
> > >
> > Yes, it really boils down to whether you are prepared to accept lower
> > max throughput or dropped packets in order to gain lower latency.
> > 
> > > > Since the path selection is dynamic, we need to either:
> > > > a) provide a way for the user to specify that they will use smaller
> > > > bursts
> > > > and so that vector functions should not be used
> > > > b) have the vector functions transparently fallback to the scalar
> > ones
> > > > if
> > > > used with smaller bursts
> > > >
> > > > Of these, option b) is simpler, and should be low cost since any
> > check
> > > > is
> > > > just once per burst, and - assuming an app is written using the
> > same
> > > > request-size each time - should be entirely predictable after the
> > first
> > > > call.
> > > >
> > > Why does everyone assume that DPDK applications are so simple that
> > the branch predictor will cover the entire data path? I hear this
> > argument over and over again, and by principle I disagree with it!
> > >
> > 
> > Fair enough, that was an assumption on my part. Do you see in your apps
> > many cases where branches are getting mispredicted despite going the
> > same
> > way each time though the code?
> > 
> We haven't looked deeply into this, but I don't think so.
> 
> My objection is of a more general nature. As a library, DPDK cannot assume that applications using it are simple, and - based on that assumption - take away resources that could have been available for the application.
> 
> The Intel general optimization guidelines specifies that code should be arranged to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.
> 
> It also says: Conditional branches that are never taken do not consume BTB resources.
> 
> Somehow this last detail is completely ignored by DPDK developers.
> 
> We put a lot of effort into conserving resources in most areas in DPDK, but when it comes to the branch prediction target buffer (BTB), we gladly organize code with branches turning the wrong way, thus unnecessarily consuming BTB entries. And the argument goes: The branch predictor will catch it after the first time.
>

Looks like something to investigate more. Thanks for bringing this up.
 
> > > How about c): add rte_eth_rx() and rte_eth_tx() functions for
> > receiving/transmitting a single packet. The ring library has such
> > functions.
> > >
> > > Optimized single-packet functions might even perform better than
> > calling the burst functions with nb_pkts=1. Great for latency focused
> > applications. :-)
> > >
> > That is another option, yes.
> > A further option is to add to the vector code a one-off switch to check
> > first
> > time it's called that the request size is not lower than the min
> > supported
> > (again basing on the assumption that one is not going to be varying the
> > burst size asked - which may not be true in call cases but won't leave
> > us
> > any worse off than we are now!).
> 
> I certainly don't support this option. But it was worth mentioning.
> 

Right. For now then, it seems like just documenting a minimum burst size is
reasonable.

/Bruce

Morten Brørup Aug. 28, 2020, 9:03 a.m. UTC | #7

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Thursday, August 27, 2020 1:41 PM
> 
> On Thu, Aug 27, 2020 at 12:13:51PM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> Richardson
> > > Sent: Thursday, August 27, 2020 11:44 AM
> > >
> > > On Thu, Aug 27, 2020 at 11:31:15AM +0200, Morten Brørup wrote:
> > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > Sent: Thursday, August 27, 2020 11:10 AM
> > > > >
> > > > > On Thu, Aug 27, 2020 at 10:40:11AM +0200, Morten Brørup wrote:
> > > > > > Jeff and Ethernet API maintainers Thomas, Ferruh and Andrew,
> > > > > >
> > > > > > I'm hijacking this patch thread to propose a small API
> > > modification
> > > > > that prevents unnecessarily performance degradations.
> > > > > >
> > > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jeff
> Guo
> > > > > > > Sent: Thursday, August 27, 2020 9:55 AM
> > > > > > >
> > > > > > > The limitation of burst size in vector rx was removed,
> since it
> > > > > should
> > > > > > > retrieve as much received packets as possible. And also the
> > > > > scattered
> > > > > > > receive path should use a wrapper function to achieve the
> goal
> > > of
> > > > > > > burst maximizing.
> > > > > > >
> > > > > > > This patch set aims to maximize vector rx burst for for
> > > > > > > ixgbe/i40e/ice/iavf PMDs.
> > > > > > >
> > > > > >
> > > > > > Now I'm going to be pedantic and say that it still doesn't
> > > conform to
> > > > > the rte_eth_rx_burst() API, because the API does not specify
> any
> > > > > minimum requirement for nb_pkts.
> > > > > >
> > > > > > In theory, that could also be fixed in the driver by calling
> the
> > > non-
> > > > > vector function from the vector functions if nb_pkts is too
> small
> > > for
> > > > > the vector implementation.
> > > > > >
> > > > > > However, I think that calling rte_eth_rx_burst() with a small
> > > nb_pkts
> > > > > is silly and not in the spirit of DPDK, and introducing an
> > > additional
> > > > > comparison for a small nb_pkts in the driver vector functions
> would
> > > > > degrade their performance (only slightly, but anyway).
> > > > > >
> > > > >
> > > > > Actually, I'd like to see a confirmed measurement showing a
> > > slowdown
> > > > > before
> > > > > we discard such an option. :-)
> > > >
> > > > Good point!
> > > >
> > > > > While I agree that using small bursts is
> > > > > not
> > > > > keeping with the design approach of DPDK of using large bursts
> to
> > > > > amortize
> > > > > costs and allow prefetching, there are cases where a user/app
> may
> > > want
> > > > > a
> > > > > small burst size, e.g. 4, for latency reasons, and we need a
> way to
> > > > > support
> > > > > that.
> > > > >
> > > > I assume that calling rte_eth_rx_burst() with nb_pkts=32 returns
> 4
> > > packets if only 4 packets are available, so you would need to be
> > > extremely latency sensitive to call it with a smaller nb_pkts. I
> guess
> > > that high frequency trading is the only real life scenario here.
> > > >
> > > Yes, it really boils down to whether you are prepared to accept
> lower
> > > max throughput or dropped packets in order to gain lower latency.
> > >
> > > > > Since the path selection is dynamic, we need to either:
> > > > > a) provide a way for the user to specify that they will use
> smaller
> > > > > bursts
> > > > > and so that vector functions should not be used

After thinking about it, and also inspired by Haiyue's comment in the other thread, this may be a good option. Configure the driver with the burst size that the application is going to use, so the driver can select the appropriate function. The driver could even select another vector function, optimized for the specific size.

It could be a field in the rte_eth_rxmode structure, which is part of the rte_eth_conf structure used in the rte_eth_dev_configure() function.

This would also make the nb_pkts in the rte_eth_rx_burst() obsolete. So we could add a function without the nb_pkts parameter.

Alternatively, we could introduce new variants of the rte_eth_rx_burst() function with fixed burst size, e.g. rte_eth_rx_burst32() to receive a burst of 32 packets, rte_eth_rx_burst64() for 64 packets, and rte_eth_rx_burst128(). Assuming that the performance difference beyond a certain vector size is insignificant, e.g. 128 packets, we don't need to add functions for larger vectors than that.

Or my suggestion here might just be a case of over-optimizing with insignificant performance benefits.

> > > > > b) have the vector functions transparently fallback to the
> scalar
> > > ones
> > > > > if
> > > > > used with smaller bursts
> > > > >
> > > > > Of these, option b) is simpler, and should be low cost since
> any
> > > check
> > > > > is
> > > > > just once per burst, and - assuming an app is written using the
> > > same
> > > > > request-size each time - should be entirely predictable after
> the
> > > first
> > > > > call.
> > > > >
> > > > Why does everyone assume that DPDK applications are so simple
> that
> > > the branch predictor will cover the entire data path? I hear this
> > > argument over and over again, and by principle I disagree with it!
> > > >
> > >
> > > Fair enough, that was an assumption on my part. Do you see in your
> apps
> > > many cases where branches are getting mispredicted despite going
> the
> > > same
> > > way each time though the code?
> > >
> > We haven't looked deeply into this, but I don't think so.
> >
> > My objection is of a more general nature. As a library, DPDK cannot
> assume that applications using it are simple, and - based on that
> assumption - take away resources that could have been available for the
> application.
> >
> > The Intel general optimization guidelines specifies that code should
> be arranged to be consistent with the static branch prediction
> algorithm: make the fall-through code following a conditional branch be
> the likely target for a branch with a forward target, and make the
> fall-through code following a conditional branch be the unlikely target
> for a branch with a backward target.
> >
> > It also says: Conditional branches that are never taken do not
> consume BTB resources.
> >
> > Somehow this last detail is completely ignored by DPDK developers.
> >
> > We put a lot of effort into conserving resources in most areas in
> DPDK, but when it comes to the branch prediction target buffer (BTB),
> we gladly organize code with branches turning the wrong way, thus
> unnecessarily consuming BTB entries. And the argument goes: The branch
> predictor will catch it after the first time.
> >
> 
> Looks like something to investigate more. Thanks for bringing this up.
> 
> > > > How about c): add rte_eth_rx() and rte_eth_tx() functions for
> > > receiving/transmitting a single packet. The ring library has such
> > > functions.
> > > >
> > > > Optimized single-packet functions might even perform better than
> > > calling the burst functions with nb_pkts=1. Great for latency
> focused
> > > applications. :-)
> > > >
> > > That is another option, yes.
> > > A further option is to add to the vector code a one-off switch to
> check
> > > first
> > > time it's called that the request size is not lower than the min
> > > supported
> > > (again basing on the assumption that one is not going to be varying
> the
> > > burst size asked - which may not be true in call cases but won't
> leave
> > > us
> > > any worse off than we are now!).
> >
> > I certainly don't support this option. But it was worth mentioning.
> >
> 
> Right. For now then, it seems like just documenting a minimum burst
> size is
> reasonable.

I agree. It is so far from the spirit of DPDK to call rte_eth_rx_burst() with a small nb_pkts that the driver developers didn't even consider it. The API documentation needs fixing, not the drivers.

It doesn't take care of your example 4 packet latency sensitive application, though. Which BTW also doesn’t work today on drivers with vector support. So it might not be a real world scenario anyway. :-)

> 
> /Bruce

Bruce Richardson Aug. 28, 2020, 10:07 a.m. UTC | #8

On Fri, Aug 28, 2020 at 11:03:59AM +0200, Morten Brørup wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> >
<snip>
> > 
> > Right. For now then, it seems like just documenting a minimum burst
> > size is
> > reasonable.
> 
> I agree. It is so far from the spirit of DPDK to call rte_eth_rx_burst() with a small nb_pkts that the driver developers didn't even consider it. The API documentation needs fixing, not the drivers.
> 
> It doesn't take care of your example 4 packet latency sensitive application, though. Which BTW also doesn’t work today on drivers with vector support. So it might not be a real world scenario anyway. :-)
> 
AFAIK, 8 is the smallest burst guaranteed to work everywhere, but I think
just about everything bar the AVX2 i40e code path also supports 4 as a
burst size. Therefore adjusting to 4 as min-burst might well be reasonable.

/Bruce

Morten Brørup Aug. 28, 2020, 10:50 a.m. UTC | #9

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Friday, August 28, 2020 12:07 PM
> 
> On Fri, Aug 28, 2020 at 11:03:59AM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> Richardson
> > >
> <snip>
> > >
> > > Right. For now then, it seems like just documenting a minimum burst
> > > size is
> > > reasonable.
> >
> > I agree. It is so far from the spirit of DPDK to call
> rte_eth_rx_burst() with a small nb_pkts that the driver developers
> didn't even consider it. The API documentation needs fixing, not the
> drivers.
> >
> > It doesn't take care of your example 4 packet latency sensitive
> application, though. Which BTW also doesn’t work today on drivers with
> vector support. So it might not be a real world scenario anyway. :-)
> >
> AFAIK, 8 is the smallest burst guaranteed to work everywhere, but I
> think
> just about everything bar the AVX2 i40e code path also supports 4 as a
> burst size. Therefore adjusting to 4 as min-burst might well be
> reasonable.
> 
> /Bruce

There must be a reason the i40e AVX2 driver chose to step up to 8 from the previous convention of 4.

Considering Intel's stance on the controversial vector instructions, a larger numbers seems more future proof. HPC benefits from the vector instructions, and DPDK seems to benefit from them too. Let's not prevent that.

Since I don't have insight into Intel's (or any other CPU vendors') plans for future vector instructions, I will assume that 8 suffices for the foreseeable future, and thus I am leaning towards 8 rather than 4.

-Morten

Morten Brørup Aug. 29, 2020, 10:15 a.m. UTC | #10

> From: Morten Brørup
> Sent: Friday, August 28, 2020 12:51 PM
> 
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Friday, August 28, 2020 12:07 PM
> >
> > On Fri, Aug 28, 2020 at 11:03:59AM +0200, Morten Brørup wrote:
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > Richardson
> > > >
> > <snip>
> > > >
> > > > Right. For now then, it seems like just documenting a minimum burst
> > > > size is
> > > > reasonable.
> > >
> > > I agree. It is so far from the spirit of DPDK to call
> > rte_eth_rx_burst() with a small nb_pkts that the driver developers
> > didn't even consider it. The API documentation needs fixing, not the
> > drivers.
> > >
> > > It doesn't take care of your example 4 packet latency sensitive
> > application, though. Which BTW also doesn’t work today on drivers with
> > vector support. So it might not be a real world scenario anyway. :-)
> > >
> > AFAIK, 8 is the smallest burst guaranteed to work everywhere, but I
> > think
> > just about everything bar the AVX2 i40e code path also supports 4 as a
> > burst size. Therefore adjusting to 4 as min-burst might well be
> > reasonable.
> >
> > /Bruce
> 
> There must be a reason the i40e AVX2 driver chose to step up to 8 from the
> previous convention of 4.
> 
> Considering Intel's stance on the controversial vector instructions, a
> larger numbers seems more future proof. HPC benefits from the vector
> instructions, and DPDK seems to benefit from them too. Let's not prevent
> that.
> 
> Since I don't have insight into Intel's (or any other CPU vendors') plans
> for future vector instructions, I will assume that 8 suffices for the
> foreseeable future, and thus I am leaning towards 8 rather than 4.
> 

nb_pkts must be >= 8 and divisible by 8.

Alternatively to being divisible by 8, would there be any benefit in requiring that it is a power-of-two?

-Morten