
[v2,0/3] enable AVX512 for iavf

Message ID 1600306778-46470-1-git-send-email-wenzhuo.lu@intel.com (mailing list archive)

Message

Wenzhuo Lu Sept. 17, 2020, 1:39 a.m. UTC
  AVX512 instructions are supported by more and more platforms. These instructions
can be used in the data path to enhance the per-core performance of packet
processing.
Compared with the existing implementation, this patch set introduces some AVX512
instructions into the iavf data path, and we get a better per-core throughput.

v2:
Update meson.build.
Replace the deprecated 'buf_physaddr' with 'buf_iova'.
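
For context, the 'buf_iova' change only affects how the driver reads a
buffer's DMA address from the mbuf when re-arming descriptors. A minimal,
illustrative helper (not code from this series; the helper name is made up)
would look like:

#include <rte_mbuf.h>

/* Illustrative only: read the DMA address of an mbuf's data area for a
 * descriptor re-arm. The deprecated field was 'buf_physaddr'; the current
 * field is 'buf_iova'. */
static inline rte_iova_t
example_rearm_dma_addr(const struct rte_mbuf *mb)
{
        /* Equivalent to rte_mbuf_data_iova_default(mb). */
        return mb->buf_iova + RTE_PKTMBUF_HEADROOM;
}

The rename does not change the re-arm logic itself, only the field name.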

Wenzhuo Lu (3):
  net/iavf: enable AVX512 for legacy RX
  net/iavf: enable AVX512 for flexible RX
  net/iavf: enable AVX512 for TX

 doc/guides/rel_notes/release_20_11.rst  |    3 +
 drivers/net/iavf/iavf_ethdev.c          |    3 +-
 drivers/net/iavf/iavf_rxtx.c            |   69 +-
 drivers/net/iavf/iavf_rxtx.h            |   18 +
 drivers/net/iavf/iavf_rxtx_vec_avx512.c | 1720 +++++++++++++++++++++++++++++++
 drivers/net/iavf/meson.build            |   17 +
 6 files changed, 1818 insertions(+), 12 deletions(-)
 create mode 100644 drivers/net/iavf/iavf_rxtx_vec_avx512.c
  

Comments

Morten Brørup Sept. 17, 2020, 7:37 a.m. UTC | #1
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> Sent: Thursday, September 17, 2020 3:40 AM
> 
> AVX512 instructions are supported by more and more platforms. These
> instructions
> can be used in the data path to enhance the per-core performance of
> packet
> processing.
> Compared with the existing implementation, this patch set introduces
> some AVX512
> instructions into the iavf data path, and we get a better per-core
> throughput.
> 
> v2:
> Update meson.build.
> Replace the deprecated 'buf_physaddr' with 'buf_iova'.
> 
> Wenzhuo Lu (3):
>   net/iavf: enable AVX512 for legacy RX
>   net/iavf: enable AVX512 for flexible RX
>   net/iavf: enable AVX512 for TX
> 
>  doc/guides/rel_notes/release_20_11.rst  |    3 +
>  drivers/net/iavf/iavf_ethdev.c          |    3 +-
>  drivers/net/iavf/iavf_rxtx.c            |   69 +-
>  drivers/net/iavf/iavf_rxtx.h            |   18 +
>  drivers/net/iavf/iavf_rxtx_vec_avx512.c | 1720
> +++++++++++++++++++++++++++++++
>  drivers/net/iavf/meson.build            |   17 +
>  6 files changed, 1818 insertions(+), 12 deletions(-)
>  create mode 100644 drivers/net/iavf/iavf_rxtx_vec_avx512.c
> 
> --
> 1.9.3
> 

I am not sure I understand the full context here, so please bear with me if I'm completely off...

With this patch set, it looks like the driver manipulates the mempool cache directly, bypassing the libraries encapsulating it.

Isn't that going deeper into a library than expected... What if the implementation of the mempool library changes radically?

And if there are performance gains to be achieved by using vector instructions for manipulating the mempool, perhaps your vector optimizations should go into the mempool library instead?


Med venlig hilsen / kind regards
- Morten Brørup
  
Bruce Richardson Sept. 17, 2020, 9:13 a.m. UTC | #2
On Thu, Sep 17, 2020 at 09:37:29AM +0200, Morten Brørup wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> > Sent: Thursday, September 17, 2020 3:40 AM
> > 
> > AVX512 instructions are supported by more and more platforms. These
> > instructions
> > can be used in the data path to enhance the per-core performance of
> > packet
> > processing.
> > Compared with the existing implementation, this patch set introduces
> > some AVX512
> > instructions into the iavf data path, and we get a better per-core
> > throughput.
> > 
> > v2:
> > Update meson.build.
> > Replace the deprecated 'buf_physaddr' with 'buf_iova'.
> > 
> > Wenzhuo Lu (3):
> >   net/iavf: enable AVX512 for legacy RX
> >   net/iavf: enable AVX512 for flexible RX
> >   net/iavf: enable AVX512 for TX
> > 
> >  doc/guides/rel_notes/release_20_11.rst  |    3 +
> >  drivers/net/iavf/iavf_ethdev.c          |    3 +-
> >  drivers/net/iavf/iavf_rxtx.c            |   69 +-
> >  drivers/net/iavf/iavf_rxtx.h            |   18 +
> >  drivers/net/iavf/iavf_rxtx_vec_avx512.c | 1720
> > +++++++++++++++++++++++++++++++
> >  drivers/net/iavf/meson.build            |   17 +
> >  6 files changed, 1818 insertions(+), 12 deletions(-)
> >  create mode 100644 drivers/net/iavf/iavf_rxtx_vec_avx512.c
> > 
> > --
> > 1.9.3
> > 
> 
> I am not sure I understand the full context here, so please bear with me if I'm completely off...
> 
> With this patch set, it looks like the driver manipulates the mempool cache directly, bypassing the libraries encapsulating it.
> 
> Isn't that going deeper into a library than expected... What if the implementation of the mempool library changes radically?
> 
> And if there are performance gains to be achieved by using vector instructions for manipulating the mempool, perhaps your vector optimizations should go into the mempool library instead?
> 

Looking specifically at the descriptor re-arm code, the benefit from
working off the mempool cache directly comes from saving loads by merging
the code blocks, rather than directly from the vectorization itself -
though the vectorization doesn't hurt. The original code having a separate
mempool function worked roughly like below:

1. mempool code loads mbuf pointers from cache
2. mempool code writes mbuf pointers to the SW ring for the NIC
3. driver code loads the mbuf pointers back from the SW ring
4. driver code then does the rest of the descriptor re-arm.

The benefit comes from eliminating step 3, the loads in the driver, which
are dependent upon the previous stores. By having the driver itself read
from the mempool cache (the code still uses mempool functions for every
other part, since everything beyond the cache depends on the
ring/stack/bucket implementation), we can have the stores go out and, while
they are completing, reuse the already-loaded data to do the descriptor
re-arm.
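
To make the data flow concrete, below is a rough, illustrative sketch of the
two approaches in plain C (not the actual patch code: the descriptor layout
and the 'sw_ring'/'descs' names are simplified assumptions, and the
fall-back handling is elided; only the mempool-cache access pattern
matters):

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Conventional re-arm: bulk-get into the SW ring, then reload. */
static void
rearm_via_bulk_get(struct rte_mempool *mp, struct rte_mbuf **sw_ring,
                   volatile uint64_t *descs, unsigned int n)
{
        unsigned int i;

        /* Steps 1+2: mempool code copies mbuf pointers into the SW ring. */
        if (rte_mempool_get_bulk(mp, (void **)sw_ring, n) < 0)
                return;

        for (i = 0; i < n; i++) {
                /* Step 3: reload the pointer from the SW ring, a load that
                 * depends on the stores just made above. */
                struct rte_mbuf *mb = sw_ring[i];

                /* Step 4: write the DMA address into the descriptor. */
                descs[i] = rte_mbuf_data_iova_default(mb);
        }
}

/* Merged re-arm: read the per-lcore mempool cache directly, so the mbuf
 * pointer loaded from the cache is reused for both the SW-ring store and
 * the descriptor write - no dependent reload. */
static void
rearm_via_cache(struct rte_mempool *mp, struct rte_mbuf **sw_ring,
                volatile uint64_t *descs, unsigned int n)
{
        struct rte_mempool_cache *cache =
                rte_mempool_default_cache(mp, rte_lcore_id());
        unsigned int i;

        if (cache == NULL || cache->len < n)
                return; /* a real driver falls back to rte_mempool_get_bulk() */

        for (i = 0; i < n; i++) {
                struct rte_mbuf *mb = cache->objs[--cache->len];

                sw_ring[i] = mb;                           /* keep for later free */
                descs[i] = rte_mbuf_data_iova_default(mb); /* reuse loaded pointer */
        }
}

In the second variant each mbuf pointer is loaded once from the cache and
used for both writes, so the descriptor writes never have to wait on the
driver's own stores to the SW ring.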

Hope this clarifies things.

/Bruce
  
Morten Brørup Sept. 17, 2020, 9:35 a.m. UTC | #3
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Thursday, September 17, 2020 11:13 AM
> 
> On Thu, Sep 17, 2020 at 09:37:29AM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wenzhuo Lu
> > > Sent: Thursday, September 17, 2020 3:40 AM
> > >
> > > AVX512 instructions are supported by more and more platforms. These
> > > instructions
> > > can be used in the data path to enhance the per-core performance of
> > > packet
> > > processing.
> > > Compared with the existing implementation, this patch set
> introduces
> > > some AVX512
> > > instructions into the iavf data path, and we get a better per-core
> > > throughput.
> > >
> > > v2:
> > > Update meson.build.
> > > Replace the deprecated 'buf_physaddr' with 'buf_iova'.
> > >
> > > Wenzhuo Lu (3):
> > >   net/iavf: enable AVX512 for legacy RX
> > >   net/iavf: enable AVX512 for flexible RX
> > >   net/iavf: enable AVX512 for TX
> > >
> > >  doc/guides/rel_notes/release_20_11.rst  |    3 +
> > >  drivers/net/iavf/iavf_ethdev.c          |    3 +-
> > >  drivers/net/iavf/iavf_rxtx.c            |   69 +-
> > >  drivers/net/iavf/iavf_rxtx.h            |   18 +
> > >  drivers/net/iavf/iavf_rxtx_vec_avx512.c | 1720
> > > +++++++++++++++++++++++++++++++
> > >  drivers/net/iavf/meson.build            |   17 +
> > >  6 files changed, 1818 insertions(+), 12 deletions(-)
> > >  create mode 100644 drivers/net/iavf/iavf_rxtx_vec_avx512.c
> > >
> > > --
> > > 1.9.3
> > >
> >
> > I am not sure I understand the full context here, so please bear with
> me if I'm completely off...
> >
> > With this patch set, it looks like the driver manipulates the mempool
> cache directly, bypassing the libraries encapsulating it.
> >
> > Isn't that going deeper into a library than expected... What if the
> implementation of the mempool library changes radically?
> >
> > And if there are performance gains to be achieved by using vector
> instructions for manipulating the mempool, perhaps your vector
> optimizations should go into the mempool library instead?
> >
> 
> Looking specifically at the descriptor re-arm code, the benefit from
> working off the mempool cache directly comes from saving loads by
> merging
> the code blocks, rather than directly from the vectorization itself -
> though the vectorization doesn't hurt. The original code having a
> separate
> mempool function worked roughly like below:
> 
> 1. mempool code loads mbuf pointers from cache
> 2. mempool code writes mbuf pointers to the SW ring for the NIC
> 3. driver code loads the mbuf pointers back from the SW ring
> 4. driver code then does the rest of the descriptor re-arm.
> 
> The benefit comes from eliminating step 3, the loads in the driver,
> which
> are dependent upon the previous stores. By having the driver itself
> read
> from the mempool cache (the code still uses mempool functions for every
> other part, since everything beyond the cache depends on the
> ring/stack/bucket implementation), we can have the stores go out, and
> while
> they are completing reuse the already-loaded data to do the descriptor
> rearm.
> 
> Hope this clarifies things.
> 
> /Bruce
> 

Thank you for the detailed explanation, Bruce.

It makes sense to me now. So,

Acked-By: Morten Brørup <mb@smartsharesystems.com>


Med venlig hilsen / kind regards
- Morten Brørup