[dpdk-dev,10/17] librte_acl: add AVX2 as new rte_acl_classify() method

Message ID 1418580659-12595-11-git-send-email-konstantin.ananyev@intel.com (mailing list archive)
State Superseded, archived

Commit Message

Ananyev, Konstantin Dec. 14, 2014, 6:10 p.m. UTC
  Introduce new classify() method that uses AVX2 instructions.
From my measurements:
On HSW boards, when processing >= 16 packets per call,
the AVX2 method outperforms its SSE counterpart by 10-25%
(depending on the ruleset).
At runtime, this method is selected as the default on HW that supports AVX2.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/Makefile       |   9 +
 lib/librte_acl/acl.h          |   4 +
 lib/librte_acl/acl_run.h      |   2 +-
 lib/librte_acl/acl_run_avx2.c |  58 +++++
 lib/librte_acl/acl_run_avx2.h | 305 ++++++++++++++++++++++++
 lib/librte_acl/acl_run_sse.c  | 537 +-----------------------------------------
 lib/librte_acl/acl_run_sse.h  | 533 +++++++++++++++++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c      |   5 +-
 lib/librte_acl/rte_acl.h      |   2 +
 9 files changed, 917 insertions(+), 538 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx2.c
 create mode 100644 lib/librte_acl/acl_run_avx2.h
 create mode 100644 lib/librte_acl/acl_run_sse.h
  

Comments

Neil Horman Dec. 15, 2014, 4 p.m. UTC | #1
On Sun, Dec 14, 2014 at 06:10:52PM +0000, Konstantin Ananyev wrote:
> Introduce new classify() method that uses AVX2 instructions.
> From my measurements:
> On HSW boards when processing >= 16 packets per call,
> AVX2 method outperforms its SSE counterpart by 10-25%,
> (depending on the ruleset).
> At runtime, this method is selected as default one on HW that supports AVX2.
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  lib/librte_acl/Makefile       |   9 +
>  lib/librte_acl/acl.h          |   4 +
>  lib/librte_acl/acl_run.h      |   2 +-
>  lib/librte_acl/acl_run_avx2.c |  58 +++++
>  lib/librte_acl/acl_run_avx2.h | 305 ++++++++++++++++++++++++
>  lib/librte_acl/acl_run_sse.c  | 537 +-----------------------------------------
>  lib/librte_acl/acl_run_sse.h  | 533 +++++++++++++++++++++++++++++++++++++++++
>  lib/librte_acl/rte_acl.c      |   5 +-
>  lib/librte_acl/rte_acl.h      |   2 +
>  9 files changed, 917 insertions(+), 538 deletions(-)
>  create mode 100644 lib/librte_acl/acl_run_avx2.c
>  create mode 100644 lib/librte_acl/acl_run_avx2.h
>  create mode 100644 lib/librte_acl/acl_run_sse.h
> 
> diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> index 65e566d..223ec31 100644
> --- a/lib/librte_acl/Makefile
> +++ b/lib/librte_acl/Makefile
> @@ -45,8 +45,17 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx2.c
>  
>  CFLAGS_acl_run_sse.o += -msse4.1
> +ifeq ($(CC), icc)
> +CFLAGS_acl_run_avx2.o += -march=core-avx2
> +else ifneq ($(shell \
> +test $(GCC_MAJOR_VERSION) -le 4 -a $(GCC_MINOR_VERSION) -le 6 && echo 1), 1)
> +CFLAGS_acl_run_avx2.o += -mavx2
> +else
> +CFLAGS_acl_run_avx2.o += -msse4.1
> +endif
>  
This seems broken.  You've unilaterally included acl_run_avx2.c in the build
list above, but only enable -mavx2 if the compiler is at least gcc 4.6.  Unless
you want to make gcc 4.6 a requirement for building, you need to also exclude
the file above from the build list.  That in turn I think allows you to remove a
bunch of the ifdeffing that you've done in some of the avx2 specific files.

Neil
  
Ananyev, Konstantin Dec. 15, 2014, 4:33 p.m. UTC | #2
Hi Neil,

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Monday, December 15, 2014 4:00 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> 
> On Sun, Dec 14, 2014 at 06:10:52PM +0000, Konstantin Ananyev wrote:
> > Introduce new classify() method that uses AVX2 instructions.
> > From my measurements:
> > On HSW boards when processing >= 16 packets per call,
> > AVX2 method outperforms its SSE counterpart by 10-25%,
> > (depending on the ruleset).
> > At runtime, this method is selected as default one on HW that supports AVX2.
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > ---
> >  lib/librte_acl/Makefile       |   9 +
> >  lib/librte_acl/acl.h          |   4 +
> >  lib/librte_acl/acl_run.h      |   2 +-
> >  lib/librte_acl/acl_run_avx2.c |  58 +++++
> >  lib/librte_acl/acl_run_avx2.h | 305 ++++++++++++++++++++++++
> >  lib/librte_acl/acl_run_sse.c  | 537 +-----------------------------------------
> >  lib/librte_acl/acl_run_sse.h  | 533 +++++++++++++++++++++++++++++++++++++++++
> >  lib/librte_acl/rte_acl.c      |   5 +-
> >  lib/librte_acl/rte_acl.h      |   2 +
> >  9 files changed, 917 insertions(+), 538 deletions(-)
> >  create mode 100644 lib/librte_acl/acl_run_avx2.c
> >  create mode 100644 lib/librte_acl/acl_run_avx2.h
> >  create mode 100644 lib/librte_acl/acl_run_sse.h
> >
> > diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> > index 65e566d..223ec31 100644
> > --- a/lib/librte_acl/Makefile
> > +++ b/lib/librte_acl/Makefile
> > @@ -45,8 +45,17 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> > +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx2.c
> >
> >  CFLAGS_acl_run_sse.o += -msse4.1
> > +ifeq ($(CC), icc)
> > +CFLAGS_acl_run_avx2.o += -march=core-avx2
> > +else ifneq ($(shell \
> > +test $(GCC_MAJOR_VERSION) -le 4 -a $(GCC_MINOR_VERSION) -le 6 && echo 1), 1)
> > +CFLAGS_acl_run_avx2.o += -mavx2
> > +else
> > +CFLAGS_acl_run_avx2.o += -msse4.1
> > +endif
> >
> This seems broken.  You've unilaterally included acl_run_avx2.c in the build
> list above, but only enable -mavx2 if the compiler is at least gcc 4.6.

Actually 4.7 (before that version, as far as I know, gcc doesn't support AVX2).

>  Unless
> you want to make gcc 4.6 a requirement for building,

I believe DPDK is required to be buildable by gcc 4.6.
As I remember, we have to support it all the way down to gcc 4.3.

> you need to also exclude
> the file above from the build list.

That means that for gcc 4.6 and below, rte_acl_classify_avx2() would not be defined.
And then at runtime, I have to check for that somehow and (re)populate classify_fns[].
Doesn't seem like a good way to me.
Instead, I prefer to always build acl_run_avx2.c,
but for old compilers that don't support AVX2,
rte_acl_classify_avx2() would simply be identical to rte_acl_classify_sse().

>  That in turn I think allows you to remove a
> bunch of the ifdeffing that you've done in some of the avx2 specific files.

Actually there are not many of them.
One in acl_run_avx2.h and another in acl_run_avx2.c.

Konstantin

> 
> Neil
  
Neil Horman Dec. 15, 2014, 8:20 p.m. UTC | #3
On Mon, Dec 15, 2014 at 04:33:47PM +0000, Ananyev, Konstantin wrote:
> Hi Neil,
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Monday, December 15, 2014 4:00 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > 
> > On Sun, Dec 14, 2014 at 06:10:52PM +0000, Konstantin Ananyev wrote:
> > > Introduce new classify() method that uses AVX2 instructions.
> > > From my measurements:
> > > On HSW boards when processing >= 16 packets per call,
> > > AVX2 method outperforms its SSE counterpart by 10-25%,
> > > (depending on the ruleset).
> > > At runtime, this method is selected as default one on HW that supports AVX2.
> > >
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > ---
> > >  lib/librte_acl/Makefile       |   9 +
> > >  lib/librte_acl/acl.h          |   4 +
> > >  lib/librte_acl/acl_run.h      |   2 +-
> > >  lib/librte_acl/acl_run_avx2.c |  58 +++++
> > >  lib/librte_acl/acl_run_avx2.h | 305 ++++++++++++++++++++++++
> > >  lib/librte_acl/acl_run_sse.c  | 537 +-----------------------------------------
> > >  lib/librte_acl/acl_run_sse.h  | 533 +++++++++++++++++++++++++++++++++++++++++
> > >  lib/librte_acl/rte_acl.c      |   5 +-
> > >  lib/librte_acl/rte_acl.h      |   2 +
> > >  9 files changed, 917 insertions(+), 538 deletions(-)
> > >  create mode 100644 lib/librte_acl/acl_run_avx2.c
> > >  create mode 100644 lib/librte_acl/acl_run_avx2.h
> > >  create mode 100644 lib/librte_acl/acl_run_sse.h
> > >
> > > diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> > > index 65e566d..223ec31 100644
> > > --- a/lib/librte_acl/Makefile
> > > +++ b/lib/librte_acl/Makefile
> > > @@ -45,8 +45,17 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
> > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
> > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
> > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> > > +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx2.c
> > >
> > >  CFLAGS_acl_run_sse.o += -msse4.1
> > > +ifeq ($(CC), icc)
> > > +CFLAGS_acl_run_avx2.o += -march=core-avx2
> > > +else ifneq ($(shell \
> > > +test $(GCC_MAJOR_VERSION) -le 4 -a $(GCC_MINOR_VERSION) -le 6 && echo 1), 1)
> > > +CFLAGS_acl_run_avx2.o += -mavx2
> > > +else
> > > +CFLAGS_acl_run_avx2.o += -msse4.1
> > > +endif
> > >
> > This seems broken.  You've unilaterally included acl_run_avx2.c in the build
> > list above, but only enable -mavx2 if the compiler is at least gcc 4.6.
> 
> Actually 4.7 (before that version, as I know,  gcc doesn't support avx2) 
> 
> >  Unless
> > you want to make gcc 4.6 a requirement for building,
> 
> I believe DPDK is required to be buildable by gcc 4.6
> As I remember, we have to support it all the way down to gcc 4.3.
> 
> > you need to also exclude
> > the file above from the build list.
> 
> That means that for  gcc 4.6 and below rte_acl_classify_avx2() would not be defined.
> And then at runtime, I have to check for that somehow and (re)populate classify_fns[]. 
> Doesn't seem like a good way to me.
There are plenty of ways around that.

At a minimum you could make the classify_fns array the one place that you need
to add an ifdef __AVX__ call.

You could also create a secondary definition of rte_acl_classify_avx2, and mark
it as a weak symbol, which only returns -EOPNOTSUPP.  That would be good, since
the right thing will just automatically happen then if you don't build the
actual avx2 classification code

> Instead, I prefer to always build acl_run_avx2.c,
But you can't do that.  You just said above that you need to support down to gcc
4.3.  I see you've worked around that with some additional ifdef __AVX__
instructions, but in so doing you ignore the possibility that sse isn't
supported, so you need to add __SSE__ checks now as well.  ifdeffing that much
just isn't scalable.  And for your effort, you get an AVX2 classification path
that potentially doesn't actually do vectorized classification.

It really seems better to me to not build the code if the compiler doesn't
support the instruction set it was meant to enable, and change the
classification function pointer to something that informs the user of the lack
of support at run time.

> but for old compilers that don't support AVX2 -
> rte_acl_classify_avx2() would simply be identical to rte_acl_classify_sse(). 
> 
That doesn't make sense to me, for two reasons:

1) What if the machine being targeted doesn't support sse either?

2) If an application selects an AVX2 classifier, I as a developer expect to
either get AVX2 based classification, or an error indicating that I can't do
AVX2 classification, not a silent performance degradation down to scalar
classification.

> >  That in turn I think allows you to remove a
> > bunch of the ifdeffing that you've done in some of the avx2 specific files.
> 
> Actually there are not many of them.
> One in acl_run_avx2.h and another in acl_run_avx2.c.
> 
2 in acl_run_avx2.h and 1 in rte_acl_osdep_alone.h, which is really 3 more than
you need if you just do an intelligent weak classifier function definition.

Neil
  
Ananyev, Konstantin Dec. 16, 2014, 4:16 p.m. UTC | #4
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Monday, December 15, 2014 8:21 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> 
> On Mon, Dec 15, 2014 at 04:33:47PM +0000, Ananyev, Konstantin wrote:
> > Hi Neil,
> >
> > > -----Original Message-----
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Monday, December 15, 2014 4:00 PM
> > > To: Ananyev, Konstantin
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > >
> > > On Sun, Dec 14, 2014 at 06:10:52PM +0000, Konstantin Ananyev wrote:
> > > > Introduce new classify() method that uses AVX2 instructions.
> > > > From my measurements:
> > > > On HSW boards when processing >= 16 packets per call,
> > > > AVX2 method outperforms its SSE counterpart by 10-25%,
> > > > (depending on the ruleset).
> > > > At runtime, this method is selected as default one on HW that supports AVX2.
> > > >
> > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > > ---
> > > >  lib/librte_acl/Makefile       |   9 +
> > > >  lib/librte_acl/acl.h          |   4 +
> > > >  lib/librte_acl/acl_run.h      |   2 +-
> > > >  lib/librte_acl/acl_run_avx2.c |  58 +++++
> > > >  lib/librte_acl/acl_run_avx2.h | 305 ++++++++++++++++++++++++
> > > >  lib/librte_acl/acl_run_sse.c  | 537 +-----------------------------------------
> > > >  lib/librte_acl/acl_run_sse.h  | 533 +++++++++++++++++++++++++++++++++++++++++
> > > >  lib/librte_acl/rte_acl.c      |   5 +-
> > > >  lib/librte_acl/rte_acl.h      |   2 +
> > > >  9 files changed, 917 insertions(+), 538 deletions(-)
> > > >  create mode 100644 lib/librte_acl/acl_run_avx2.c
> > > >  create mode 100644 lib/librte_acl/acl_run_avx2.h
> > > >  create mode 100644 lib/librte_acl/acl_run_sse.h
> > > >
> > > > diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> > > > index 65e566d..223ec31 100644
> > > > --- a/lib/librte_acl/Makefile
> > > > +++ b/lib/librte_acl/Makefile
> > > > @@ -45,8 +45,17 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
> > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
> > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
> > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> > > > +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx2.c
> > > >
> > > >  CFLAGS_acl_run_sse.o += -msse4.1
> > > > +ifeq ($(CC), icc)
> > > > +CFLAGS_acl_run_avx2.o += -march=core-avx2
> > > > +else ifneq ($(shell \
> > > > +test $(GCC_MAJOR_VERSION) -le 4 -a $(GCC_MINOR_VERSION) -le 6 && echo 1), 1)
> > > > +CFLAGS_acl_run_avx2.o += -mavx2
> > > > +else
> > > > +CFLAGS_acl_run_avx2.o += -msse4.1
> > > > +endif
> > > >
> > > This seems broken.  You've unilaterally included acl_run_avx2.c in the build
> > > list above, but only enable -mavx2 if the compiler is at least gcc 4.6.
> >
> > Actually 4.7 (before that version, as I know,  gcc doesn't support avx2)
> >
> > >  Unless
> > > you want to make gcc 4.6 a requirement for building,
> >
> > I believe DPDK is required to be buildable by gcc 4.6
> > As I remember, we have to support it all the way down to gcc 4.3.
> >
> > > you need to also exclude
> > > the file above from the build list.
> >
> > That means that for  gcc 4.6 and below rte_acl_classify_avx2() would not be defined.
> > And then at runtime, I have to check for that somehow and (re)populate classify_fns[].
> > Doesn't seem like a good way to me.
> There are plenty of ways around that.
> 
> At a minimum you could make the classify_fns array the one place that you need
> to add an ifdef __AVX__ call.
> 
> You could also create a secondary definition of rte_acl_classify_avx2, and mark
> it as a weak symbol, which only returns -EOPNOTSUPP.  That would be good, since
> the right thing will just automatically happen then if you don't build the
> actual avx2 classification code
> 
> > Instead, I prefer to always build acl_run_avx2.c,


> But you can't do that.  You just said above that you need to support down to gcc
> 4.3.  I see you've worked around that with some additional ifdef __AVX__
> instructions, but in so doing you ignore the possibility that sse isn't
> supported, so you need to add __SSE__ checks now as well.  ifdeffing that much
> just isn't scalable.

We don't need to worry about compilers without SSE4.1 support.
I believe that all compilers that DPDK has to build with do support SSE4.1.
So for SSE4.1 we only have to worry about the situation when the target CPU doesn't support it.
We manage that by runtime selection.
For AVX2 the situation is a bit different: it could be both the compiler and the target CPU that don't support it.

>  And for your effort, you get an AVX2 classification path
> that potentially doesn't actually do vectorized classification.
> 
> It really seems better to me to not build the code if the compiler doesn't
> support the instruction set it was meant to enable, and change the
> classification function pointer to something that informs the user of the lack
> of support at run time.
> 
> > but for old compilers that don't support AVX2 -
> > rte_acl_classify_avx2() would simply be identical to rte_acl_classify_sse().
> >
> That doesn't make sense to me, for two reasons:
> 
> 1) What if the machine being targeted doesn't support sse either?
> 

Exactly the same as what happens now on a machine with no SSE4.1 support.
There is absolutely no difference here.

> 2) If an application selects an AVX2 classifier, I as a developer expect to
> either get AVX2 based classification, or an error indicating that I can't do
> AVX2 classification, not a silent performance degradation down to scalar
> classification.

In fact I was considering both variants for compilers not supporting AVX2:
1. Silently degrade to the SSE method.
2. Create a dummy function rte_acl_classify_error() and put it into classify_fns[RTE_ACL_CLASSIFY_AVX2].

I chose #1 because it seems like less of a distraction for the user -
everything would keep working as before; the user just wouldn't see any improvement compared to the SSE method.
Again, I didn't want to spread "#ifdef __AVX2__" into rte_acl.c.
Though I don't have any strong opinion here.
So if you can provide some good reason why #2 is preferable, I am ok to switch to #2. 

> 
> > >  That in turn I think allows you to remove a
> > > bunch of the ifdeffing that you've done in some of the avx2 specific files.
> >
> > Actually there are not many of them.
> > One in acl_run_avx2.h and another in acl_run_avx2.c.
> >
> 2 in acl_run_avx2.h and 1 in rte_acl_osdep_alone.h, which is really 3 more than
> you need if you just do an intelligent weak classifier function definition.

grep -n __AVX2__ lib/librte_acl/*.[c,h] | grep -v endif
lib/librte_acl/acl_run_avx2.c:45:#ifdef __AVX2__
lib/librte_acl/acl_run_avx2.h:36:#ifdef __AVX2__

rte_acl_osdep_alone.h is a different story.
It needs to be there anyway, as in rte_common_vect.h.
In fact, rte_acl_osdep_alone.h is only needed when RTE_LIBRTE_ACL_STANDALONE=y.
That comes from the old days, when we had to support building the librte_acl library without the rest of DPDK.
I think we don't need it anymore and plan to remove it.
Just thought it should be in a separate patch.
Konstantin

> 
> Neil
  
Ananyev, Konstantin Dec. 17, 2014, 12:38 a.m. UTC | #5
> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, December 16, 2014 4:17 PM
> To: 'Neil Horman'
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> 
> 
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Monday, December 15, 2014 8:21 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> >
> > On Mon, Dec 15, 2014 at 04:33:47PM +0000, Ananyev, Konstantin wrote:
> > > Hi Neil,
> > >
> > > > -----Original Message-----
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Monday, December 15, 2014 4:00 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > > >
> > > > On Sun, Dec 14, 2014 at 06:10:52PM +0000, Konstantin Ananyev wrote:
> > > > > Introduce new classify() method that uses AVX2 instructions.
> > > > > From my measurements:
> > > > > On HSW boards when processing >= 16 packets per call,
> > > > > AVX2 method outperforms its SSE counterpart by 10-25%,
> > > > > (depending on the ruleset).
> > > > > At runtime, this method is selected as default one on HW that supports AVX2.
> > > > >
> > > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > > > ---
> > > > >  lib/librte_acl/Makefile       |   9 +
> > > > >  lib/librte_acl/acl.h          |   4 +
> > > > >  lib/librte_acl/acl_run.h      |   2 +-
> > > > >  lib/librte_acl/acl_run_avx2.c |  58 +++++
> > > > >  lib/librte_acl/acl_run_avx2.h | 305 ++++++++++++++++++++++++
> > > > >  lib/librte_acl/acl_run_sse.c  | 537 +-----------------------------------------
> > > > >  lib/librte_acl/acl_run_sse.h  | 533 +++++++++++++++++++++++++++++++++++++++++
> > > > >  lib/librte_acl/rte_acl.c      |   5 +-
> > > > >  lib/librte_acl/rte_acl.h      |   2 +
> > > > >  9 files changed, 917 insertions(+), 538 deletions(-)
> > > > >  create mode 100644 lib/librte_acl/acl_run_avx2.c
> > > > >  create mode 100644 lib/librte_acl/acl_run_avx2.h
> > > > >  create mode 100644 lib/librte_acl/acl_run_sse.h
> > > > >
> > > > > diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> > > > > index 65e566d..223ec31 100644
> > > > > --- a/lib/librte_acl/Makefile
> > > > > +++ b/lib/librte_acl/Makefile
> > > > > @@ -45,8 +45,17 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
> > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
> > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
> > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> > > > > +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx2.c
> > > > >
> > > > >  CFLAGS_acl_run_sse.o += -msse4.1
> > > > > +ifeq ($(CC), icc)
> > > > > +CFLAGS_acl_run_avx2.o += -march=core-avx2
> > > > > +else ifneq ($(shell \
> > > > > +test $(GCC_MAJOR_VERSION) -le 4 -a $(GCC_MINOR_VERSION) -le 6 && echo 1), 1)
> > > > > +CFLAGS_acl_run_avx2.o += -mavx2
> > > > > +else
> > > > > +CFLAGS_acl_run_avx2.o += -msse4.1
> > > > > +endif
> > > > >
> > > > This seems broken.  You've unilaterally included acl_run_avx2.c in the build
> > > > list above, but only enable -mavx2 if the compiler is at least gcc 4.6.
> > >
> > > Actually 4.7 (before that version, as I know,  gcc doesn't support avx2)
> > >
> > > >  Unless
> > > > you want to make gcc 4.6 a requirement for building,
> > >
> > > I believe DPDK is required to be buildable by gcc 4.6
> > > As I remember, we have to support it all the way down to gcc 4.3.
> > >
> > > > you need to also exclude
> > > > the file above from the build list.
> > >
> > > That means that for  gcc 4.6 and below rte_acl_classify_avx2() would not be defined.
> > > And then at runtime, I have to check for that somehow and (re)populate classify_fns[].
> > > Doesn't seem like a good way to me.
> > There are plenty of ways around that.
> >
> > At a minimum you could make the classify_fns array the one place that you need
> > to add an ifdef __AVX__ call.
> >
> > You could also create a secondary definition of rte_acl_classify_avx2, and mark
> > it as a weak symbol, which only returns -EOPNOTSUPP.  That would be good, since
> > the right thing will just automatically happen then if you don't build the
> > actual avx2 classification code
> >
> > > Instead, I prefer to always build acl_run_avx2.c,
> 
> 
> > But you can't do that.  You just said above that you need to support down to gcc
> > 4.3.  I see you've worked around that with some additional ifdef __AVX__
> > instructions, but in so doing you ignore the possibility that sse isn't
> > supported, so you need to add __SSE__ checks now as well.  ifdeffing that much
> > just isn't scalable.
> 
> We don't need to worry about compiler without SSE4.1 support.
> I believe that all compilers that DPDK has to build with do support SSE4.1.
> So for SSE4.1 we only have to worry about the situation when the target CPU doesn't support it.
> We manage it by runtime selection.
> For AVX2 - situation is a bit different: it could be both compiler and target CPU that don't support it.
> 
> >  And for your effort, you get an AVX2 classification path
> > that potentially doesn't actually do vectorized classification.
> >
> > It really seems better to me to not build the code if the compiler doesn't
> > support the instruction set it was meant to enable, and change the
> > classification function pointer to something that informs the user of the lack
> > of support at run time.
> >
> > > but for old compilers that don't support AVX2 -
> > > rte_acl_classify_avx2() would simply be identical to rte_acl_classify_sse().
> > >
> > That doesn't make sense to me, for two reasons:
> >
> > 1) What if the machine being targeted doesn't support sse either?
> >
> 
> Exactly the same as what happens now on a machine with no SSE4.1 support.
> There is absolutely no difference here.
> 
> > 2) If an application selects an AVX2 classifier, I as a developer expect to
> > either get AVX2 based classification, or an error indicating that I can't do
> > AVX2 classification, not a silent performance degradation down to scalar
> > classification.
> 
> In fact I was considering both variants for compilers not supporting AVX2:
> 1. silently degrade to SSE method.
> 2. create  a dummy function rte_acl_classify_error() and put it  into classify_fns[RTE_ACL_CLASSIFY_AVX2].
> 
> I choose #1 because it seems like a less distraction for the user -
> all would keep working as before, user just wouldn't see any improvement comparing to SSE method.
> Again didn't want to spread "ifdef __AVX2__" into rte_acl.c

One correction: we can't rely on the __AVX2__ define in rte_acl.c, as it would always be off for rte_acl.c on the 'default' target.
We'll have to check against __GNUC_MINOR__ here.

> Though I don't have any strong opinion here.
> So if you can provide some good reason why #2 is preferable, I am ok to switch to #2.
> 
> >
> > > >  That in turn I think allows you to remove a
> > > > bunch of the ifdeffing that you've done in some of the avx2 specific files.
> > >
> > > Actually there are not many of them.
> > > One in acl_run_avx2.h and another in acl_run_avx2.c.
> > >
> > 2 in acl_run_avx2.h and 1 in rte_acl_osdep_alone.h, which is really 3 more than
> > you need if you just do an intelligent weak classifier function definition.
> 
> grep -n __AVX2__ lib/librte_acl/*.[c,h] | grep -v endif
> lib/librte_acl/acl_run_avx2.c:45:#ifdef __AVX2__
> lib/librte_acl/acl_run_avx2.h:36:#ifdef __AVX2__
> 
> rte_acl_osdep_alone.h - is a different story.
> It needs to be there anyway, as in rte_common_vect.h.
> In fact  rte_acl_osdep_alone.h is only needed for cases when RTE_LIBRTE_ACL_STANDALONE=y.
> That comes from the old days, when we had to support building the librte_acl library without the rest of DPDK.
> I think we don't need it anymore and plan to remove it.
> Just thought it should  be in a separate patch.
> Konstantin
> 
> >
> > Neil
  
Neil Horman Dec. 17, 2014, 3:32 p.m. UTC | #6
On Tue, Dec 16, 2014 at 04:16:48PM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Monday, December 15, 2014 8:21 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > 
> > On Mon, Dec 15, 2014 at 04:33:47PM +0000, Ananyev, Konstantin wrote:
> > > Hi Neil,
> > >
> > > > -----Original Message-----
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Monday, December 15, 2014 4:00 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > > >
> > > > On Sun, Dec 14, 2014 at 06:10:52PM +0000, Konstantin Ananyev wrote:
> > > > > Introduce new classify() method that uses AVX2 instructions.
> > > > > From my measurements:
> > > > > On HSW boards when processing >= 16 packets per call,
> > > > > AVX2 method outperforms its SSE counterpart by 10-25%,
> > > > > (depending on the ruleset).
> > > > > At runtime, this method is selected as default one on HW that supports AVX2.
> > > > >
> > > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > > > ---
> > > > >  lib/librte_acl/Makefile       |   9 +
> > > > >  lib/librte_acl/acl.h          |   4 +
> > > > >  lib/librte_acl/acl_run.h      |   2 +-
> > > > >  lib/librte_acl/acl_run_avx2.c |  58 +++++
> > > > >  lib/librte_acl/acl_run_avx2.h | 305 ++++++++++++++++++++++++
> > > > >  lib/librte_acl/acl_run_sse.c  | 537 +-----------------------------------------
> > > > >  lib/librte_acl/acl_run_sse.h  | 533 +++++++++++++++++++++++++++++++++++++++++
> > > > >  lib/librte_acl/rte_acl.c      |   5 +-
> > > > >  lib/librte_acl/rte_acl.h      |   2 +
> > > > >  9 files changed, 917 insertions(+), 538 deletions(-)
> > > > >  create mode 100644 lib/librte_acl/acl_run_avx2.c
> > > > >  create mode 100644 lib/librte_acl/acl_run_avx2.h
> > > > >  create mode 100644 lib/librte_acl/acl_run_sse.h
> > > > >
> > > > > diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> > > > > index 65e566d..223ec31 100644
> > > > > --- a/lib/librte_acl/Makefile
> > > > > +++ b/lib/librte_acl/Makefile
> > > > > @@ -45,8 +45,17 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
> > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
> > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
> > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> > > > > +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx2.c
> > > > >
> > > > >  CFLAGS_acl_run_sse.o += -msse4.1
> > > > > +ifeq ($(CC), icc)
> > > > > +CFLAGS_acl_run_avx2.o += -march=core-avx2
> > > > > +else ifneq ($(shell \
> > > > > +test $(GCC_MAJOR_VERSION) -le 4 -a $(GCC_MINOR_VERSION) -le 6 && echo 1), 1)
> > > > > +CFLAGS_acl_run_avx2.o += -mavx2
> > > > > +else
> > > > > +CFLAGS_acl_run_avx2.o += -msse4.1
> > > > > +endif
> > > > >
> > > > This seems broken.  You've unilaterally included acl_run_avx2.c in the build
> > > > list above, but only enable -mavx2 if the compiler is at least gcc 4.6.
> > >
> > > Actually 4.7 (before that version, as I know,  gcc doesn't support avx2)
> > >
> > > >  Unless
> > > > you want to make gcc 4.6 a requirement for building,
> > >
> > > I believe DPDK is required to be buildable by gcc 4.6
> > > As I remember, we have to support it all the way down to gcc 4.3.
> > >
> > > > you need to also exclude
> > > > the file above from the build list.
> > >
> > > That means that for  gcc 4.6 and below rte_acl_classify_avx2() would not be defined.
> > > And then at runtime, I have to check for that somehow and (re)populate classify_fns[].
> > > Doesn't seem like a good way to me.
> > There are plenty of ways around that.
> > 
> > At a minimum you could make the classify_fns array the one place that you need
> > to add an ifdef __AVX__ call.
> > 
> > You could also create a secondary definition of rte_acl_classify_avx2, and mark
> > it as a weak symbol, which only returns -EOPNOTSUPP.  That would be good, since
> > the right thing will just automatically happen then if you don't build the
> > actual avx2 classification code
> > 
> > > Instead, I prefer to always build acl_run_avx2.c,
> 
> 
> > But you can't do that.  You just said above that you need to support down to gcc
> > 4.3.  I see you've worked around that with some additional ifdef __AVX__
> > instructions, but in so doing you ignore the possibility that sse isn't
> > supported, so you need to add __SSE__ checks now as well.  ifdeffing that much
> > just isn't scalable.
> 
> We don't need to worry about compilers without SSE4.1 support.
> I believe that all compilers that DPDK has to build with do support SSE4.1.
> So for SSE4.1 we only have to worry about the situation where the target CPU doesn't support it.
> We manage that by runtime selection.
> For AVX2 the situation is a bit different: it could be both the compiler and the target CPU that don't support it.
> 
> >  And for your effort, you get an AVX2 classification path
> > that potentially doesn't actually do vectorized classification.
> > 
> > It really seems better to me to not build the code if the compiler doesn't
> > support the instruction set it was meant to enable, and change the
> > classification function pointer to something that informs the user of the lack
> > of support at run time.
> > 
> > > but for old compilers that don't support AVX2 -
> > > rte_acl_classify_avx2() would simply be identical to rte_acl_classify_sse().
> > >
> > That doesn't make sense to me, for two reasons:
> > 
> > 1) What if the machine being targeted doesn't support sse either?
> > 
> 
> Exactly the same as what happens now on a machine with no SSE4.1 support.
> There is absolutely no difference here.
> 
> > 2) If an application selects an AVX2 classifier, I as a developer expect to
> > either get AVX2 based classification, or an error indicating that I can't do
> > AVX2 classification, not a silent performance degradation down to scalar
> > classification.
> 
> In fact I was considering both variants for compilers not supporting AVX2:
> 1. silently degrade to SSE method.
> 2. create a dummy function rte_acl_classify_error() and put it into classify_fns[RTE_ACL_CLASSIFY_AVX2].
> 
> I chose #1 because it seems like less of a distraction for the user -
> all would keep working as before, the user just wouldn't see any improvement compared to the SSE method. 
> Again didn't want to spread "ifdef __AVX2__" into rte_acl.c
> Though I don't have any strong opinion here.
> So if you can provide some good reason why #2 is preferable, I am ok to switch to #2. 
> 
Because 2 doesn't require any ifdeffing.  As you note above, the problem here is
that AVX2 support is both compiler and machine dependent.  If you make a weak
symbol version of rte_acl_classify_avx2 that always gets built, then you've
reduced the problem to just being compiler support, which you can check in the
Makefile.

> > 
> > > >  That in turn I think allows you to remove a
> > > > bunch of the ifdeffing that you've done in some of the avx2 specific files.
> > >
> > > Actually there are not many of them.
> > > One in acl_run_avx2.h and another in acl_run_avx2.c.
> > >
> > 2 in acl_run_avx2.h and 1 in rte_acl_osdep_alone.h, which is really 3 more than
> > you need if you just do an intelligent weak classifier function definition.
> 
> grep -n __AVX2__ lib/librte_acl/*.[c,h] | grep -v endif
> lib/librte_acl/acl_run_avx2.c:45:#ifdef __AVX2__
> lib/librte_acl/acl_run_avx2.h:36:#ifdef __AVX2__
> 
> rte_acl_osdep_alone.h - is a different story.
> It needs to be there anyway, as in rte_common_vect.h.
> In fact rte_acl_osdep_alone.h is only needed when RTE_LIBRTE_ACL_STANDALONE=y.
> That comes from the old days, when we had to support building the librte_acl library without the rest of DPDK.  
> I think we don't need it anymore and plan to remove it.
> Just thought it should be in a separate patch. 
> Konstantin
> 
> > 
> > Neil
>
  
Ananyev, Konstantin Dec. 17, 2014, 7:22 p.m. UTC | #7
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, December 17, 2014 3:33 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> 
> Because 2 doesn't require any ifdeffing.  As you note above, the problem here is
> that AVX2 support is both compiler and machine dependent.  If you make a weak
> symbol version of rte_acl_classify_avx2 that always gets built, then you've
> reduced the problem to just being compiler support, which you can check in the
> Makefile.

I don't think we'll get rid of ifdefing with #2.
We'll remove 2 ifdefs in acl_run_avx2.h, but then we have to introduce 2 new ones in rte_acl.c instead.
From my understanding, we'll need something like this:

static const rte_acl_classify_t classify_fns[] = {
        [RTE_ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
        [RTE_ACL_CLASSIFY_SCALAR] = rte_acl_classify_scalar,
        [RTE_ACL_CLASSIFY_SSE] = rte_acl_classify_sse,
+#if (defined __GNUC__ &&  __GNUC__ <= 4 && __GNUC_MINOR__ < 7)
+      [RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_error,
+#else  
      [RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
+#endif
};

static void __attribute__((constructor))
rte_acl_init(void)
{
        enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;

+#if (defined __GNUC__ &&  __GNUC__ <= 4 && __GNUC_MINOR__ < 7)
+      if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+#else
        if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
                alg = RTE_ACL_CLASSIFY_AVX2;
        else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+#endif
                alg = RTE_ACL_CLASSIFY_SSE;
        rte_acl_set_default_classify(alg);
}

Correct?
Konstantin

> 
> > >
> > > > >  That in turn I think allows you to remove a
> > > > > bunch of the ifdeffing that you've done in some of the avx2 specific files.
> > > >
> > > > Actually there are not many of them.
> > > > One in acl_run_avx2.h and another in acl_run_avx2.c.
> > > >
> > > 2 in acl_run_avx2.h and 1 in rte_acl_osdep_alone.h, which is really 3 more than
> > > you need if you just do an intellegent weak classifier function defintion.
> >
> > grep -n __AVX2__ lib/librte_acl/*.[c,h] | grep -v endif
> > lib/librte_acl/acl_run_avx2.c:45:#ifdef __AVX2__
> > lib/librte_acl/acl_run_avx2.h:36:#ifdef __AVX2__
> >
> > rte_acl_osdep_alone.h - is a different story.
> > It needs to be there anyway, as in rte_common_vect.h.
> > In fact  rte_acl_osdep_alone.h is only needed for cases when RTE_LIBRTE_ACL_STANDALONE=y.
> > That comes from the old days, when we had to to support building librte_acl library without the rest of DPDK.
> > I think we don't need it anymore and plan to remove it.
> > Just thought it should  be in a separate patch.
> > Konstantin
> >
> > >
> > > Neil
> >
  
Neil Horman Dec. 17, 2014, 8:27 p.m. UTC | #8
On Wed, Dec 17, 2014 at 07:22:06PM +0000, Ananyev, Konstantin wrote:
> I don't think we'll get rid of ifdefing with #2.
> We'll remove 2 ifdefs in acl_run_avx2.h, but then we have to introduce 2 new ones in rte_acl.c instead.
> From my understanding, we'll need something like this:
> 
> static const rte_acl_classify_t classify_fns[] = {
>         [RTE_ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
>         [RTE_ACL_CLASSIFY_SCALAR] = rte_acl_classify_scalar,
>         [RTE_ACL_CLASSIFY_SSE] = rte_acl_classify_sse,
> +#if (defined __GNUC__ &&  __GNUC__ <= 4 && __GNUC_MINOR__ < 7)
> +      [RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_error,
> +#else  
>       [RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
> +#endif


You don't need to do this, you need to use a weak symbol:
int rte_acl_classify_avx2(...) __attribute__((weak))
{
	return -EOPNOTSUPP;
}


Then in the acl_run_avx2.c file define it again without the weak attribute.

That way, you can do conditional compilation, and when you do, the "real" symbol
overrides the weak one.

> };
> 
> static void __attribute__((constructor))
> rte_acl_init(void)
> {
>         enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
> 
> +#if (defined __GNUC__ &&  __GNUC__ <= 4 && __GNUC_MINOR__ < 7)
>         if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
>                 alg = RTE_ACL_CLASSIFY_AVX2;
>         else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> +#else
> +      if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
>                 alg = RTE_ACL_CLASSIFY_SSE;
> +#endif
>         rte_acl_set_default_classify(alg);
> }
> 
Why would you do this?  The cpu feature flag definitions aren't matched to
compiler support; they are always defined.  You can still check for AVX2 support
at run time in code built by a compiler that can't emit the instructions.

Neil

  
Ananyev, Konstantin Dec. 18, 2014, 3:01 p.m. UTC | #9
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, December 17, 2014 8:28 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> 
> On Wed, Dec 17, 2014 at 07:22:06PM +0000, Ananyev, Konstantin wrote:
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Wednesday, December 17, 2014 3:33 PM
> > > To: Ananyev, Konstantin
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > >
> > > On Tue, Dec 16, 2014 at 04:16:48PM +0000, Ananyev, Konstantin wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > Sent: Monday, December 15, 2014 8:21 PM
> > > > > To: Ananyev, Konstantin
> > > > > Cc: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > > > >
> > > > > On Mon, Dec 15, 2014 at 04:33:47PM +0000, Ananyev, Konstantin wrote:
> > > > > > Hi Neil,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > > Sent: Monday, December 15, 2014 4:00 PM
> > > > > > > To: Ananyev, Konstantin
> > > > > > > Cc: dev@dpdk.org
> > > > > > > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > > > > > >
> > > > > > > On Sun, Dec 14, 2014 at 06:10:52PM +0000, Konstantin Ananyev wrote:
> > > > > > > > Introduce new classify() method that uses AVX2 instructions.
> > > > > > > > From my measurements:
> > > > > > > > On HSW boards when processing >= 16 packets per call,
> > > > > > > > AVX2 method outperforms it's SSE counterpart by 10-25%,
> > > > > > > > (depending on the ruleset).
> > > > > > > > At runtime, this method is selected as default one on HW that supports AVX2.
> > > > > > > >
> > > > > > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > > > > > > ---
> > > > > > > >  lib/librte_acl/Makefile       |   9 +
> > > > > > > >  lib/librte_acl/acl.h          |   4 +
> > > > > > > >  lib/librte_acl/acl_run.h      |   2 +-
> > > > > > > >  lib/librte_acl/acl_run_avx2.c |  58 +++++
> > > > > > > >  lib/librte_acl/acl_run_avx2.h | 305 ++++++++++++++++++++++++
> > > > > > > >  lib/librte_acl/acl_run_sse.c  | 537 +-----------------------------------------
> > > > > > > >  lib/librte_acl/acl_run_sse.h  | 533 +++++++++++++++++++++++++++++++++++++++++
> > > > > > > >  lib/librte_acl/rte_acl.c      |   5 +-
> > > > > > > >  lib/librte_acl/rte_acl.h      |   2 +
> > > > > > > >  9 files changed, 917 insertions(+), 538 deletions(-)
> > > > > > > >  create mode 100644 lib/librte_acl/acl_run_avx2.c
> > > > > > > >  create mode 100644 lib/librte_acl/acl_run_avx2.h
> > > > > > > >  create mode 100644 lib/librte_acl/acl_run_sse.h
> > > > > > > >
> > > > > > > > diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> > > > > > > > index 65e566d..223ec31 100644
> > > > > > > > --- a/lib/librte_acl/Makefile
> > > > > > > > +++ b/lib/librte_acl/Makefile
> > > > > > > > @@ -45,8 +45,17 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
> > > > > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
> > > > > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
> > > > > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> > > > > > > > +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx2.c
> > > > > > > >
> > > > > > > >  CFLAGS_acl_run_sse.o += -msse4.1
> > > > > > > > +ifeq ($(CC), icc)
> > > > > > > > +CFLAGS_acl_run_avx2.o += -march=core-avx2
> > > > > > > > +else ifneq ($(shell \
> > > > > > > > +test $(GCC_MAJOR_VERSION) -le 4 -a $(GCC_MINOR_VERSION) -le 6 && echo 1), 1)
> > > > > > > > +CFLAGS_acl_run_avx2.o += -mavx2
> > > > > > > > +else
> > > > > > > > +CFLAGS_acl_run_avx2.o += -msse4.1
> > > > > > > > +endif
> > > > > > > >
> > > > > > > This seems broken.  You've unilaterally included acl_run_avx2.c in the build
> > > > > > > list above, but only enable -mavx2 if the compiler is at least gcc 4.6.
> > > > > >
> > > > > > Actually 4.7 (before that version, as I know,  gcc doesn't support avx2)
> > > > > >
> > > > > > >  Unless
> > > > > > > you want to make gcc 4.6 a requirement for building,
> > > > > >
> > > > > > I believe DPDK is required to be buildable by gcc 4.6
> > > > > > As I remember, we have to support it all way down to gcc 4.3.
> > > > > >
> > > > > > > you need to also exclude
> > > > > > > the file above from the build list.
> > > > > >
> > > > > > That means that for  gcc 4.6 and below rte_acl_classify_avx2() would not be defined.
> > > > > > And then at runtime, I have to check for that somehow and (re)populate classify_fns[].
> > > > > > Doesn't seems like a good way to me.
> > > > > There are plenty of ways around that.
> > > > >
> > > > > At a minimum you could make the classify_fns array the one place that you need
> > > > > to add an ifdef __AVX__ call.
> > > > >
> > > > > You could also create a secondary definition of rte_acl_classify_avx2, and mark
> > > > > it as a weak symbol, which only returns -EOPNOTSUPP.  That would be good, since
> > > > > the right thing will just automatically happen then if you don't build the
> > > > > actual avx2 classification code
> > > > >
> > > > > > Instead, I prefer to always build acl_run_avx2.c,
> > > >
> > > >
> > > > > But you can't do that.  You just said above that you need to support down to gcc
> > > > > 4.3.  I see you've worked around that with some additional ifdef __AVX__
> > > > > instructions, but in so doing you ignore the possibiity that sse isn't
> > > > > supported, so you need to add __SSE__ checks now as well.  ifdeffing that much
> > > > > just isn't scalable.
> > > >
> > > > We don't need to worry about compiler without SSE4.1 support.
> > > > I believe that all compilers that DDPDK has to build with, do support SSE4.1.
> > > > So for SSE4.1 we only has to worry about situation when target CPU doesn't support it
> > > > We manage it by runtime selection.
> > > > For AVX2 - situation is a bit different: it could be both compiler and target CPU that don't support it.
> > > >
> > > > >  And for your effort, you get an AVX2 classification path
> > > > > that potentially doesn't actually do vectorized classification.
> > > > >
> > > > > It really seems better to me to not build the code if the compiler doesn't
> > > > > support the instruction set it was meant to enable, and change the
> > > > > classification function pointer to something that informs the user of the lack
> > > > > of support at run time.
> > > > >
> > > > > > but for old compilers that don't support AVX2 -
> > > > > > rte_acl_classify_avx2() would simply be identical to rte_acl_classify_sse().
> > > > > >
> > > > > That doesn't make sense to me, for two reasons:
> > > > >
> > > > > 1) What if the machine being targeted doesn't support sse either?
> > > > >
> > > >
> > > > Exactly the same what is happening now on the machine with now SSE4.1 support.
> > > > There is absolutely no difference here.
> > > >
> > > > > 2) If an application selects an AVX2 classifier, I as a developer expect to
> > > > > either get AVX2 based classification, or an error indicating that I can't do
> > > > > AVX2 classification, not a silent performance degradation down to scalar
> > > > > classification.
> > > >
> > > > In fact I was considering both variants for compilers not supporting AVX2:
> > > > 1. silently degrade to the SSE method.
> > > > 2. create a dummy function rte_acl_classify_error() and put it into classify_fns[RTE_ACL_CLASSIFY_AVX2].
> > > >
> > > > I chose #1 because it seems like less of a distraction for the user -
> > > > everything would keep working as before; the user just wouldn't see any improvement compared to the SSE method.
> > > > Again, I didn't want to spread "#ifdef __AVX2__" into rte_acl.c.
> > > > Though I don't have a strong opinion here.
> > > > So if you can provide some good reason why #2 is preferable, I am OK to switch to #2.
> > > >
> > > Because 2 doesn't require any ifdeffing.  As you note above the problem here is
> > > that AVX2 support is both compiler and machine dependent.  If you make a weak
> > > symbol version of rte_acl_classify_avx2 that always gets built, then you've
> > > reduced the problem to just being compiler support, which you can check in the
> > > makefile.
> >
> > I don't think we'll get rid of ifdefing with #2.
> > We'll remove 2 ifdefs in acl_run_avx2.h, but then we have to introduce 2 new ones in rte_acl.c instead.
> > From my understanding, we'll need something like this:
> >
> > static const rte_acl_classify_t classify_fns[] = {
> >         [RTE_ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
> >         [RTE_ACL_CLASSIFY_SCALAR] = rte_acl_classify_scalar,
> >         [RTE_ACL_CLASSIFY_SSE] = rte_acl_classify_sse,
> > +#if (defined __GNUC__ &&  __GNUC__ <= 4 && __GNUC_MINOR__ < 7)
> > +      [RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_error,
> > +#else
> >       [RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
> > +#endif
> 
> 
> You don't need to do this, you need to use a weak symbol:
> int rte_acl_classify_avx2(...) __attribute__((weak))
> {
> 	return -EOPNOTSUPP;
> }
> 
> 
> Then in the acl_run_avx2.c file define it again, without the weak attribute.
> 
> That way you get conditional compilation, and when the AVX2 file is built,
> the "real" symbol overrides the weak one.

Ah yes, you are right - no need for an ifdef here, though I still think we need one below, in rte_acl_init().

> 
> > };
> >
> > static void __attribute__((constructor))
> > rte_acl_init(void)
> > {
> >         enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
> >
> > +#if !(defined __GNUC__ &&  __GNUC__ <= 4 && __GNUC_MINOR__ < 7)
> >         if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> >                 alg = RTE_ACL_CLASSIFY_AVX2;
> >         else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > +#else
> > +      if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> >                 alg = RTE_ACL_CLASSIFY_SSE;
> > +#endif
> >         rte_acl_set_default_classify(alg);
> > }
> >
> Why would you do this, this cpu feature flag definitions aren't matched to
> compiler support, it should always be defined.

Because if we don't do this, then on a machine that does support AVX2
we'll always set CLASSIFY_AVX2 as the default method, no matter whether the compiler
was able to produce proper code for it or not.
We should set CLASSIFY_AVX2 as the default method only if both conditions are met:
at build time the compiler supports AVX2, and the target CPU supports AVX2.

Konstantin

>  You should still be able to
> check for AVX2 support in code that doesn't support emitting the instruction.
> 
> Neil
> 
> > Correct?
> > Konstantin
> >
> > >
> > > > >
> > > > > > >  That in turn I think allows you to remove a
> > > > > > > bunch of the ifdeffing that you've done in some of the avx2 specific files.
> > > > > >
> > > > > > Actually there are not many of them.
> > > > > > One in acl_run_avx2.h and another in acl_run_avx2.c.
> > > > > >
> > > > > 2 in acl_run_avx2.h and 1 in rte_acl_osdep_alone.h, which is really 3 more than
> > > > > you need if you just do an intelligent weak classifier function definition.
> > > >
> > > > grep -n __AVX2__ lib/librte_acl/*.[c,h] | grep -v endif
> > > > lib/librte_acl/acl_run_avx2.c:45:#ifdef __AVX2__
> > > > lib/librte_acl/acl_run_avx2.h:36:#ifdef __AVX2__
> > > >
> > > > rte_acl_osdep_alone.h is a different story.
> > > > It needs to be there anyway, as in rte_common_vect.h.
> > > > In fact rte_acl_osdep_alone.h is only needed when RTE_LIBRTE_ACL_STANDALONE=y.
> > > > That comes from the old days, when we had to support building the librte_acl library without the rest of DPDK.
> > > > I think we don't need it anymore and plan to remove it.
> > > > Just thought it should be in a separate patch.
> > > > Konstantin
> > > >
> > > > >
> > > > > Neil
> > > >
> >
  
Ananyev, Konstantin Jan. 6, 2015, 9:57 a.m. UTC | #10
Hi Neil,
Any further comments on that one?
Konstantin

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Thursday, December 18, 2014 3:02 PM
> To: Neil Horman
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> 
> 
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Wednesday, December 17, 2014 8:28 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> >
> > On Wed, Dec 17, 2014 at 07:22:06PM +0000, Ananyev, Konstantin wrote:
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Wednesday, December 17, 2014 3:33 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > > >
> > > > On Tue, Dec 16, 2014 at 04:16:48PM +0000, Ananyev, Konstantin wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > Sent: Monday, December 15, 2014 8:21 PM
> > > > > > To: Ananyev, Konstantin
> > > > > > Cc: dev@dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > > > > >
> > > > > > On Mon, Dec 15, 2014 at 04:33:47PM +0000, Ananyev, Konstantin wrote:
> > > > > > > Hi Neil,
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > > > Sent: Monday, December 15, 2014 4:00 PM
> > > > > > > > To: Ananyev, Konstantin
> > > > > > > > Cc: dev@dpdk.org
> > > > > > > > Subject: Re: [dpdk-dev] [PATCH 10/17] librte_acl: add AVX2 as new rte_acl_classify() method
> > > > > > > >
> > > > > > > > On Sun, Dec 14, 2014 at 06:10:52PM +0000, Konstantin Ananyev wrote:
> > > > > > > > > Introduce new classify() method that uses AVX2 instructions.
> > > > > > > > > From my measurements:
> > > > > > > > > On HSW boards when processing >= 16 packets per call,
> > > > > > > > > AVX2 method outperforms its SSE counterpart by 10-25%,
> > > > > > > > > (depending on the ruleset).
> > > > > > > > > At runtime, this method is selected as default one on HW that supports AVX2.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > > > > > > > ---
> > > > > > > > >  lib/librte_acl/Makefile       |   9 +
> > > > > > > > >  lib/librte_acl/acl.h          |   4 +
> > > > > > > > >  lib/librte_acl/acl_run.h      |   2 +-
> > > > > > > > >  lib/librte_acl/acl_run_avx2.c |  58 +++++
> > > > > > > > >  lib/librte_acl/acl_run_avx2.h | 305 ++++++++++++++++++++++++
> > > > > > > > >  lib/librte_acl/acl_run_sse.c  | 537 +-----------------------------------------
> > > > > > > > >  lib/librte_acl/acl_run_sse.h  | 533 +++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >  lib/librte_acl/rte_acl.c      |   5 +-
> > > > > > > > >  lib/librte_acl/rte_acl.h      |   2 +
> > > > > > > > >  9 files changed, 917 insertions(+), 538 deletions(-)
> > > > > > > > >  create mode 100644 lib/librte_acl/acl_run_avx2.c
> > > > > > > > >  create mode 100644 lib/librte_acl/acl_run_avx2.h
> > > > > > > > >  create mode 100644 lib/librte_acl/acl_run_sse.h
> > > > > > > > >
> > > > > > > > > diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> > > > > > > > > index 65e566d..223ec31 100644
> > > > > > > > > --- a/lib/librte_acl/Makefile
> > > > > > > > > +++ b/lib/librte_acl/Makefile
> > > > > > > > > @@ -45,8 +45,17 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
> > > > > > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
> > > > > > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
> > > > > > > > >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> > > > > > > > > +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx2.c
> > > > > > > > >
> > > > > > > > >  CFLAGS_acl_run_sse.o += -msse4.1
> > > > > > > > > +ifeq ($(CC), icc)
> > > > > > > > > +CFLAGS_acl_run_avx2.o += -march=core-avx2
> > > > > > > > > +else ifneq ($(shell \
> > > > > > > > > +test $(GCC_MAJOR_VERSION) -le 4 -a $(GCC_MINOR_VERSION) -le 6 && echo 1), 1)
> > > > > > > > > +CFLAGS_acl_run_avx2.o += -mavx2
> > > > > > > > > +else
> > > > > > > > > +CFLAGS_acl_run_avx2.o += -msse4.1
> > > > > > > > > +endif
> > > > > > > > >
> > > > > > > > This seems broken.  You've unilaterally included acl_run_avx2.c in the build
> > > > > > > > list above, but only enable -mavx2 if the compiler is at least gcc 4.6.
> > > > > > >
> > > > > > > Actually 4.7 (before that version, as far as I know, gcc doesn't support AVX2)
> > > > > > >
> > > > > > > >  Unless
> > > > > > > > you want to make gcc 4.6 a requirement for building,
> > > > > > >
> > > > > > > I believe DPDK is required to be buildable by gcc 4.6.
> > > > > > > As I remember, we have to support it all the way down to gcc 4.3.
> > > > > > >
> > > > > > > > you need to also exclude
> > > > > > > > the file above from the build list.
> > > > > > >
> > > > > > That means that for gcc 4.6 and below rte_acl_classify_avx2() would not be defined.
> > > > > > And then at runtime, I would have to check for that somehow and (re)populate classify_fns[].
> > > > > > That doesn't seem like a good way to me.
> > > > > > There are plenty of ways around that.
> > > > > >
> > > > > > At a minimum you could make the classify_fns array the one place that you need
> > > > > > to add an ifdef __AVX__ call.
> > > > > >
> > > > > > You could also create a secondary definition of rte_acl_classify_avx2, and mark
> > > > > > it as a weak symbol, which only returns -EOPNOTSUPP.  That would be good, since
> > > > > > the right thing will just automatically happen then if you don't build the
> > > > > > actual avx2 classification code
> > > > > >
> > > > > > > Instead, I prefer to always build acl_run_avx2.c,
> > > > >
> > > > >
> > > > > > But you can't do that.  You just said above that you need to support down to gcc
> > > > > > 4.3.  I see you've worked around that with some additional ifdef __AVX__
> > > > > > > instructions, but in so doing you ignore the possibility that sse isn't
> > > > > > supported, so you need to add __SSE__ checks now as well.  ifdeffing that much
> > > > > > just isn't scalable.
  
Neil Horman Jan. 6, 2015, 12:40 p.m. UTC | #11
On Tue, Jan 06, 2015 at 09:57:40AM +0000, Ananyev, Konstantin wrote:
> 
> Hi Neil,
> Any further comments on that one?
> Konstantin
> 
No, I'm good.  Your comment regarding compiler support makes sense (though it's
really unfortunate that we have to do that).  You still need to address the
ifdeffery around the method array, however.
Neil

  

Patch

diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 65e566d..223ec31 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -45,8 +45,17 @@  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx2.c
 
 CFLAGS_acl_run_sse.o += -msse4.1
+ifeq ($(CC), icc)
+CFLAGS_acl_run_avx2.o += -march=core-avx2
+else ifneq ($(shell \
+test $(GCC_MAJOR_VERSION) -le 4 -a $(GCC_MINOR_VERSION) -le 6 && echo 1), 1)
+CFLAGS_acl_run_avx2.o += -mavx2
+else
+CFLAGS_acl_run_avx2.o += -msse4.1
+endif
 
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 96bb318..d33d7ad 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -196,6 +196,10 @@  int
 rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
+int
+rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
 #ifdef __cplusplus
 }
 #endif /* __cplusplus */
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
index 4c843c1..850bc81 100644
--- a/lib/librte_acl/acl_run.h
+++ b/lib/librte_acl/acl_run.h
@@ -35,9 +35,9 @@ 
 #define	_ACL_RUN_H_
 
 #include <rte_acl.h>
-#include "acl_vect.h"
 #include "acl.h"
 
+#define MAX_SEARCHES_AVX16	16
 #define MAX_SEARCHES_SSE8	8
 #define MAX_SEARCHES_SSE4	4
 #define MAX_SEARCHES_SSE2	2
diff --git a/lib/librte_acl/acl_run_avx2.c b/lib/librte_acl/acl_run_avx2.c
new file mode 100644
index 0000000..8419d5d
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx2.c
@@ -0,0 +1,58 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+
+#include "acl_run_avx2.h"
+
+/*
+ * If compiled with a compiler that doesn't support AVX2 instructions,
+ * this function will be identical to rte_acl_classify_sse.
+ */
+int
+rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+#ifdef __AVX2__
+	if (likely(num >= MAX_SEARCHES_AVX16))
+		return search_avx2x16(ctx, data, results, num, categories);
+	else if (num >= MAX_SEARCHES_SSE8)
+#else
+	if (likely(num >= MAX_SEARCHES_SSE8))
+#endif /* __AVX2__ */
+		return search_sse_8(ctx, data, results, num, categories);
+	else if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+	else
+		return search_sse_2(ctx, data, results, num,
+			categories);
+}
diff --git a/lib/librte_acl/acl_run_avx2.h b/lib/librte_acl/acl_run_avx2.h
new file mode 100644
index 0000000..f679030
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx2.h
@@ -0,0 +1,305 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run_sse.h"
+
+#ifdef __AVX2__
+
+static const rte_ymm_t ymm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_ymm_t ymm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_ymm_t ymm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const rte_ymm_t ymm_ones_16 = {
+	.u16 = {
+		1, 1, 1, 1, 1, 1, 1, 1,
+		1, 1, 1, 1, 1, 1, 1, 1,
+	},
+};
+
+static inline __attribute__((always_inline)) ymm_t
+calc_addr_avx2(ymm_t index_mask, ymm_t next_input, ymm_t shuffle_input,
+	ymm_t ones_16, ymm_t tr_lo, ymm_t tr_hi)
+{
+	ymm_t in, node_type, r, t;
+	ymm_t dfa_msk, dfa_ofs, quad_ofs;
+	ymm_t addr;
+
+	const ymm_t range_base = _mm256_set_epi32(
+		0xffffff0c, 0xffffff08, 0xffffff04, 0xffffff00,
+		0xffffff0c, 0xffffff08, 0xffffff04, 0xffffff00);
+
+	t = _mm256_xor_si256(index_mask, index_mask);
+	in = _mm256_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm256_andnot_si256(index_mask, tr_lo);
+	addr = _mm256_and_si256(index_mask, tr_lo);
+
+	/* DFA calculations. */
+
+	dfa_msk = _mm256_cmpeq_epi32(node_type, t);
+
+	r = _mm256_srli_epi32(in, 30);
+	r = _mm256_add_epi8(r, range_base);
+
+	t = _mm256_srli_epi32(in, 24);
+	r = _mm256_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm256_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+
+	t = _mm256_cmpgt_epi8(in, tr_hi);
+	t = _mm256_sign_epi8(t, t);
+	t = _mm256_maddubs_epi16(t, t);
+	quad_ofs = _mm256_madd_epi16(t, ones_16);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm256_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
+
+	addr = _mm256_add_epi32(addr, t);
+	return addr;
+}
+
+static inline __attribute__((always_inline)) ymm_t
+transition8(ymm_t next_input, const uint64_t *trans, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	const int32_t *tr;
+	ymm_t addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	addr = calc_addr_avx2(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
+		ymm_ones_16.y, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 8 transitions at once. */
+	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
+
+	next_input = _mm256_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 8 transitions at once. */
+	*tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0]));
+
+	return next_input;
+}
+
+static inline void
+acl_process_matches_avx2x8(const struct rte_acl_ctx *ctx,
+	struct parms *parms, struct acl_flow_data *flows, uint32_t slot,
+	ymm_t matches, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	ymm_t t0, t1;
+	ymm_t lo, hi;
+	xmm_t l0, l1;
+	uint32_t i;
+	uint64_t tr[MAX_SEARCHES_SSE8];
+
+	l1 = _mm256_extracti128_si256(*tr_lo, 1);
+	l0 = _mm256_castsi256_si128(*tr_lo);
+
+	for (i = 0; i != RTE_DIM(tr) / 2; i++) {
+		tr[i] = (uint32_t)_mm_cvtsi128_si32(l0);
+		tr[i + 4] = (uint32_t)_mm_cvtsi128_si32(l1);
+
+		l0 = _mm_srli_si128(l0, sizeof(uint32_t));
+		l1 = _mm_srli_si128(l1, sizeof(uint32_t));
+
+		tr[i] = acl_match_check(tr[i], slot + i,
+			ctx, parms, flows, resolve_priority_sse);
+		tr[i + 4] = acl_match_check(tr[i + 4], slot + i + 4,
+			ctx, parms, flows, resolve_priority_sse);
+	}
+
+	t0 = _mm256_set_epi64x(tr[5], tr[4], tr[1], tr[0]);
+	t1 = _mm256_set_epi64x(tr[7], tr[6], tr[3], tr[2]);
+
+	lo = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0x88);
+	hi = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0xdd);
+
+	*tr_lo = _mm256_blendv_epi8(*tr_lo, lo, matches);
+	*tr_hi = _mm256_blendv_epi8(*tr_hi, hi, matches);
+}
+
+static inline void
+acl_match_check_avx2x8(const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, uint32_t slot,
+	ymm_t *tr_lo, ymm_t *tr_hi, ymm_t match_mask)
+{
+	uint32_t msk;
+	ymm_t matches, temp;
+
+	/* test for match node */
+	temp = _mm256_and_si256(match_mask, *tr_lo);
+	matches = _mm256_cmpeq_epi32(temp, match_mask);
+	msk = _mm256_movemask_epi8(matches);
+
+	while (msk != 0) {
+
+		acl_process_matches_avx2x8(ctx, parms, flows, slot,
+			matches, tr_lo, tr_hi);
+		temp = _mm256_and_si256(match_mask, *tr_lo);
+		matches = _mm256_cmpeq_epi32(temp, match_mask);
+		msk = _mm256_movemask_epi8(matches);
+	}
+}
+
+static inline int
+search_avx2x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_AVX16];
+	struct completion cmplt[MAX_SEARCHES_AVX16];
+	struct parms parms[MAX_SEARCHES_AVX16];
+	ymm_t input[2], tr_lo[2], tr_hi[2];
+	ymm_t t0, t1;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < RTE_DIM(cmplt); n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	t0 = _mm256_set_epi64x(index_array[5], index_array[4],
+		index_array[1], index_array[0]);
+	t1 = _mm256_set_epi64x(index_array[7], index_array[6],
+		index_array[3], index_array[2]);
+
+	tr_lo[0] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0x88);
+	tr_hi[0] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0xdd);
+
+	t0 = _mm256_set_epi64x(index_array[13], index_array[12],
+		index_array[9], index_array[8]);
+	t1 = _mm256_set_epi64x(index_array[15], index_array[14],
+		index_array[11], index_array[10]);
+
+	tr_lo[1] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0x88);
+	tr_hi[1] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0xdd);
+
+	 /* Check for any matches. */
+	acl_match_check_avx2x8(ctx, parms, &flows, 0, &tr_lo[0], &tr_hi[0],
+		ymm_match_mask.y);
+	acl_match_check_avx2x8(ctx, parms, &flows, 8, &tr_lo[1], &tr_hi[1],
+		ymm_match_mask.y);
+
+	while (flows.started > 0) {
+
+		uint32_t in[MAX_SEARCHES_SSE8];
+
+		/* Gather 4 bytes of input data for first 8 flows. */
+		in[0] = GET_NEXT_4BYTES(parms, 0);
+		in[4] = GET_NEXT_4BYTES(parms, 4);
+		in[1] = GET_NEXT_4BYTES(parms, 1);
+		in[5] = GET_NEXT_4BYTES(parms, 5);
+		in[2] = GET_NEXT_4BYTES(parms, 2);
+		in[6] = GET_NEXT_4BYTES(parms, 6);
+		in[3] = GET_NEXT_4BYTES(parms, 3);
+		in[7] = GET_NEXT_4BYTES(parms, 7);
+		input[0] = _mm256_set_epi32(in[7], in[6], in[5], in[4],
+			in[3], in[2], in[1], in[0]);
+
+		/* Gather 4 bytes of input data for last 8 flows. */
+		in[0] = GET_NEXT_4BYTES(parms, 8);
+		in[4] = GET_NEXT_4BYTES(parms, 12);
+		in[1] = GET_NEXT_4BYTES(parms, 9);
+		in[5] = GET_NEXT_4BYTES(parms, 13);
+		in[2] = GET_NEXT_4BYTES(parms, 10);
+		in[6] = GET_NEXT_4BYTES(parms, 14);
+		in[3] = GET_NEXT_4BYTES(parms, 11);
+		in[7] = GET_NEXT_4BYTES(parms, 15);
+		input[1] = _mm256_set_epi32(in[7], in[6], in[5], in[4],
+			in[3], in[2], in[1], in[0]);
+
+		input[0] = transition8(input[0], flows.trans,
+			&tr_lo[0], &tr_hi[0]);
+		input[1] = transition8(input[1], flows.trans,
+			&tr_lo[1], &tr_hi[1]);
+
+		input[0] = transition8(input[0], flows.trans,
+			&tr_lo[0], &tr_hi[0]);
+		input[1] = transition8(input[1], flows.trans,
+			&tr_lo[1], &tr_hi[1]);
+
+		input[0] = transition8(input[0], flows.trans,
+			&tr_lo[0], &tr_hi[0]);
+		input[1] = transition8(input[1], flows.trans,
+			&tr_lo[1], &tr_hi[1]);
+
+		input[0] = transition8(input[0], flows.trans,
+			&tr_lo[0], &tr_hi[0]);
+		input[1] = transition8(input[1], flows.trans,
+			&tr_lo[1], &tr_hi[1]);
+
+		 /* Check for any matches. */
+		acl_match_check_avx2x8(ctx, parms, &flows, 0,
+			&tr_lo[0], &tr_hi[0], ymm_match_mask.y);
+		acl_match_check_avx2x8(ctx, parms, &flows, 8,
+			&tr_lo[1], &tr_hi[1], ymm_match_mask.y);
+	}
+
+	return 0;
+}
+
+#endif /* __AVX2__ */
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
index 4605b58..77b32b3 100644
--- a/lib/librte_acl/acl_run_sse.c
+++ b/lib/librte_acl/acl_run_sse.c
@@ -31,542 +31,7 @@ 
  *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include "acl_run.h"
-
-enum {
-	SHUFFLE32_SLOT1 = 0xe5,
-	SHUFFLE32_SLOT2 = 0xe6,
-	SHUFFLE32_SLOT3 = 0xe7,
-	SHUFFLE32_SWAP64 = 0x4e,
-};
-
-static const rte_xmm_t mm_shuffle_input = {
-	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
-};
-
-static const rte_xmm_t mm_shuffle_input64 = {
-	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
-};
-
-static const rte_xmm_t mm_ones_16 = {
-	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
-};
-
-static const rte_xmm_t mm_match_mask = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-	},
-};
-
-static const rte_xmm_t mm_match_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		0,
-		RTE_ACL_NODE_MATCH,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_index_mask = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-	},
-};
-
-static const rte_xmm_t mm_index_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		0,
-		0,
-	},
-};
-
-
-/*
- * Resolve priority for multiple results (sse version).
- * This consists comparing the priority of the current traversal with the
- * running set of results for the packet.
- * For each result, keep a running array of the result (rule number) and
- * its priority for each category.
- */
-static inline void
-resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
-	struct parms *parms, const struct rte_acl_match_results *p,
-	uint32_t categories)
-{
-	uint32_t x;
-	xmm_t results, priority, results1, priority1, selector;
-	xmm_t *saved_results, *saved_priority;
-
-	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
-
-		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
-		saved_priority =
-			(xmm_t *)(&parms[n].cmplt->priority[x]);
-
-		/* get results and priorities for completed trie */
-		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
-		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
-
-		/* if this is not the first completed trie */
-		if (parms[n].cmplt->count != ctx->num_tries) {
-
-			/* get running best results and their priorities */
-			results1 = MM_LOADU(saved_results);
-			priority1 = MM_LOADU(saved_priority);
-
-			/* select results that are highest priority */
-			selector = MM_CMPGT32(priority1, priority);
-			results = MM_BLENDV8(results, results1, selector);
-			priority = MM_BLENDV8(priority, priority1, selector);
-		}
-
-		/* save running best results and their priorities */
-		MM_STOREU(saved_results, results);
-		MM_STOREU(saved_priority, priority);
-	}
-}
-
-/*
- * Extract transitions from an XMM register and check for any matches
- */
-static void
-acl_process_matches(xmm_t *indices, int slot, const struct rte_acl_ctx *ctx,
-	struct parms *parms, struct acl_flow_data *flows)
-{
-	uint64_t transition1, transition2;
-
-	/* extract transition from low 64 bits. */
-	transition1 = MM_CVT64(*indices);
-
-	/* extract transition from high 64 bits. */
-	*indices = MM_SHUFFLE32(*indices, SHUFFLE32_SWAP64);
-	transition2 = MM_CVT64(*indices);
-
-	transition1 = acl_match_check(transition1, slot, ctx,
-		parms, flows, resolve_priority_sse);
-	transition2 = acl_match_check(transition2, slot + 1, ctx,
-		parms, flows, resolve_priority_sse);
-
-	/* update indices with new transitions. */
-	*indices = MM_SET64(transition2, transition1);
-}
-
-/*
- * Check for a match in 2 transitions (contained in SSE register)
- */
-static inline void
-acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indices, xmm_t match_mask)
-{
-	xmm_t temp;
-
-	temp = MM_AND(match_mask, *indices);
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indices, slot, ctx, parms, flows);
-		temp = MM_AND(match_mask, *indices);
-	}
-}
-
-/*
- * Check for any match in 4 transitions (contained in 2 SSE registers)
- */
-static inline void
-acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indices1, xmm_t *indices2,
-	xmm_t match_mask)
-{
-	xmm_t temp;
-
-	/* put low 32 bits of each transition into one register */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1, (__m128)*indices2,
-		0x88);
-	/* test for match node */
-	temp = MM_AND(match_mask, temp);
-
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indices1, slot, ctx, parms, flows);
-		acl_process_matches(indices2, slot + 2, ctx, parms, flows);
-
-		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1,
-					(__m128)*indices2,
-					0x88);
-		temp = MM_AND(match_mask, temp);
-	}
-}
-
-/*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes don't move.
- */
-static inline xmm_t
-acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t indices1, xmm_t indices2)
-{
-	xmm_t addr, node_types, range, temp;
-	xmm_t dfa_msk, dfa_ofs, quad_ofs;
-	xmm_t in, r, t;
-
-	const xmm_t range_base = _mm_set_epi32(0xffffff0c, 0xffffff08,
-		0xffffff04, 0xffffff00);
-
-	/*
-	 * Note that no transition is done for a match
-	 * node and therefore a stream freezes when
-	 * it reaches a match.
-	 */
-
-	/* Shuffle low 32 into temp and high 32 into indices2 */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0x88);
-	range = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0xdd);
-
-	t = MM_XOR(index_mask, index_mask);
-
-	/* shuffle input byte to all 4 positions of 32 bit value */
-	in = MM_SHUFFLE8(next_input, shuffle_input);
-
-	/* Calc node type and node addr */
-	node_types = MM_ANDNOT(index_mask, temp);
-	addr = MM_AND(index_mask, temp);
-
-	/*
-	 * Calc addr for DFAs - addr = dfa_index + input_byte
-	 */
-
-	/* mask for DFA type (0) nodes */
-	dfa_msk = MM_CMPEQ32(node_types, t);
-
-	r = _mm_srli_epi32(in, 30);
-	r = _mm_add_epi8(r, range_base);
-
-	t = _mm_srli_epi32(in, 24);
-	r = _mm_shuffle_epi8(range, r);
-
-	dfa_ofs = _mm_sub_epi32(t, r);
-
-	/*
-	 * Calculate number of range boundaries that are less than the
-	 * input value. Range boundaries for each node are in signed 8 bit,
-	 * ordered from -128 to 127 in the indices2 register.
-	 * This is effectively a popcnt of bytes that are greater than the
-	 * input byte.
-	 */
-
-	/* check ranges */
-	temp = MM_CMPGT8(in, range);
-
-	/* convert -1 to 1 (bytes greater than input byte */
-	temp = MM_SIGN8(temp, temp);
-
-	/* horizontal add pairs of bytes into words */
-	temp = MM_MADD8(temp, temp);
-
-	/* horizontal add pairs of words into dwords */
-	quad_ofs = MM_MADD16(temp, ones_16);
-
-	/* mask to range type nodes */
-	temp = _mm_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
-
-	/* add index into node position */
-	return MM_ADD32(addr, temp);
-}
-
-/*
- * Process 4 transitions (in 2 SIMD registers) in parallel
- */
-static inline xmm_t
-transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, const uint64_t *trans,
-	xmm_t *indices1, xmm_t *indices2)
-{
-	xmm_t addr;
-	uint64_t trans0, trans2;
-
-	 /* Calculate the address (array index) for all 4 transitions. */
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		*indices1, *indices2);
-
-	 /* Gather 64 bit transitions and pack back into 2 registers. */
-
-	trans0 = trans[MM_CVT32(addr)];
-
-	/* get slot 2 */
-
-	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
-	trans2 = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-
-	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indices1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
-
-	/* get slot 3 */
-
-	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
-	*indices2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
-
-	return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 8 traversals in parallel
- */
-static inline int
-search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE8];
-	struct completion cmplt[MAX_SEARCHES_SSE8];
-	struct parms parms[MAX_SEARCHES_SSE8];
-	xmm_t input0, input1;
-	xmm_t indices1, indices2, indices3, indices4;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	/*
-	 * indices1 contains index_array[0,1]
-	 * indices2 contains index_array[2,3]
-	 * indices3 contains index_array[4,5]
-	 * indices4 contains index_array[6,7]
-	 */
-
-	indices1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indices2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	indices3 = MM_LOADU((xmm_t *) &index_array[4]);
-	indices4 = MM_LOADU((xmm_t *) &index_array[6]);
-
-	 /* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indices1, &indices2, mm_match_mask.x);
-	acl_match_check_x4(4, ctx, parms, &flows,
-		&indices3, &indices4, mm_match_mask.x);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input0 = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 0),
-			0);
-		input1 = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 4),
-			0);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
-
-		 /* Process the 4 bytes of input on each stream. */
-
-		input0 = transition4(mm_index_mask.x, input0,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		input1 = transition4(mm_index_mask.x, input1,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices3, &indices4);
-
-		input0 = transition4(mm_index_mask.x, input0,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		input1 = transition4(mm_index_mask.x, input1,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices3, &indices4);
-
-		input0 = transition4(mm_index_mask.x, input0,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		input1 = transition4(mm_index_mask.x, input1,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices3, &indices4);
-
-		input0 = transition4(mm_index_mask.x, input0,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		input1 = transition4(mm_index_mask.x, input1,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices3, &indices4);
-
-		 /* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indices1, &indices2, mm_match_mask.x);
-		acl_match_check_x4(4, ctx, parms, &flows,
-			&indices3, &indices4, mm_match_mask.x);
-	}
-
-	return 0;
-}
-
-/*
- * Execute trie traversal with 4 traversals in parallel
- */
-static inline int
-search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	 uint32_t *results, int total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE4];
-	struct completion cmplt[MAX_SEARCHES_SSE4];
-	struct parms parms[MAX_SEARCHES_SSE4];
-	xmm_t input, indices1, indices2;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indices1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indices2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	/* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indices1, &indices2, mm_match_mask.x);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
-
-		/* Process the 4 bytes of input on each stream. */
-		input = transition4(mm_index_mask.x, input,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		 input = transition4(mm_index_mask.x, input,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		 input = transition4(mm_index_mask.x, input,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		 input = transition4(mm_index_mask.x, input,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		/* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indices1, &indices2, mm_match_mask.x);
-	}
-
-	return 0;
-}
-
-static inline xmm_t
-transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, const uint64_t *trans, xmm_t *indices1)
-{
-	uint64_t t;
-	xmm_t addr, indices2;
-
-	indices2 = MM_XOR(ones_16, ones_16);
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		*indices1, indices2);
-
-	/* Gather 64 bit transitions and pack 2 per register. */
-
-	t = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indices1 = MM_SET64(trans[MM_CVT32(addr)], t);
-
-	return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 2 traversals in parallel.
- */
-static inline int
-search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE2];
-	struct completion cmplt[MAX_SEARCHES_SSE2];
-	struct parms parms[MAX_SEARCHES_SSE2];
-	xmm_t input, indices;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indices = MM_LOADU((xmm_t *) &index_array[0]);
-
-	/* Check for any matches. */
-	acl_match_check_x2(0, ctx, parms, &flows, &indices, mm_match_mask64.x);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-
-		/* Process the 4 bytes of input on each stream. */
-
-		input = transition2(mm_index_mask64.x, input,
-			mm_shuffle_input64.x, mm_ones_16.x,
-			flows.trans, &indices);
-
-		input = transition2(mm_index_mask64.x, input,
-			mm_shuffle_input64.x, mm_ones_16.x,
-			flows.trans, &indices);
-
-		input = transition2(mm_index_mask64.x, input,
-			mm_shuffle_input64.x, mm_ones_16.x,
-			flows.trans, &indices);
-
-		input = transition2(mm_index_mask64.x, input,
-			mm_shuffle_input64.x, mm_ones_16.x,
-			flows.trans, &indices);
-
-		/* Check for any matches. */
-		acl_match_check_x2(0, ctx, parms, &flows, &indices,
-			mm_match_mask64.x);
-	}
-
-	return 0;
-}
+#include "acl_run_sse.h"
 
 int
 rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
diff --git a/lib/librte_acl/acl_run_sse.h b/lib/librte_acl/acl_run_sse.h
new file mode 100644
index 0000000..e33e16b
--- /dev/null
+++ b/lib/librte_acl/acl_run_sse.h
@@ -0,0 +1,533 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+#include "acl_vect.h"
+
+enum {
+	SHUFFLE32_SLOT1 = 0xe5,
+	SHUFFLE32_SLOT2 = 0xe6,
+	SHUFFLE32_SLOT3 = 0xe7,
+	SHUFFLE32_SWAP64 = 0x4e,
+};
+
+static const rte_xmm_t xmm_shuffle_input = {
+	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
+};
+
+static const rte_xmm_t xmm_shuffle_input64 = {
+	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
+};
+
+static const rte_xmm_t xmm_ones_16 = {
+	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
+};
+
+static const rte_xmm_t xmm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_xmm_t xmm_match_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		0,
+		RTE_ACL_NODE_MATCH,
+		0,
+	},
+};
+
+static const rte_xmm_t xmm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_xmm_t xmm_index_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		0,
+		0,
+	},
+};
+
+
+/*
+ * Resolve priority for multiple results (sse version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories)
+{
+	uint32_t x;
+	xmm_t results, priority, results1, priority1, selector;
+	xmm_t *saved_results, *saved_priority;
+
+	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
+
+		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
+		saved_priority =
+			(xmm_t *)(&parms[n].cmplt->priority[x]);
+
+		/* get results and priorities for completed trie */
+		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
+		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
+
+		/* if this is not the first completed trie */
+		if (parms[n].cmplt->count != ctx->num_tries) {
+
+			/* get running best results and their priorities */
+			results1 = MM_LOADU(saved_results);
+			priority1 = MM_LOADU(saved_priority);
+
+			/* select results that are highest priority */
+			selector = MM_CMPGT32(priority1, priority);
+			results = MM_BLENDV8(results, results1, selector);
+			priority = MM_BLENDV8(priority, priority1, selector);
+		}
+
+		/* save running best results and their priorities */
+		MM_STOREU(saved_results, results);
+		MM_STOREU(saved_priority, priority);
+	}
+}
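The CMPGT32/BLENDV pair above reduces to a simple per-category compare. A scalar sketch (illustrative only; the function name and flat arrays are stand-ins for the xmm loads/stores in the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scalar sketch of the per-category blend above: for each category,
 * keep whichever (result, priority) pair has the higher priority.
 * Ties go to the newly completed trie, matching the
 * CMPGT32(saved, new)/BLENDV select in resolve_priority_sse().
 */
static void
resolve_priority_scalar(uint32_t *res, int32_t *pri,
	const uint32_t *new_res, const int32_t *new_pri, uint32_t categories)
{
	uint32_t i;

	for (i = 0; i < categories; i++) {
		if (new_pri[i] >= pri[i]) {
			res[i] = new_res[i];
			pri[i] = new_pri[i];
		}
	}
}
```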
+
+/*
+ * Extract transitions from an XMM register and check for any matches
+ */
+static void
+acl_process_matches(xmm_t *indices, int slot, const struct rte_acl_ctx *ctx,
+	struct parms *parms, struct acl_flow_data *flows)
+{
+	uint64_t transition1, transition2;
+
+	/* extract transition from low 64 bits. */
+	transition1 = MM_CVT64(*indices);
+
+	/* extract transition from high 64 bits. */
+	*indices = MM_SHUFFLE32(*indices, SHUFFLE32_SWAP64);
+	transition2 = MM_CVT64(*indices);
+
+	transition1 = acl_match_check(transition1, slot, ctx,
+		parms, flows, resolve_priority_sse);
+	transition2 = acl_match_check(transition2, slot + 1, ctx,
+		parms, flows, resolve_priority_sse);
+
+	/* update indices with new transitions. */
+	*indices = MM_SET64(transition2, transition1);
+}
+
+/*
+ * Check for a match in 2 transitions (contained in one SSE register)
+ */
+static inline __attribute__((always_inline)) void
+acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indices, xmm_t match_mask)
+{
+	xmm_t temp;
+
+	temp = MM_AND(match_mask, *indices);
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indices, slot, ctx, parms, flows);
+		temp = MM_AND(match_mask, *indices);
+	}
+}
+
+/*
+ * Check for any match in 4 transitions (contained in 2 SSE registers)
+ */
+static inline __attribute__((always_inline)) void
+acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indices1, xmm_t *indices2,
+	xmm_t match_mask)
+{
+	xmm_t temp;
+
+	/* put low 32 bits of each transition into one register */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1, (__m128)*indices2,
+		0x88);
+	/* test for match node */
+	temp = MM_AND(match_mask, temp);
+
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indices1, slot, ctx, parms, flows);
+		acl_process_matches(indices2, slot + 2, ctx, parms, flows);
+
+		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1,
+					(__m128)*indices2,
+					0x88);
+		temp = MM_AND(match_mask, temp);
+	}
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes don't move.
+ */
+static inline __attribute__((always_inline)) xmm_t
+calc_addr_sse(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t indices1, xmm_t indices2)
+{
+	xmm_t addr, node_types, range, temp;
+	xmm_t dfa_msk, dfa_ofs, quad_ofs;
+	xmm_t in, r, t;
+
+	const xmm_t range_base = _mm_set_epi32(0xffffff0c, 0xffffff08,
+		0xffffff04, 0xffffff00);
+
+	/*
+	 * Note that no transition is done for a match
+	 * node and therefore a stream freezes when
+	 * it reaches a match.
+	 */
+
+	/* Shuffle low 32 bits of each transition into temp, high 32 into range */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0x88);
+	range = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0xdd);
+
+	t = MM_XOR(index_mask, index_mask);
+
+	/* shuffle input byte to all 4 positions of 32 bit value */
+	in = MM_SHUFFLE8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_types = MM_ANDNOT(index_mask, temp);
+	addr = MM_AND(index_mask, temp);
+
+	/*
+	 * Calc addr for DFAs - addr = dfa_index + input_byte
+	 */
+
+	/* mask for DFA type (0) nodes */
+	dfa_msk = MM_CMPEQ32(node_types, t);
+
+	r = _mm_srli_epi32(in, 30);
+	r = _mm_add_epi8(r, range_base);
+
+	t = _mm_srli_epi32(in, 24);
+	r = _mm_shuffle_epi8(range, r);
+
+	dfa_ofs = _mm_sub_epi32(t, r);
+
+	/*
+	 * Calculate the number of range boundaries that are less than the
+	 * input value. Range boundaries for each node are signed 8-bit values,
+	 * ordered from -128 to 127, held in the range register.
+	 * This is effectively a popcnt of the range bytes that are less than
+	 * the input byte.
+	 */
+
+	/* check ranges */
+	temp = MM_CMPGT8(in, range);
+
+	/* convert -1 to 1 (range bytes less than the input byte) */
+	temp = MM_SIGN8(temp, temp);
+
+	/* horizontal add pairs of bytes into words */
+	temp = MM_MADD8(temp, temp);
+
+	/* horizontal add pairs of words into dwords */
+	quad_ofs = MM_MADD16(temp, ones_16);
+
+	/* mask to range type nodes */
+	temp = _mm_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
+
+	/* add index into node position */
+	return MM_ADD32(addr, temp);
+}
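The CMPGT8/SIGN8/MADD sequence above is just a count. A scalar sketch (illustrative only; the helper name is made up) of the quad-node offset it computes per 32-bit lane:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scalar equivalent (sketch) of the quad-node offset computed above:
 * count how many of the four signed 8-bit range boundaries are less
 * than the (signed) input byte. The SIMD path gets the same number
 * via CMPGT8 (-1 per smaller boundary), SIGN8 (-1 -> 1), then two
 * horizontal adds (MADD8, MADD16).
 */
static uint32_t
quad_offset_scalar(uint8_t input, const int8_t ranges[4])
{
	uint32_t ofs = 0;
	int i;

	for (i = 0; i < 4; i++)
		ofs += ((int8_t)input > ranges[i]);
	return ofs;
}
```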
+
+/*
+ * Process 4 transitions (in 2 SIMD registers) in parallel
+ */
+static inline __attribute__((always_inline)) xmm_t
+transition4(xmm_t next_input, const uint64_t *trans,
+	xmm_t *indices1, xmm_t *indices2)
+{
+	xmm_t addr;
+	uint64_t trans0, trans2;
+
+	 /* Calculate the address (array index) for all 4 transitions. */
+
+	addr = calc_addr_sse(xmm_index_mask.x, next_input, xmm_shuffle_input.x,
+		xmm_ones_16.x, *indices1, *indices2);
+
+	 /* Gather 64 bit transitions and pack back into 2 registers. */
+
+	trans0 = trans[MM_CVT32(addr)];
+
+	/* get slot 2 */
+
+	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
+	trans2 = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+
+	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indices1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
+
+	/* get slot 3 */
+
+	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
+	*indices2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
+
+	return MM_SRL32(next_input, CHAR_BIT);
+}
+
+/*
+ * Execute trie traversal with 8 traversals in parallel
+ */
+static inline int
+search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE8];
+	struct completion cmplt[MAX_SEARCHES_SSE8];
+	struct parms parms[MAX_SEARCHES_SSE8];
+	xmm_t input0, input1;
+	xmm_t indices1, indices2, indices3, indices4;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	/*
+	 * indices1 contains index_array[0,1]
+	 * indices2 contains index_array[2,3]
+	 * indices3 contains index_array[4,5]
+	 * indices4 contains index_array[6,7]
+	 */
+
+	indices1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indices2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	indices3 = MM_LOADU((xmm_t *) &index_array[4]);
+	indices4 = MM_LOADU((xmm_t *) &index_array[6]);
+
+	 /* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indices1, &indices2, xmm_match_mask.x);
+	acl_match_check_x4(4, ctx, parms, &flows,
+		&indices3, &indices4, xmm_match_mask.x);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input0 = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 0));
+		input1 = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 4));
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
+
+		 /* Process the 4 bytes of input on each stream. */
+
+		input0 = transition4(input0, flows.trans,
+			&indices1, &indices2);
+		input1 = transition4(input1, flows.trans,
+			&indices3, &indices4);
+
+		input0 = transition4(input0, flows.trans,
+			&indices1, &indices2);
+		input1 = transition4(input1, flows.trans,
+			&indices3, &indices4);
+
+		input0 = transition4(input0, flows.trans,
+			&indices1, &indices2);
+		input1 = transition4(input1, flows.trans,
+			&indices3, &indices4);
+
+		input0 = transition4(input0, flows.trans,
+			&indices1, &indices2);
+		input1 = transition4(input1, flows.trans,
+			&indices3, &indices4);
+
+		 /* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indices1, &indices2, xmm_match_mask.x);
+		acl_match_check_x4(4, ctx, parms, &flows,
+			&indices3, &indices4, xmm_match_mask.x);
+	}
+
+	return 0;
+}
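The four back-to-back transition4() calls in the loop above drain the 4 bytes gathered per stream: each call feeds the low byte of every packed 32-bit lane into calc_addr and then shifts the lane right. A scalar sketch of one lane (illustrative only; the helper name is made up):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Scalar sketch of how each transitionN() call consumes input: take
 * the low byte of a packed 32-bit lane (the byte fed into calc_addr),
 * then shift the lane right by CHAR_BIT (MM_SRL32(next_input,
 * CHAR_BIT)), so four calls consume all 4 gathered bytes.
 */
static uint32_t
consume_byte(uint32_t lane, uint8_t *byte)
{
	*byte = (uint8_t)lane;
	return lane >> CHAR_BIT;
}
```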
+
+/*
+ * Execute trie traversal with 4 traversals in parallel
+ */
+static inline int
+search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	 uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE4];
+	struct completion cmplt[MAX_SEARCHES_SSE4];
+	struct parms parms[MAX_SEARCHES_SSE4];
+	xmm_t input, indices1, indices2;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indices1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indices2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	/* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indices1, &indices2, xmm_match_mask.x);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 0));
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
+
+		/* Process the 4 bytes of input on each stream. */
+		input = transition4(input, flows.trans, &indices1, &indices2);
+		input = transition4(input, flows.trans, &indices1, &indices2);
+		input = transition4(input, flows.trans, &indices1, &indices2);
+		input = transition4(input, flows.trans, &indices1, &indices2);
+
+		/* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indices1, &indices2, xmm_match_mask.x);
+	}
+
+	return 0;
+}
+
+static inline __attribute__((always_inline)) xmm_t
+transition2(xmm_t next_input, const uint64_t *trans, xmm_t *indices1)
+{
+	uint64_t t;
+	xmm_t addr, indices2;
+
+	indices2 = _mm_setzero_si128();
+
+	addr = calc_addr_sse(xmm_index_mask.x, next_input, xmm_shuffle_input.x,
+		xmm_ones_16.x, *indices1, indices2);
+
+	/* Gather 64 bit transitions and pack 2 per register. */
+
+	t = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indices1 = MM_SET64(trans[MM_CVT32(addr)], t);
+
+	return MM_SRL32(next_input, CHAR_BIT);
+}
+
+/*
+ * Execute trie traversal with 2 traversals in parallel.
+ */
+static inline int
+search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE2];
+	struct completion cmplt[MAX_SEARCHES_SSE2];
+	struct parms parms[MAX_SEARCHES_SSE2];
+	xmm_t input, indices;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indices = MM_LOADU((xmm_t *) &index_array[0]);
+
+	/* Check for any matches. */
+	acl_match_check_x2(0, ctx, parms, &flows, &indices,
+		xmm_match_mask64.x);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 0));
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+
+		/* Process the 4 bytes of input on each stream. */
+
+		input = transition2(input, flows.trans, &indices);
+		input = transition2(input, flows.trans, &indices);
+		input = transition2(input, flows.trans, &indices);
+		input = transition2(input, flows.trans, &indices);
+
+		/* Check for any matches. */
+		acl_match_check_x2(0, ctx, parms, &flows, &indices,
+			xmm_match_mask64.x);
+	}
+
+	return 0;
+}
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index a16c4a4..2fa51cb 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -42,6 +42,7 @@  static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
 	[RTE_ACL_CLASSIFY_SCALAR] = rte_acl_classify_scalar,
 	[RTE_ACL_CLASSIFY_SSE] = rte_acl_classify_sse,
+	[RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
 };
 
 /* by default, use always available scalar code path. */
@@ -69,7 +70,9 @@  rte_acl_init(void)
 {
 	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
 
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		alg = RTE_ACL_CLASSIFY_AVX2;
+	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
 		alg = RTE_ACL_CLASSIFY_SSE;
 
 	rte_acl_set_default_classify(alg);
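The hunk above makes rte_acl_init() probe the best ISA first and fall back. A minimal sketch of that dispatch order (illustrative only; the enum and function names here are stand-ins, not the real RTE_CPUFLAG_* probing):

```c
#include <assert.h>

/* Stand-in for the rte_acl_classify_alg selection order above. */
enum alg { ALG_SCALAR, ALG_SSE, ALG_AVX2 };

/*
 * Pick the highest-ranked implementation the CPU supports:
 * AVX2 first, then SSE4.1, else the always-available scalar path.
 */
static enum alg
pick_alg(int has_sse41, int has_avx2)
{
	if (has_avx2)
		return ALG_AVX2;
	if (has_sse41)
		return ALG_SSE;
	return ALG_SCALAR;
}
```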
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index 0d913ee..652a234 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -265,6 +265,8 @@  enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_DEFAULT = 0,
 	RTE_ACL_CLASSIFY_SCALAR = 1,  /**< generic implementation. */
 	RTE_ACL_CLASSIFY_SSE = 2,     /**< requires SSE4.1 support. */
+	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
+	RTE_ACL_CLASSIFY_NUM          /* should always be the last one. */
 };
 
 /**