[1/3] acl: fix arm argument types

Message ID 20190408182420.4398-2-aconole@redhat.com (mailing list archive)
State Changes Requested, archived
Delegated to: Thomas Monjalon
Series librte_acl: fixes related to testing with the meson build

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/mellanox-Performance-Testing success Performance Testing PASS
ci/intel-Performance-Testing success Performance Testing PASS

Commit Message

Aaron Conole April 8, 2019, 6:24 p.m. UTC
  Compiler complains of argument type mismatch, like:

   ../lib/librte_acl/acl_run_neon.h: In function ‘transition4’:
   ../lib/librte_acl/acl_run_neon.h:115:2: note: use -flax-vector-conversions
      to permit conversions between vectors with differing element types
      or numbers of subparts
     node_type = vbicq_s32(tr_hi_lo.val[0], index_msk);
     ^
   ../lib/librte_acl/acl_run_neon.h:115:41: error: incompatible type for
      argument 2 of ‘vbicq_s32’

Signed-off-by: Aaron Conole <aconole@redhat.com>
---
 lib/librte_acl/acl_run_neon.h | 46 ++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 19 deletions(-)
  

Comments

Jerin Jacob Kollanukkaran April 10, 2019, 2:39 p.m. UTC | #1
On Mon, 2019-04-08 at 14:24 -0400, Aaron Conole wrote:
> -------------------------------------------------------------------
> ---
> Compiler complains of argument type mismatch, like:

Can you share more details on how to reproduce this issue?

We already have
CFLAGS_acl_run_neon.o += -flax-vector-conversions
in the Makefile.

If you are taking out -flax-vector-conversions, the correct way to
fix it is to use vreinterpret*.

The code looks cleaner to me if unnecessary casting is avoided.


> 
>    ../lib/librte_acl/acl_run_neon.h: In function ‘transition4’:
>    ../lib/librte_acl/acl_run_neon.h:115:2: note: use -flax-vector-
> conversions
>       to permit conversions between vectors with differing element
> types
>       or numbers of subparts
>      node_type = vbicq_s32(tr_hi_lo.val[0], index_msk);
>      ^
>    ../lib/librte_acl/acl_run_neon.h:115:41: error: incompatible type
> for
>       argument 2 of ‘vbicq_s32’
> 
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> ---
>  lib/librte_acl/acl_run_neon.h | 46 ++++++++++++++++++++-------------
> --
>  1 file changed, 27 insertions(+), 19 deletions(-)
> 
> 
>  
>  /*
> @@ -179,6 +183,9 @@ search_neon_8(const struct rte_acl_ctx *ctx,
> const uint8_t **data,
>  	acl_match_check_x4(0, ctx, parms, &flows, &index_array[0]);
>  	acl_match_check_x4(4, ctx, parms, &flows, &index_array[4]);
>  
> +	memset(&input0, 0, sizeof(input0));
> +	memset(&input1, 0, sizeof(input1));

Why is this memset only required for arm64? If it is a real issue, shouldn't
it also be required for x86 and ppc?
  
Aaron Conole April 10, 2019, 3:52 p.m. UTC | #2
Jerin Jacob Kollanukkaran <jerinj@marvell.com> writes:

> On Mon, 2019-04-08 at 14:24 -0400, Aaron Conole wrote:
>> -------------------------------------------------------------------
>> ---
>> Compiler complains of argument type mismatch, like:
>
> Can you share more details on how to reproduce this issue?

The error is generated by the meson build once NEON extension support is
enabled (which currently isn't happening on ARM when meson is the build
environment).

> We already have
> CFLAGS_acl_run_neon.o += -flax-vector-conversions
> in the Makefile.
>
> If you are taking out -flax-vector-conversions the correct way to
> fix will be use vreinterpret*.
>
> For me the code looks clean, If unnecessary casting is avoided.

I agree.  I merely make explicit the casts that the compiler will be
implicitly introducing.

>
>> 
>>    ../lib/librte_acl/acl_run_neon.h: In function ‘transition4’:
>>    ../lib/librte_acl/acl_run_neon.h:115:2: note: use -flax-vector-
>> conversions
>>       to permit conversions between vectors with differing element
>> types
>>       or numbers of subparts
>>      node_type = vbicq_s32(tr_hi_lo.val[0], index_msk);
>>      ^
>>    ../lib/librte_acl/acl_run_neon.h:115:41: error: incompatible type
>> for
>>       argument 2 of ‘vbicq_s32’
>> 
>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>> ---
>>  lib/librte_acl/acl_run_neon.h | 46 ++++++++++++++++++++-------------
>> --
>>  1 file changed, 27 insertions(+), 19 deletions(-)
>> 
>> 
>>  
>>  /*
>> @@ -179,6 +183,9 @@ search_neon_8(const struct rte_acl_ctx *ctx,
>> const uint8_t **data,
>>  	acl_match_check_x4(0, ctx, parms, &flows, &index_array[0]);
>>  	acl_match_check_x4(4, ctx, parms, &flows, &index_array[4]);
>>  
>> +	memset(&input0, 0, sizeof(input0));
>> +	memset(&input1, 0, sizeof(input1));
>
> Why this memset only required for arm64? If it real issue, Shouldn't
> it required for x86 and ppc ?

No.  Please see the following lines (the warning is due to the ARM NEON
intrinsic for setting individual lanes):

	while (flows.started > 0) {
		/* Gather 4 bytes of input data for each stream. */
		input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0);
		input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0);

Note: the first time through this loop, input0 and input1 appear on the
rhs of the assignment before appearing on the lhs.  This will generate
an uninitialized value warning, even though the assignments are to
individual lanes of the vector.

I squelched the warning from the compiler in the most brute-force way
possible.  Perhaps it would be better to use a static initialization for
the vector, but this code was intended as an RFC to generate feedback.

I guess one alternate approach could be:

   static const int32x4_t ZERO_VEC;
   int32x4_t input0 = ZERO_VEC, input1 = ZERO_VEC;

   ...

   int32x4_t input = ZERO_VEC;

This would have the benefit of keeping the initializer as 'fast' as
possible (although I recall that a memset under a certain size threshold
has the same effect, but I'm not certain).

Either way, I prefer it to squelching the warning, since the warning
has been found to catch legitimate errors many times.
  
Jerin Jacob Kollanukkaran April 10, 2019, 4:07 p.m. UTC | #3
On Wed, 2019-04-10 at 11:52 -0400, Aaron Conole wrote:
> Jerin Jacob Kollanukkaran <jerinj@marvell.com> writes:
> 
> > On Mon, 2019-04-08 at 14:24 -0400, Aaron Conole wrote:
> > > ---------------------------------------------------------------
> > > ----
> > > ---
> > > Compiler complains of argument type mismatch, like:
> > 
> > Can you share more details on how to reproduce this issue?
> 
> It will be generated using the meson build after enabling the neon
> extension support (which isn't currently happening on ARM using meson
> as
> the build environment).


Can you share the patch to enable this for testing.

Since the additional memset is in the fast path, I need to check the
overhead and whether it is possible to avoid it.


> 
> > We already have
> > CFLAGS_acl_run_neon.o += -flax-vector-conversions
> > in the Makefile.
> > 
> > If you are taking out -flax-vector-conversions the correct way to
> > fix will be use vreinterpret*.
> > 
> > For me the code looks clean, If unnecessary casting is avoided.
> 
> I agree.  I merely make explicit the casts that the compiler will be
> implicitly introducing.
> 
> > >    ../lib/librte_acl/acl_run_neon.h: In function ‘transition4’:
> > >    ../lib/librte_acl/acl_run_neon.h:115:2: note: use -flax-
> > > vector-
> > > conversions
> > >       to permit conversions between vectors with differing
> > > element
> > > types
> > >       or numbers of subparts
> > >      node_type = vbicq_s32(tr_hi_lo.val[0], index_msk);
> > >      ^
> > >    ../lib/librte_acl/acl_run_neon.h:115:41: error: incompatible
> > > type
> > > for
> > >       argument 2 of ‘vbicq_s32’
> > > 
> > > Signed-off-by: Aaron Conole <aconole@redhat.com>
> > > ---
> > >  lib/librte_acl/acl_run_neon.h | 46 ++++++++++++++++++++---------
> > > ----
> > > --
> > >  1 file changed, 27 insertions(+), 19 deletions(-)
> > > 
> > > 
> > >  
> > >  /*
> > > @@ -179,6 +183,9 @@ search_neon_8(const struct rte_acl_ctx *ctx,
> > > const uint8_t **data,
> > >  	acl_match_check_x4(0, ctx, parms, &flows, &index_array[0]);
> > >  	acl_match_check_x4(4, ctx, parms, &flows, &index_array[4]);
> > >  
> > > +	memset(&input0, 0, sizeof(input0));
> > > +	memset(&input1, 0, sizeof(input1));
> > 
> > Why this memset only required for arm64? If it real issue,
> > Shouldn't
> > it required for x86 and ppc ?
> 
> No.  Please see the following lines (which is due to the ARM neon
> intrinsic for setting individual lanes):
> 
> 	while (flows.started > 0) {
> 		/* Gather 4 bytes of input data for each stream. */
> 		input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0),
> input0, 0);
> 		input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4),
> input1, 0);
> 
> Note: the first time through this loop, input0 and input1 appear on
> the
> rhs of the assignment before appearing on the lhs.  This will
> generate
> an uninitialized value warning, even though the assignments are to
> individual lanes of the vector.
> 
> I squelched the warning from the compiler in the most brute-force way
> possible.  Perhaps it would be better to use a static initialization
> for
> the vector but this code was intended to be RFC and to generate
> feedback.
> 
> I guess one alternate approach could be:
> 
>    static const int32x4_t ZERO_VEC;
>    int32x4_t input0 = ZERO_VEC, input1 = ZERO_VEC;
> 
>    ...
> 
>    int32x4_t input = ZERO_VEC;
> 
> This would have the benefit of keeping the initializer as 'fast' as
> possible (although I recall a memset under a certain size threshold
> is
> the same effect, but not certain).
> 
> Either way, I prefer it to squelching the warning, since the warning
> has been found to catch legitimate errors many times.

I will get back to this after reproducing the issue locally.
  
Aaron Conole April 10, 2019, 5:20 p.m. UTC | #4
Jerin Jacob Kollanukkaran <jerinj@marvell.com> writes:

> On Wed, 2019-04-10 at 11:52 -0400, Aaron Conole wrote:
>> Jerin Jacob Kollanukkaran <jerinj@marvell.com> writes:
>> 
>> > On Mon, 2019-04-08 at 14:24 -0400, Aaron Conole wrote:
>> > > ---------------------------------------------------------------
>> > > ----
>> > > ---
>> > > Compiler complains of argument type mismatch, like:
>> > 
>> > Can you share more details on how to reproduce this issue?
>> 
>> It will be generated using the meson build after enabling the neon
>> extension support (which isn't currently happening on ARM using meson
>> as
>> the build environment).
>
>
> Can you share the patch to enable this for testing.

Sure - I'm using these:

(needed)
1/3 - http://mails.dpdk.org/archives/dev/2019-March/128304.html
2/3 - http://mails.dpdk.org/archives/dev/2019-March/128305.html

(following only needed for travis support)
3/3 - http://mails.dpdk.org/archives/dev/2019-March/128306.html

-Aaron

> Since the additional memcpy in fastpath, I need to check the overhead
> and check the possibility to avoid the memcpy to case.
>
>
>> 
>> > We already have
>> > CFLAGS_acl_run_neon.o += -flax-vector-conversions
>> > in the Makefile.
>> > 
>> > If you are taking out -flax-vector-conversions the correct way to
>> > fix will be use vreinterpret*.
>> > 
>> > For me the code looks clean, If unnecessary casting is avoided.
>> 
>> I agree.  I merely make explicit the casts that the compiler will be
>> implicitly introducing.
>> 
>> > >    ../lib/librte_acl/acl_run_neon.h: In function ‘transition4’:
>> > >    ../lib/librte_acl/acl_run_neon.h:115:2: note: use -flax-
>> > > vector-
>> > > conversions
>> > >       to permit conversions between vectors with differing
>> > > element
>> > > types
>> > >       or numbers of subparts
>> > >      node_type = vbicq_s32(tr_hi_lo.val[0], index_msk);
>> > >      ^
>> > >    ../lib/librte_acl/acl_run_neon.h:115:41: error: incompatible
>> > > type
>> > > for
>> > >       argument 2 of ‘vbicq_s32’
>> > > 
>> > > Signed-off-by: Aaron Conole <aconole@redhat.com>
>> > > ---
>> > >  lib/librte_acl/acl_run_neon.h | 46 ++++++++++++++++++++---------
>> > > ----
>> > > --
>> > >  1 file changed, 27 insertions(+), 19 deletions(-)
>> > > 
>> > > 
>> > >  
>> > >  /*
>> > > @@ -179,6 +183,9 @@ search_neon_8(const struct rte_acl_ctx *ctx,
>> > > const uint8_t **data,
>> > >  	acl_match_check_x4(0, ctx, parms, &flows, &index_array[0]);
>> > >  	acl_match_check_x4(4, ctx, parms, &flows, &index_array[4]);
>> > >  
>> > > +	memset(&input0, 0, sizeof(input0));
>> > > +	memset(&input1, 0, sizeof(input1));
>> > 
>> > Why this memset only required for arm64? If it real issue,
>> > Shouldn't
>> > it required for x86 and ppc ?
>> 
>> No.  Please see the following lines (which is due to the ARM neon
>> intrinsic for setting individual lanes):
>> 
>> 	while (flows.started > 0) {
>> 		/* Gather 4 bytes of input data for each stream. */
>> 		input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0),
>> input0, 0);
>> 		input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4),
>> input1, 0);
>> 
>> Note: the first time through this loop, input0 and input1 appear on
>> the
>> rhs of the assignment before appearing on the lhs.  This will
>> generate
>> an uninitialized value warning, even though the assignments are to
>> individual lanes of the vector.
>> 
>> I squelched the warning from the compiler in the most brute-force way
>> possible.  Perhaps it would be better to use a static initialization
>> for
>> the vector but this code was intended to be RFC and to generate
>> feedback.
>> 
>> I guess one alternate approach could be:
>> 
>>    static const int32x4_t ZERO_VEC;
>>    int32x4_t input0 = ZERO_VEC, input1 = ZERO_VEC;
>> 
>>    ...
>> 
>>    int32x4_t input = ZERO_VEC;
>> 
>> This would have the benefit of keeping the initializer as 'fast' as
>> possible (although I recall a memset under a certain size threshold
>> is
>> the same effect, but not certain).
>> 
>> Either way, I prefer it to squelching the warning, since the warning
>> has been found to catch legitimate errors many times.
>
> I will get back to this after reproducing the issue locally.

Awesome - thanks.
  
Aaron Conole April 30, 2019, 12:57 p.m. UTC | #5
Aaron Conole <aconole@redhat.com> writes:

> Jerin Jacob Kollanukkaran <jerinj@marvell.com> writes:
>
>> On Wed, 2019-04-10 at 11:52 -0400, Aaron Conole wrote:
>>> Jerin Jacob Kollanukkaran <jerinj@marvell.com> writes:
>>> 
>>> > On Mon, 2019-04-08 at 14:24 -0400, Aaron Conole wrote:
>>> > > ---------------------------------------------------------------
>>> > > ----
>>> > > ---
>>> > > Compiler complains of argument type mismatch, like:
>>> > 
>>> > Can you share more details on how to reproduce this issue?
>>> 
>>> It will be generated using the meson build after enabling the neon
>>> extension support (which isn't currently happening on ARM using meson
>>> as
>>> the build environment).
>>
>>
>> Can you share the patch to enable this for testing.
>
> Sure - I'm using these:
>
> (needed)
> 1/3 - http://mails.dpdk.org/archives/dev/2019-March/128304.html
> 2/3 - http://mails.dpdk.org/archives/dev/2019-March/128305.html
>
> (following only needed for travis support)
> 3/3 - http://mails.dpdk.org/archives/dev/2019-March/128306.html
>
> -Aaron
>
>> Since the additional memcpy in fastpath, I need to check the overhead
>> and check the possibility to avoid the memcpy to case.

Were you able to test this?

>>
>>> 
>>> > We already have
>>> > CFLAGS_acl_run_neon.o += -flax-vector-conversions
>>> > in the Makefile.
>>> > 
>>> > If you are taking out -flax-vector-conversions the correct way to
>>> > fix will be use vreinterpret*.
>>> > 
>>> > For me the code looks clean, If unnecessary casting is avoided.
>>> 
>>> I agree.  I merely make explicit the casts that the compiler will be
>>> implicitly introducing.
>>> 
>>> > >    ../lib/librte_acl/acl_run_neon.h: In function ‘transition4’:
>>> > >    ../lib/librte_acl/acl_run_neon.h:115:2: note: use -flax-
>>> > > vector-
>>> > > conversions
>>> > >       to permit conversions between vectors with differing
>>> > > element
>>> > > types
>>> > >       or numbers of subparts
>>> > >      node_type = vbicq_s32(tr_hi_lo.val[0], index_msk);
>>> > >      ^
>>> > >    ../lib/librte_acl/acl_run_neon.h:115:41: error: incompatible
>>> > > type
>>> > > for
>>> > >       argument 2 of ‘vbicq_s32’
>>> > > 
>>> > > Signed-off-by: Aaron Conole <aconole@redhat.com>
>>> > > ---
>>> > >  lib/librte_acl/acl_run_neon.h | 46 ++++++++++++++++++++---------
>>> > > ----
>>> > > --
>>> > >  1 file changed, 27 insertions(+), 19 deletions(-)
>>> > > 
>>> > > 
>>> > >  
>>> > >  /*
>>> > > @@ -179,6 +183,9 @@ search_neon_8(const struct rte_acl_ctx *ctx,
>>> > > const uint8_t **data,
>>> > >  	acl_match_check_x4(0, ctx, parms, &flows, &index_array[0]);
>>> > >  	acl_match_check_x4(4, ctx, parms, &flows, &index_array[4]);
>>> > >  
>>> > > +	memset(&input0, 0, sizeof(input0));
>>> > > +	memset(&input1, 0, sizeof(input1));
>>> > 
>>> > Why this memset only required for arm64? If it real issue,
>>> > Shouldn't
>>> > it required for x86 and ppc ?
>>> 
>>> No.  Please see the following lines (which is due to the ARM neon
>>> intrinsic for setting individual lanes):
>>> 
>>> 	while (flows.started > 0) {
>>> 		/* Gather 4 bytes of input data for each stream. */
>>> 		input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0),
>>> input0, 0);
>>> 		input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4),
>>> input1, 0);
>>> 
>>> Note: the first time through this loop, input0 and input1 appear on
>>> the
>>> rhs of the assignment before appearing on the lhs.  This will
>>> generate
>>> an uninitialized value warning, even though the assignments are to
>>> individual lanes of the vector.
>>> 
>>> I squelched the warning from the compiler in the most brute-force way
>>> possible.  Perhaps it would be better to use a static initialization
>>> for
>>> the vector but this code was intended to be RFC and to generate
>>> feedback.
>>> 
>>> I guess one alternate approach could be:
>>> 
>>>    static const int32x4_t ZERO_VEC;
>>>    int32x4_t input0 = ZERO_VEC, input1 = ZERO_VEC;
>>> 
>>>    ...
>>> 
>>>    int32x4_t input = ZERO_VEC;
>>> 
>>> This would have the benefit of keeping the initializer as 'fast' as
>>> possible (although I recall a memset under a certain size threshold
>>> is
>>> the same effect, but not certain).
>>> 
>>> Either way, I prefer it to squelching the warning, since the warning
>>> has been found to catch legitimate errors many times.
>>
>> I will get back to this after reproducing the issue locally.
>
> Awesome - thanks.
  
Jerin Jacob Kollanukkaran June 5, 2019, 3:16 p.m. UTC | #6
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran
> Sent: Wednesday, April 10, 2019 8:10 PM
> To: dev@dpdk.org; aconole@redhat.com
> Cc: gavin.hu@arm.com; konstantin.ananyev@intel.com
> Subject: Re: [EXT] [PATCH 1/3] acl: fix arm argument types
> 
> On Mon, 2019-04-08 at 14:24 -0400, Aaron Conole wrote:
> > -------------------------------------------------------------------
> > ---
> > Compiler complains of argument type mismatch, like:
> 
> Can you share more details on how to reproduce this issue?
> 
> We already have
> CFLAGS_acl_run_neon.o += -flax-vector-conversions in the Makefile.
> 
> If you are taking out -flax-vector-conversions the correct way to fix will be
> use vreinterpret*.
> 
> For me the code looks clean, If unnecessary casting is avoided.


Considering that the following patch is now part of dpdk.org, I think we may not
need this patch, which has the benefit of avoiding a lot of typecasting.

https://git.dpdk.org/dpdk/commit/?id=e53ce4e4137974f46743e74bd9ab912e0166c8b1




> 
> 
> >
> >    ../lib/librte_acl/acl_run_neon.h: In function ‘transition4’:
> >    ../lib/librte_acl/acl_run_neon.h:115:2: note: use -flax-vector-
> > conversions
> >       to permit conversions between vectors with differing element
> > types
> >       or numbers of subparts
> >      node_type = vbicq_s32(tr_hi_lo.val[0], index_msk);
> >      ^
> >    ../lib/librte_acl/acl_run_neon.h:115:41: error: incompatible type
> > for
> >       argument 2 of ‘vbicq_s32’
> >
> > Signed-off-by: Aaron Conole <aconole@redhat.com>
> > ---
> >  lib/librte_acl/acl_run_neon.h | 46 ++++++++++++++++++++-------------
> > --
> >  1 file changed, 27 insertions(+), 19 deletions(-)
> >
> >
> >
> >  /*
> > @@ -179,6 +183,9 @@ search_neon_8(const struct rte_acl_ctx *ctx, const
> > uint8_t **data,
> >  	acl_match_check_x4(0, ctx, parms, &flows, &index_array[0]);
> >  	acl_match_check_x4(4, ctx, parms, &flows, &index_array[4]);
> >
> > +	memset(&input0, 0, sizeof(input0));
> > +	memset(&input1, 0, sizeof(input1));
> 
> Why this memset only required for arm64? If it real issue, Shouldn't it
> required for x86 and ppc ?
>
  
Aaron Conole June 5, 2019, 5:09 p.m. UTC | #7
Jerin Jacob Kollanukkaran <jerinj@marvell.com> writes:

>> -----Original Message-----
>> From: Jerin Jacob Kollanukkaran
>> Sent: Wednesday, April 10, 2019 8:10 PM
>> To: dev@dpdk.org; aconole@redhat.com
>> Cc: gavin.hu@arm.com; konstantin.ananyev@intel.com
>> Subject: Re: [EXT] [PATCH 1/3] acl: fix arm argument types
>> 
>> On Mon, 2019-04-08 at 14:24 -0400, Aaron Conole wrote:
>> > -------------------------------------------------------------------
>> > ---
>> > Compiler complains of argument type mismatch, like:
>> 
>> Can you share more details on how to reproduce this issue?
>> 
>> We already have
>> CFLAGS_acl_run_neon.o += -flax-vector-conversions in the Makefile.
>> 
>> If you are taking out -flax-vector-conversions the correct way to fix will be
>> use vreinterpret*.
>> 
>> For me the code looks clean, If unnecessary casting is avoided.
>
>
> Considering the following patch is part of dpdk.org now. I think, We may not need this
> patch in benefit to avoid a lot of typecasting.
>
> https://git.dpdk.org/dpdk/commit/?id=e53ce4e4137974f46743e74bd9ab912e0166c8b1

Correct, the lax conversions aren't needed.

>
>
>
>> 
>> 
>> >
>> >    ../lib/librte_acl/acl_run_neon.h: In function ‘transition4’:
>> >    ../lib/librte_acl/acl_run_neon.h:115:2: note: use -flax-vector-
>> > conversions
>> >       to permit conversions between vectors with differing element
>> > types
>> >       or numbers of subparts
>> >      node_type = vbicq_s32(tr_hi_lo.val[0], index_msk);
>> >      ^
>> >    ../lib/librte_acl/acl_run_neon.h:115:41: error: incompatible type
>> > for
>> >       argument 2 of ‘vbicq_s32’
>> >
>> > Signed-off-by: Aaron Conole <aconole@redhat.com>
>> > ---
>> >  lib/librte_acl/acl_run_neon.h | 46 ++++++++++++++++++++-------------
>> > --
>> >  1 file changed, 27 insertions(+), 19 deletions(-)
>> >
>> >
>> >
>> >  /*
>> > @@ -179,6 +183,9 @@ search_neon_8(const struct rte_acl_ctx *ctx, const
>> > uint8_t **data,
>> >  	acl_match_check_x4(0, ctx, parms, &flows, &index_array[0]);
>> >  	acl_match_check_x4(4, ctx, parms, &flows, &index_array[4]);
>> >
>> > +	memset(&input0, 0, sizeof(input0));
>> > +	memset(&input1, 0, sizeof(input1));
>> 
>> Why this memset only required for arm64? If it real issue, Shouldn't it
>> required for x86 and ppc ?
>> 

Something for this part is still needed (see for example:
https://travis-ci.com/DPDK/dpdk/jobs/205675369).

I have two alternate approaches, but neither has even been compile tested
(plus the obvious '-Wno-maybe-uninitialized', but I dislike that
 approach because it would afflict all routines):

1.  Something like this:

@@ -181,8 +181,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	while (flows.started > 0) {
 		/* Gather 4 bytes of input data for each stream. */
-		input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0);
-		input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0);
+		input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), vdupq_n_s32(0), 0);
+		input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), vdupq_n_s32(0), 0);
 
 		input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1);
 		input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1);
@@ -242,7 +242,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	while (flows.started > 0) {
 		/* Gather 4 bytes of input data for each stream. */
-		input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0);
+		input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), vdupq_n_s32(0), 0);
 		input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1);
 		input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2);
 		input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3);

---------

2.  Something like this:

diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h
index a055a8240..0eb42865a 100644
--- a/lib/librte_acl/acl_run_neon.h
+++ b/lib/librte_acl/acl_run_neon.h
@@ -165,7 +165,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
        uint64_t index_array[8];
        struct completion cmplt[8];
        struct parms parms[8];
-       int32x4_t input0, input1;
+       static int32x4_t ZERO_VAL;
+       int32x4_t input0 = ZERO_VAL, input1 = ZERO_VAL;
 
        acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
                     total_packets, categories, ctx->trans_table);
@@ -181,8 +182,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
        while (flows.started > 0) {
                /* Gather 4 bytes of input data for each stream. */
-               input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), vdupq_n_s32(0), 0);
-               input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), vdupq_n_s32(0), 0);
+               input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0);
+               input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0);
 
                input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1);
                input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1);
@@ -227,7 +228,8 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
        uint64_t index_array[4];
        struct completion cmplt[4];
        struct parms parms[4];
-       int32x4_t input;
+       static int32x4_t ZERO_VAL;
+       int32x4_t input = ZERO_VAL;
 
        acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
                     total_packets, categories, ctx->trans_table);
@@ -242,7 +244,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
        while (flows.started > 0) {
                /* Gather 4 bytes of input data for each stream. */
-               input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), vdupq_n_s32(0), 0);
+               input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0);
                input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1);
                input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2);
                input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3);
---

WDYT?
  

Patch

diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h
index 01b9766d8..4a8e4b681 100644
--- a/lib/librte_acl/acl_run_neon.h
+++ b/lib/librte_acl/acl_run_neon.h
@@ -112,37 +112,41 @@  transition4(int32x4_t next_input, const uint64_t *trans, uint64_t transitions[])
 	index_msk = vld1q_u32((const uint32_t *)&neon_acl_const.xmm_index_mask);
 
 	/* Calc node type and node addr */
-	node_type = vbicq_s32(tr_hi_lo.val[0], index_msk);
-	addr = vandq_s32(tr_hi_lo.val[0], index_msk);
+	node_type = (uint32x4_t) vbicq_s32(tr_hi_lo.val[0],
+				       (int32x4_t)index_msk);
+	addr = (uint32x4_t) vandq_s32(tr_hi_lo.val[0], (int32x4_t) index_msk);
 
 	/* t = 0 */
-	t = veorq_s32(node_type, node_type);
+	t = veorq_s32((int32x4_t)node_type, (int32x4_t)node_type);
 
 	/* mask for DFA type(0) nodes */
-	dfa_msk = vceqq_u32(node_type, t);
+	dfa_msk = vceqq_u32(node_type, (uint32x4_t)t);
 
-	mask = vld1q_s32((const int32_t *)&neon_acl_const.xmm_shuffle_input);
-	in = vqtbl1q_u8((uint8x16_t)next_input, (uint8x16_t)mask);
+	mask = (uint32x4_t)
+	       vld1q_s32((const int32_t *)&neon_acl_const.xmm_shuffle_input);
+	in = (int32x4_t) vqtbl1q_u8((uint8x16_t)next_input, (uint8x16_t)mask);
 
 	/* DFA calculations. */
-	r = vshrq_n_u32(in, 30); /* div by 64 */
-	mask = vld1q_s32((const int32_t *)&neon_acl_const.range_base);
-	r = vaddq_u8(r, mask);
-	t = vshrq_n_u32(in, 24);
-	r = vqtbl1q_u8((uint8x16_t)tr_hi_lo.val[1], (uint8x16_t)r);
-	dfa_ofs = vsubq_s32(t, r);
+	r = (int32x4_t) vshrq_n_u32((uint32x4_t) in, 30); /* div by 64 */
+	mask = (uint32x4_t)
+	       vld1q_s32((const int32_t *)&neon_acl_const.range_base);
+	r = (int32x4_t) vaddq_u8((uint8x16_t)r, (uint8x16_t)mask);
+	t = (int32x4_t) vshrq_n_u32((uint32x4_t)in, 24);
+	r = (int32x4_t) vqtbl1q_u8((uint8x16_t)tr_hi_lo.val[1], (uint8x16_t)r);
+	dfa_ofs = (uint32x4_t) vsubq_s32(t, r);
 
 	/* QUAD/SINGLE calculations. */
-	t = vcgtq_s8(in, tr_hi_lo.val[1]);
-	t = vabsq_s8(t);
-	t = vpaddlq_u8(t);
-	quad_ofs = vpaddlq_u16(t);
+	t = (int32x4_t) vcgtq_s8((int8x16_t)in, (int8x16_t)tr_hi_lo.val[1]);
+	t = (int32x4_t) vabsq_s8((int8x16_t)t);
+	t = (int32x4_t) vpaddlq_u8((uint8x16_t)t);
+	quad_ofs = vpaddlq_u16((uint16x8_t)t);
 
 	/* blend DFA and QUAD/SINGLE. */
-	t = vbslq_u8(dfa_msk, dfa_ofs, quad_ofs);
+	t = (int32x4_t) vbslq_u8((uint8x16_t)dfa_msk, (uint8x16_t)dfa_ofs,
+				 (uint8x16_t)quad_ofs);
 
 	/* calculate address for next transitions */
-	addr = vaddq_u32(addr, t);
+	addr = vaddq_u32(addr, (uint32x4_t)t);
 
 	/* Fill next transitions */
 	transitions[0] = trans[vgetq_lane_u32(addr, 0)];
@@ -150,7 +154,7 @@  transition4(int32x4_t next_input, const uint64_t *trans, uint64_t transitions[])
 	transitions[2] = trans[vgetq_lane_u32(addr, 2)];
 	transitions[3] = trans[vgetq_lane_u32(addr, 3)];
 
-	return vshrq_n_u32(next_input, CHAR_BIT);
+	return (int32x4_t) vshrq_n_u32((uint32x4_t)next_input, CHAR_BIT);
 }
 
 /*
@@ -179,6 +183,9 @@  search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	acl_match_check_x4(0, ctx, parms, &flows, &index_array[0]);
 	acl_match_check_x4(4, ctx, parms, &flows, &index_array[4]);
 
+	memset(&input0, 0, sizeof(input0));
+	memset(&input1, 0, sizeof(input1));
+
 	while (flows.started > 0) {
 		/* Gather 4 bytes of input data for each stream. */
 		input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0);
@@ -240,6 +247,7 @@  search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	/* Check for any matches. */
 	acl_match_check_x4(0, ctx, parms, &flows, index_array);
 
+	memset(&input, 0, sizeof(input));
 	while (flows.started > 0) {
 		/* Gather 4 bytes of input data for each stream. */
 		input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0);