[1/2] net/mlx5: remove redundant operations

Message ID 20210601083055.97261-2-ruifeng.wang@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Raslan Darawsheh
Headers
Series MLX5 PMD tuning |

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Ruifeng Wang June 1, 2021, 8:30 a.m. UTC
  Some operations on mask are redundant and can be removed.
The change yielded 1.6% performance gain on N1SDP.
On ThunderX2, slight performance uplift was also observed.

Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
Cc: stable@dpdk.org

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)
  

Comments

Slava Ovsiienko July 2, 2021, 8:12 a.m. UTC | #1
Hi, Ruifeng

> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Tuesday, June 1, 2021 11:31
> To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>;
> stable@dpdk.org
> Subject: [PATCH 1/2] net/mlx5: remove redundant operations
> 
> Some operations on mask are redundant and can be removed.
> The change yielded 1.6% performance gain on N1SDP.
> On ThunderX2, slight performance uplift was also observed.
> 
> Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> index 2234fbe6b2..98a75b09c6 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> @@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> volatile struct mlx5_cqe *cq,
>  					  comp_mask), 0)) /
>  					  (sizeof(uint16_t) * 8);
>  		/* D.6 mask out entries after the compressed CQE. */
> -		mask = vcreate_u16(comp_idx <
> MLX5_VPMD_DESCS_PER_LOOP ?
> -				   -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
> -				   0);
> -		invalid_mask = vorr_u16(invalid_mask, mask);
> +		invalid_mask = vorr_u16(invalid_mask, comp_mask);

Mmmm... I'm not sure we can drop the masking compressed (and following) CQE skip.
Let's consider the completion scenario (the series of 4 CQEs, each element is 64B long)

0: normal uncompressed CQE, ownership OK, format uncompressed, opcode OK, no error
1: compressed CQE, ownership OK, format compressed, opcode OK, no error
2: miniCQE array, format can be any!!, may be discovered as ownership OK, format uncompressed, opcode OK, no error
3: miniCQE array, format can be any!!, may be discovered as ownership OK, format uncompressed, opcode OK, no error

Obviously, we should unconditionally mask out 2 and 3, regardless of recognized their formats/opcode/error/etc.
I think we can get the diff above and skip diff below:

>  		/* D.7 count non-compressed valid CQEs. */
>  		n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
>  				   invalid_mask), 0)) / (sizeof(uint16_t) * 8);
>  		nocmp_n += n;
> -		/* D.2 get the final invalid mask. */
> -		mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
> -				   -1UL >> (n * sizeof(uint16_t) * 8) : 0);
> -		invalid_mask = vorr_u16(invalid_mask, mask);

and get the correct final invalid_mask - all compressed and invalid CQEs and following ones will be masked out.

With best regards,
Slava
  
Ruifeng Wang July 2, 2021, 10:30 a.m. UTC | #2
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Friday, July 2, 2021 4:13 PM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> <shahafs@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
> 
> Hi, Ruifeng
Hi, Slava

> 
> > -----Original Message-----
> > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > Sent: Tuesday, June 1, 2021 11:31
> > To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> > Ovsiienko <viacheslavo@nvidia.com>
> > Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> > honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>;
> > stable@dpdk.org
> > Subject: [PATCH 1/2] net/mlx5: remove redundant operations
> >
> > Some operations on mask are redundant and can be removed.
> > The change yielded 1.6% performance gain on N1SDP.
> > On ThunderX2, slight performance uplift was also observed.
> >
> > Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
> >  1 file changed, 1 insertion(+), 8 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > index 2234fbe6b2..98a75b09c6 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > @@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> > volatile struct mlx5_cqe *cq,
> >  					  comp_mask), 0)) /
> >  					  (sizeof(uint16_t) * 8);
> >  		/* D.6 mask out entries after the compressed CQE. */
> > -		mask = vcreate_u16(comp_idx <
> > MLX5_VPMD_DESCS_PER_LOOP ?
> > -				   -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
> > -				   0);
> > -		invalid_mask = vorr_u16(invalid_mask, mask);
> > +		invalid_mask = vorr_u16(invalid_mask, comp_mask);
> 
> Mmmm... I'm not sure we can drop the masking compressed (and following)
> CQE skip.
> Let's consider the completion scenario (the series of 4 CQEs, each element is
> 64B long)
> 
> 0: normal uncompressed CQE, ownership OK, format uncompressed, opcode
> OK, no error
> 1: compressed CQE, ownership OK, format compressed, opcode OK, no error
> 2: miniCQE array, format can be any!!, may be discovered as ownership OK,
> format uncompressed, opcode OK, no error
> 3: miniCQE array, format can be any!!, may be discovered as ownership OK,
> format uncompressed, opcode OK, no error

Thanks for your review and explanation about CQE processing details.
I did the change based on the fact that some calculations doesn't change the data. 
So some intermediate calculations were removed.

In the above diff section, result of 'mask' always equals to the nearest 'comp_mask' that above it.
So I just remoed 'mask' and use 'comp_mask' instead.
> 
> Obviously, we should unconditionally mask out 2 and 3, regardless of
> recognized their formats/opcode/error/etc.
> I think we can get the diff above and skip diff below:
> 
> >  		/* D.7 count non-compressed valid CQEs. */
> >  		n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
> >  				   invalid_mask), 0)) / (sizeof(uint16_t) * 8);
> >  		nocmp_n += n;
> > -		/* D.2 get the final invalid mask. */
> > -		mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
> > -				   -1UL >> (n * sizeof(uint16_t) * 8) : 0);
> > -		invalid_mask = vorr_u16(invalid_mask, mask);
> 
> and get the correct final invalid_mask - all compressed and invalid CQEs and
> following ones will be masked out.

This diff section is similar to the previous one.
'mask' always equals to the nearest 'invalid_mask' that above it.
So entire line "invalid_mask = vorr_u16(invalid_mask, mask);" can be removed.

Code logic is not changed. But I'm not sure the code change impacts readability
or maintainability that you may concern.

Thanks.
> 
> With best regards,
> Slava
  
Slava Ovsiienko July 5, 2021, 10:01 a.m. UTC | #3
Hi, Ruifeng

The invalid_mask is used to set error flags and calculate the statistics.
So, all the CQEs the first one with error or invalid status should be masked out
(and the CQEs after that).

IMO, what we could improve (apply just the part of the patch below):
>>>>
index 2234fbe6b2..98a75b09c6 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -768,18 +768,11 @@  rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 					  comp_mask), 0)) /
 					  (sizeof(uint16_t) * 8);
 		/* D.6 mask out entries after the compressed CQE. */
-		mask = vcreate_u16(comp_idx < MLX5_VPMD_DESCS_PER_LOOP ?
-				   -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
-				   0);
-		invalid_mask = vorr_u16(invalid_mask, mask);
+		invalid_mask = vorr_u16(invalid_mask, comp_mask);
 		/* D.7 count non-compressed valid CQEs. */
 		n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
 				   invalid_mask), 0)) / (sizeof(uint16_t) * 8);
 		nocmp_n += n;
<<<<

And that's it. The rest of the patch:
>>>>
-		/* D.2 get the final invalid mask. */
-		mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
-				   -1UL >> (n * sizeof(uint16_t) * 8) : 0);
-		invalid_mask = vorr_u16(invalid_mask, mask);
<<<<
Should not be applied, otherwise the following might be affected:

opcode = vbic_u16(opcode, invalid_mask);
...
opcode = vbic_u16(opcode, invalid_mask);

With best regards,
Slava

> -----Original Message-----
> From: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Sent: Friday, July 2, 2021 13:30
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> <shahafs@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd
> <nd@arm.com>
> Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
> 
> > -----Original Message-----
> > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > Sent: Friday, July 2, 2021 4:13 PM
> > To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Raslan Darawsheh
> > <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> > <shahafs@nvidia.com>
> > Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> > Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
> >
> > Hi, Ruifeng
> Hi, Slava
> 
> >
> > > -----Original Message-----
> > > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Sent: Tuesday, June 1, 2021 11:31
> > > To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> > > Ovsiienko <viacheslavo@nvidia.com>
> > > Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> > > honnappa.nagarahalli@arm.com; Ruifeng Wang
> <ruifeng.wang@arm.com>;
> > > stable@dpdk.org
> > > Subject: [PATCH 1/2] net/mlx5: remove redundant operations
> > >
> > > Some operations on mask are redundant and can be removed.
> > > The change yielded 1.6% performance gain on N1SDP.
> > > On ThunderX2, slight performance uplift was also observed.
> > >
> > > Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> > >  drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
> > >  1 file changed, 1 insertion(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > index 2234fbe6b2..98a75b09c6 100644
> > > --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > @@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> > > volatile struct mlx5_cqe *cq,
> > >  					  comp_mask), 0)) /
> > >  					  (sizeof(uint16_t) * 8);
> > >  		/* D.6 mask out entries after the compressed CQE. */
> > > -		mask = vcreate_u16(comp_idx <
> > > MLX5_VPMD_DESCS_PER_LOOP ?
> > > -				   -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
> > > -				   0);
> > > -		invalid_mask = vorr_u16(invalid_mask, mask);
> > > +		invalid_mask = vorr_u16(invalid_mask, comp_mask);
> >
> > Mmmm... I'm not sure we can drop the masking compressed (and
> > following) CQE skip.
> > Let's consider the completion scenario (the series of 4 CQEs, each
> > element is 64B long)
> >
> > 0: normal uncompressed CQE, ownership OK, format uncompressed, opcode
> > OK, no error
> > 1: compressed CQE, ownership OK, format compressed, opcode OK, no
> > error
> > 2: miniCQE array, format can be any!!, may be discovered as ownership
> > OK, format uncompressed, opcode OK, no error
> > 3: miniCQE array, format can be any!!, may be discovered as ownership
> > OK, format uncompressed, opcode OK, no error
> 
> Thanks for your review and explanation about CQE processing details.
> I did the change based on the fact that some calculations doesn't change the
> data.
> So some intermediate calculations were removed.
> 
> In the above diff section, result of 'mask' always equals to the nearest
> 'comp_mask' that above it.
> So I just remoed 'mask' and use 'comp_mask' instead.
> >
> > Obviously, we should unconditionally mask out 2 and 3, regardless of
> > recognized their formats/opcode/error/etc.
> > I think we can get the diff above and skip diff below:
> >
> > >  		/* D.7 count non-compressed valid CQEs. */
> > >  		n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
> > >  				   invalid_mask), 0)) / (sizeof(uint16_t) * 8);
> > >  		nocmp_n += n;
> > > -		/* D.2 get the final invalid mask. */
> > > -		mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
> > > -				   -1UL >> (n * sizeof(uint16_t) * 8) : 0);
> > > -		invalid_mask = vorr_u16(invalid_mask, mask);
> >
> > and get the correct final invalid_mask - all compressed and invalid
> > CQEs and following ones will be masked out.
> 
> This diff section is similar to the previous one.
> 'mask' always equals to the nearest 'invalid_mask' that above it.
> So entire line "invalid_mask = vorr_u16(invalid_mask, mask);" can be removed.
> 
> Code logic is not changed. But I'm not sure the code change impacts readability
> or maintainability that you may concern.
> 
> Thanks.
> >
> > With best regards,
> > Slava
  
Ruifeng Wang July 7, 2021, 8 a.m. UTC | #4
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Monday, July 5, 2021 6:02 PM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> <shahafs@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd
> <nd@arm.com>
> Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
> 
> Hi, Ruifeng
> 
> The invalid_mask is used to set error flags and calculate the statistics.
> So, all the CQEs the first one with error or invalid status should be masked
> out (and the CQEs after that).
Now I understand it. What I was missing is inconsecutive mask bits.
Thanks for your patience.
I'll update in next version.

> 
> IMO, what we could improve (apply just the part of the patch below):
> >>>>
> index 2234fbe6b2..98a75b09c6 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> @@ -768,18 +768,11 @@  rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> volatile struct mlx5_cqe *cq,
>  					  comp_mask), 0)) /
>  					  (sizeof(uint16_t) * 8);
>  		/* D.6 mask out entries after the compressed CQE. */
> -		mask = vcreate_u16(comp_idx <
> MLX5_VPMD_DESCS_PER_LOOP ?
> -				   -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
> -				   0);
> -		invalid_mask = vorr_u16(invalid_mask, mask);
> +		invalid_mask = vorr_u16(invalid_mask, comp_mask);
>  		/* D.7 count non-compressed valid CQEs. */
>  		n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
>  				   invalid_mask), 0)) / (sizeof(uint16_t) * 8);
>  		nocmp_n += n;
> <<<<
> 
> And that's it. The rest of the patch:
> >>>>
> -		/* D.2 get the final invalid mask. */
> -		mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
> -				   -1UL >> (n * sizeof(uint16_t) * 8) : 0);
> -		invalid_mask = vorr_u16(invalid_mask, mask);
> <<<<
> Should not be applied, otherwise the following might be affected:
> 
> opcode = vbic_u16(opcode, invalid_mask); ...
> opcode = vbic_u16(opcode, invalid_mask);
> 
> With best regards,
> Slava
> 
> > -----Original Message-----
> > From: Ruifeng Wang <Ruifeng.Wang@arm.com>
> > Sent: Friday, July 2, 2021 13:30
> > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Raslan Darawsheh
> > <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> > <shahafs@nvidia.com>
> > Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd
> > <nd@arm.com>
> > Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
> >
> > > -----Original Message-----
> > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > Sent: Friday, July 2, 2021 4:13 PM
> > > To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Raslan Darawsheh
> > > <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf
> Shuler
> > > <shahafs@nvidia.com>
> > > Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> > > Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> > > Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
> > >
> > > Hi, Ruifeng
> > Hi, Slava
> >
> > >
> > > > -----Original Message-----
> > > > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Sent: Tuesday, June 1, 2021 11:31
> > > > To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> > > > Ovsiienko <viacheslavo@nvidia.com>
> > > > Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> > > > honnappa.nagarahalli@arm.com; Ruifeng Wang
> > <ruifeng.wang@arm.com>;
> > > > stable@dpdk.org
> > > > Subject: [PATCH 1/2] net/mlx5: remove redundant operations
> > > >
> > > > Some operations on mask are redundant and can be removed.
> > > > The change yielded 1.6% performance gain on N1SDP.
> > > > On ThunderX2, slight performance uplift was also observed.
> > > >
> > > > Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for
> > > > ARM")
> > > > Cc: stable@dpdk.org
> > > >
> > > > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > ---
> > > >  drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
> > > >  1 file changed, 1 insertion(+), 8 deletions(-)
> > > >
> > > > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > > b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > > index 2234fbe6b2..98a75b09c6 100644
> > > > --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > > +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > > @@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> > > > volatile struct mlx5_cqe *cq,
> > > >  					  comp_mask), 0)) /
> > > >  					  (sizeof(uint16_t) * 8);
> > > >  		/* D.6 mask out entries after the compressed CQE. */
> > > > -		mask = vcreate_u16(comp_idx <
> > > > MLX5_VPMD_DESCS_PER_LOOP ?
> > > > -				   -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
> > > > -				   0);
> > > > -		invalid_mask = vorr_u16(invalid_mask, mask);
> > > > +		invalid_mask = vorr_u16(invalid_mask, comp_mask);
> > >
> > > Mmmm... I'm not sure we can drop the masking compressed (and
> > > following) CQE skip.
> > > Let's consider the completion scenario (the series of 4 CQEs, each
> > > element is 64B long)
> > >
> > > 0: normal uncompressed CQE, ownership OK, format uncompressed,
> > > opcode OK, no error
> > > 1: compressed CQE, ownership OK, format compressed, opcode OK, no
> > > error
> > > 2: miniCQE array, format can be any!!, may be discovered as
> > > ownership OK, format uncompressed, opcode OK, no error
> > > 3: miniCQE array, format can be any!!, may be discovered as
> > > ownership OK, format uncompressed, opcode OK, no error
> >
> > Thanks for your review and explanation about CQE processing details.
> > I did the change based on the fact that some calculations doesn't
> > change the data.
> > So some intermediate calculations were removed.
> >
> > In the above diff section, result of 'mask' always equals to the
> > nearest 'comp_mask' that above it.
> > So I just remoed 'mask' and use 'comp_mask' instead.
> > >
> > > Obviously, we should unconditionally mask out 2 and 3, regardless of
> > > recognized their formats/opcode/error/etc.
> > > I think we can get the diff above and skip diff below:
> > >
> > > >  		/* D.7 count non-compressed valid CQEs. */
> > > >  		n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
> > > >  				   invalid_mask), 0)) / (sizeof(uint16_t) * 8);
> > > >  		nocmp_n += n;
> > > > -		/* D.2 get the final invalid mask. */
> > > > -		mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
> > > > -				   -1UL >> (n * sizeof(uint16_t) * 8) : 0);
> > > > -		invalid_mask = vorr_u16(invalid_mask, mask);
> > >
> > > and get the correct final invalid_mask - all compressed and invalid
> > > CQEs and following ones will be masked out.
> >
> > This diff section is similar to the previous one.
> > 'mask' always equals to the nearest 'invalid_mask' that above it.
> > So entire line "invalid_mask = vorr_u16(invalid_mask, mask);" can be
> removed.
> >
> > Code logic is not changed. But I'm not sure the code change impacts
> > readability or maintainability that you may concern.
> >
> > Thanks.
> > >
> > > With best regards,
> > > Slava
  

Patch

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index 2234fbe6b2..98a75b09c6 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -768,18 +768,11 @@  rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 					  comp_mask), 0)) /
 					  (sizeof(uint16_t) * 8);
 		/* D.6 mask out entries after the compressed CQE. */
-		mask = vcreate_u16(comp_idx < MLX5_VPMD_DESCS_PER_LOOP ?
-				   -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
-				   0);
-		invalid_mask = vorr_u16(invalid_mask, mask);
+		invalid_mask = vorr_u16(invalid_mask, comp_mask);
 		/* D.7 count non-compressed valid CQEs. */
 		n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
 				   invalid_mask), 0)) / (sizeof(uint16_t) * 8);
 		nocmp_n += n;
-		/* D.2 get the final invalid mask. */
-		mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
-				   -1UL >> (n * sizeof(uint16_t) * 8) : 0);
-		invalid_mask = vorr_u16(invalid_mask, mask);
 		/* D.3 check error in opcode. */
 		opcode = vceq_u16(resp_err_check, opcode);
 		opcode = vbic_u16(opcode, invalid_mask);