[v5,4/8] net/ether: use bitops to speedup comparison

Message ID 20190624204435.29452-5-stephen@networkplumber.org (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series net/ether: enhancements and optimizations |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation fail Compilation issues

Commit Message

Stephen Hemminger June 24, 2019, 8:44 p.m. UTC
  Using bit operations like or and xor is faster than a loop
on all architectures. Really just explicit unrolling.

Similar cast to uint16 unaligned is already done in
other functions here.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
---
 lib/librte_net/rte_ether.h | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)
  

Comments

Olivier Matz July 2, 2019, 7:53 a.m. UTC | #1
Hi,

On Mon, Jun 24, 2019 at 01:44:31PM -0700, Stephen Hemminger wrote:
> Using bit operations like or and xor is faster than a loop
> on all architectures. Really just explicit unrolling.
> 
> Similar cast to uint16 unaligned is already done in
> other functions here.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
> ---
>  lib/librte_net/rte_ether.h | 17 +++++++----------
>  1 file changed, 7 insertions(+), 10 deletions(-)
> 
> diff --git a/lib/librte_net/rte_ether.h b/lib/librte_net/rte_ether.h
> index 8edc7e217b25..feb35a33c94b 100644
> --- a/lib/librte_net/rte_ether.h
> +++ b/lib/librte_net/rte_ether.h
> @@ -81,11 +81,10 @@ struct rte_ether_addr {
>  static inline int rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
>  				     const struct rte_ether_addr *ea2)
>  {
> -	int i;
> -	for (i = 0; i < RTE_ETHER_ADDR_LEN; i++)
> -		if (ea1->addr_bytes[i] != ea2->addr_bytes[i])
> -			return 0;
> -	return 1;
> +	const unaligned_uint16_t *w1 = (const uint16_t *)ea1;
> +	const unaligned_uint16_t *w2 = (const uint16_t *)ea2;
> +
> +	return ((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) | (w1[2] ^ w2[2])) == 0;
>  }
>  
>  /**
> @@ -100,11 +99,9 @@ static inline int rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
>   */
>  static inline int rte_is_zero_ether_addr(const struct rte_ether_addr *ea)
>  {
> -	int i;
> -	for (i = 0; i < RTE_ETHER_ADDR_LEN; i++)
> -		if (ea->addr_bytes[i] != 0x00)
> -			return 0;
> -	return 1;
> +	const unaligned_uint16_t *w = (const uint16_t *)ea;
> +
> +	return (w[0] | w[1] | w[2]) == 0;
>  }
>  
>  /**

I wonder if using memcmp() isn't faster with recent compilers (gcc >= 7).
I tried it quickly, and it seems the generated code is good (no call):
https://godbolt.org/z/9MOL7g

It would avoid the use of unaligned_uint16_t, and the next patch that
adds the alignment constraint.
  
Olivier Matz July 2, 2019, 9:26 a.m. UTC | #2
On Tue, Jul 02, 2019 at 09:53:14AM +0200, Olivier Matz wrote:
> Hi,
> 
> On Mon, Jun 24, 2019 at 01:44:31PM -0700, Stephen Hemminger wrote:
> > Using bit operations like or and xor is faster than a loop
> > on all architectures. Really just explicit unrolling.
> > 
> > Similar cast to uint16 unaligned is already done in
> > other functions here.
> > 
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
> > ---
> >  lib/librte_net/rte_ether.h | 17 +++++++----------
> >  1 file changed, 7 insertions(+), 10 deletions(-)
> > 
> > diff --git a/lib/librte_net/rte_ether.h b/lib/librte_net/rte_ether.h
> > index 8edc7e217b25..feb35a33c94b 100644
> > --- a/lib/librte_net/rte_ether.h
> > +++ b/lib/librte_net/rte_ether.h
> > @@ -81,11 +81,10 @@ struct rte_ether_addr {
> >  static inline int rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
> >  				     const struct rte_ether_addr *ea2)
> >  {
> > -	int i;
> > -	for (i = 0; i < RTE_ETHER_ADDR_LEN; i++)
> > -		if (ea1->addr_bytes[i] != ea2->addr_bytes[i])
> > -			return 0;
> > -	return 1;
> > +	const unaligned_uint16_t *w1 = (const uint16_t *)ea1;
> > +	const unaligned_uint16_t *w2 = (const uint16_t *)ea2;
> > +
> > +	return ((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) | (w1[2] ^ w2[2])) == 0;
> >  }
> >  
> >  /**
> > @@ -100,11 +99,9 @@ static inline int rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
> >   */
> >  static inline int rte_is_zero_ether_addr(const struct rte_ether_addr *ea)
> >  {
> > -	int i;
> > -	for (i = 0; i < RTE_ETHER_ADDR_LEN; i++)
> > -		if (ea->addr_bytes[i] != 0x00)
> > -			return 0;
> > -	return 1;
> > +	const unaligned_uint16_t *w = (const uint16_t *)ea;
> > +
> > +	return (w[0] | w[1] | w[2]) == 0;
> >  }
> >  
> >  /**
> 
> I wonder if using memcmp() isn't faster with recent compilers (gcc >= 7).
> I tried it quickly, and it seems the generated code is good (no call):
> https://godbolt.org/z/9MOL7g
> 
> It would avoid the use of unaligned_uint16_t, and the next patch that
> adds the alignment constraint.

As pointed out by Konstantin privately (I guess he wanted to do a
reply-all), the size of addr_bytes is wrong in my previous link (8
instead of 6). Thanks for catching it.

With 6, the gcc code is not as good: there is still no call to memcmp(),
but there are some jumps. With the latest clang, the generated code is
nice: https://godbolt.org/z/nfptnY
  
Stephen Hemminger July 2, 2019, 3:28 p.m. UTC | #3
On Tue, 2 Jul 2019 11:26:15 +0200
Olivier Matz <olivier.matz@6wind.com> wrote:

> On Tue, Jul 02, 2019 at 09:53:14AM +0200, Olivier Matz wrote:
> > Hi,
> > 
> > On Mon, Jun 24, 2019 at 01:44:31PM -0700, Stephen Hemminger wrote:  
> > > Using bit operations like or and xor is faster than a loop
> > > on all architectures. Really just explicit unrolling.
> > > 
> > > Similar cast to uint16 unaligned is already done in
> > > other functions here.
> > > 
> > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > > Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
> > > ---
> > >  lib/librte_net/rte_ether.h | 17 +++++++----------
> > >  1 file changed, 7 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/lib/librte_net/rte_ether.h b/lib/librte_net/rte_ether.h
> > > index 8edc7e217b25..feb35a33c94b 100644
> > > --- a/lib/librte_net/rte_ether.h
> > > +++ b/lib/librte_net/rte_ether.h
> > > @@ -81,11 +81,10 @@ struct rte_ether_addr {
> > >  static inline int rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
> > >  				     const struct rte_ether_addr *ea2)
> > >  {
> > > -	int i;
> > > -	for (i = 0; i < RTE_ETHER_ADDR_LEN; i++)
> > > -		if (ea1->addr_bytes[i] != ea2->addr_bytes[i])
> > > -			return 0;
> > > -	return 1;
> > > +	const unaligned_uint16_t *w1 = (const uint16_t *)ea1;
> > > +	const unaligned_uint16_t *w2 = (const uint16_t *)ea2;
> > > +
> > > +	return ((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) | (w1[2] ^ w2[2])) == 0;
> > >  }
> > >  
> > >  /**
> > > @@ -100,11 +99,9 @@ static inline int rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
> > >   */
> > >  static inline int rte_is_zero_ether_addr(const struct rte_ether_addr *ea)
> > >  {
> > > -	int i;
> > > -	for (i = 0; i < RTE_ETHER_ADDR_LEN; i++)
> > > -		if (ea->addr_bytes[i] != 0x00)
> > > -			return 0;
> > > -	return 1;
> > > +	const unaligned_uint16_t *w = (const uint16_t *)ea;
> > > +
> > > +	return (w[0] | w[1] | w[2]) == 0;
> > >  }
> > >  
> > >  /**  
> > 
> > I wonder if using memcmp() isn't faster with recent compilers (gcc >= 7).
> > I tried it quickly, and it seems the generated code is good (no call):
> > https://godbolt.org/z/9MOL7g
> > 
> > It would avoid the use of unaligned_uint16_t, and the next patch that
> > adds the alignment constraint.  
> 
> As pointed out by Konstantin privately (I guess he wanted to do a
> reply-all), the size of addr_bytes is wrong in my previous link (8
> instead of 6). Thanks for catching it.
> 
> With 6, the gcc code is not as good: there is still no call to memcmp(),
> but there are some jumps. With the latest clang, the generated code is
> nice: https://godbolt.org/z/nfptnY

gcc matters most for current DPDK users.
  

Patch

diff --git a/lib/librte_net/rte_ether.h b/lib/librte_net/rte_ether.h
index 8edc7e217b25..feb35a33c94b 100644
--- a/lib/librte_net/rte_ether.h
+++ b/lib/librte_net/rte_ether.h
@@ -81,11 +81,10 @@  struct rte_ether_addr {
 static inline int rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
 				     const struct rte_ether_addr *ea2)
 {
-	int i;
-	for (i = 0; i < RTE_ETHER_ADDR_LEN; i++)
-		if (ea1->addr_bytes[i] != ea2->addr_bytes[i])
-			return 0;
-	return 1;
+	const unaligned_uint16_t *w1 = (const uint16_t *)ea1;
+	const unaligned_uint16_t *w2 = (const uint16_t *)ea2;
+
+	return ((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) | (w1[2] ^ w2[2])) == 0;
 }
 
 /**
@@ -100,11 +99,9 @@  static inline int rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
  */
 static inline int rte_is_zero_ether_addr(const struct rte_ether_addr *ea)
 {
-	int i;
-	for (i = 0; i < RTE_ETHER_ADDR_LEN; i++)
-		if (ea->addr_bytes[i] != 0x00)
-			return 0;
-	return 1;
+	const unaligned_uint16_t *w = (const uint16_t *)ea;
+
+	return (w[0] | w[1] | w[2]) == 0;
 }
 
 /**