[v3,5/6] spinlock: reimplement with atomic one-way barrier builtins

Message ID 20181227041349.3058-6-gavin.hu@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series spinlock optimization and test case enhancements

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Gavin Hu Dec. 27, 2018, 4:13 a.m. UTC
  The __sync builtin based implementation generates full memory barriers
('dmb ish') on Arm platforms. Use C11 atomic builtins instead to generate
one-way barriers.

Here is the assembly code of the __sync_bool_compare_and_swap builtin.
__sync_bool_compare_and_swap(dst, exp, src);
   0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
   0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
   0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
   0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
   0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
   0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
   0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
<rte_atomic16_cmpset+52>  // b.any
   0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
   0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
<rte_atomic16_cmpset+32>
   0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
   0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq = none
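
For comparison, a minimal sketch of the acquire-ordered CAS used instead
(illustrative only; the codegen note assumes GCC __atomic builtins on
armv8.0 without LSE atomics, and exact instructions vary with compiler
and -march):

   #include <stdint.h>

   static inline int
   cas_acquire(uint32_t *dst, uint32_t exp, uint32_t src)
   {
           /* Acquire ordering on success, relaxed on failure: on
            * armv8.0 this compiles to an ldaxr/stxr loop, i.e. a
            * one-way load-acquire barrier, with no trailing full
            * 'dmb ish' barrier as in the __sync version above. */
           return __atomic_compare_exchange_n(dst, &exp, src,
                           0 /* strong */,
                           __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
   }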

The benchmarking results showed a 3X performance gain on Cavium ThunderX2,
13% on Qualcomm Falkor and 3.7% on the 4-core A72 Marvell MACCHIATObin.
Here is an example test result on TX2:

*** spinlock_autotest without this patch ***
Core [123] Cost Time = 639822 us
Core [124] Cost Time = 633253 us
Core [125] Cost Time = 646030 us
Core [126] Cost Time = 643189 us
Core [127] Cost Time = 647039 us
Total Cost Time = 95433298 us

*** spinlock_autotest with this patch ***
Core [123] Cost Time = 163615 us
Core [124] Cost Time = 166471 us
Core [125] Cost Time = 189044 us
Core [126] Cost Time = 195745 us
Core [127] Cost Time = 78423 us
Total Cost Time = 27339656 us

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Steve Capper <Steve.Capper@arm.com>
---
 lib/librte_eal/common/include/generic/rte_spinlock.h | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)
  

Comments

Jerin Jacob Kollanukkaran Dec. 27, 2018, 7:42 a.m. UTC | #1
On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
-------------------------------------------------------------------
> ---
> The __sync builtin based implementation generates full memory
> barriers
> ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> one way
> barriers.
> 
> Here is the assembly code of __sync_compare_and_swap builtin.
> __sync_bool_compare_and_swap(dst, exp, src);
>    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
>    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
>    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
>    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
>    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
>    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
>    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> <rte_atomic16_cmpset+52>  // b.any
>    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
>    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> <rte_atomic16_cmpset+32>
>    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
>    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> none
> 
> The benchmarking results showed 3X performance gain on Cavium
> ThunderX2 and
> 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> Here is the example test result on TX2:
> 
> *** spinlock_autotest without this patch ***
> Core [123] Cost Time = 639822 us
> Core [124] Cost Time = 633253 us
> Core [125] Cost Time = 646030 us
> Core [126] Cost Time = 643189 us
> Core [127] Cost Time = 647039 us
> Total Cost Time = 95433298 us
> 
> *** spinlock_autotest with this patch ***
> Core [123] Cost Time = 163615 us
> Core [124] Cost Time = 166471 us
> Core [125] Cost Time = 189044 us
> Core [126] Cost Time = 195745 us
> Core [127] Cost Time = 78423 us
> Total Cost Time = 27339656 us
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> ---
>  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> b/lib/librte_eal/common/include/generic/rte_spinlock.h
> index c4c3fc31e..87ae7a4f1 100644
> --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
>  static inline void
>  rte_spinlock_lock(rte_spinlock_t *sl)
>  {
> -	while (__sync_lock_test_and_set(&sl->locked, 1))
> -		while(sl->locked)
> +	int exp = 0;
> +
> +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {

How about removing the explicit exp = 0 and changing to
__atomic_test_and_set(flag, __ATOMIC_ACQUIRE);

i.e.
while (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE))



> +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
>  			rte_pause();
> +		exp = 0;

We can remove exp = 0 with the above scheme.

> +	}
>  }
>  #endif
>  
> @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
>  static inline void
>  rte_spinlock_unlock (rte_spinlock_t *sl)
>  {
> -	__sync_lock_release(&sl->locked);
> +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
 }
>  #endif
>  
> @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
>  static inline int
>  rte_spinlock_trylock (rte_spinlock_t *sl)
>  {
> -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> +	int exp = 0;
> +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> +				0, /* disallow spurious failure */
> +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);

Same here, to remove the explicit exp.

return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)


>  }
>  #endif
>  
> @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
>   */
>  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
>  {
> -	return sl->locked;
> +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);

__ATOMIC_RELAXED would be enough here. Right ?


>  }
>  
>  /**
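
i.e., sketched in full (hypothetical; assumes sl->locked can be treated
as a flag operand -- see the type discussion in the replies below):

   static inline void
   rte_spinlock_lock(rte_spinlock_t *sl)
   {
           /* __atomic_test_and_set() returns the previous value:
            * nonzero means the lock was already held. */
           while (__atomic_test_and_set(&sl->locked, __ATOMIC_ACQUIRE))
                   while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
                           rte_pause();
   }

   static inline int
   rte_spinlock_trylock(rte_spinlock_t *sl)
   {
           return __atomic_test_and_set(&sl->locked, __ATOMIC_ACQUIRE) == 0;
   }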
  
Gavin Hu Dec. 27, 2018, 9:02 a.m. UTC | #2
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Thursday, December 27, 2018 3:42 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way
> barrier builtins
> 
> On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> -------------------------------------------------------------------
> > ---
> > The __sync builtin based implementation generates full memory
> > barriers
> > ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> > one way
> > barriers.
> >
> > Here is the assembly code of __sync_compare_and_swap builtin.
> > __sync_bool_compare_and_swap(dst, exp, src);
> >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > <rte_atomic16_cmpset+52>  // b.any
> >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > <rte_atomic16_cmpset+32>
> >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > none
> >
> > The benchmarking results showed 3X performance gain on Cavium
> > ThunderX2 and
> > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > Here is the example test result on TX2:
> >
> > *** spinlock_autotest without this patch ***
> > Core [123] Cost Time = 639822 us
> > Core [124] Cost Time = 633253 us
> > Core [125] Cost Time = 646030 us
> > Core [126] Cost Time = 643189 us
> > Core [127] Cost Time = 647039 us
> > Total Cost Time = 95433298 us
> >
> > *** spinlock_autotest with this patch ***
> > Core [123] Cost Time = 163615 us
> > Core [124] Cost Time = 166471 us
> > Core [125] Cost Time = 189044 us
> > Core [126] Cost Time = 195745 us
> > Core [127] Cost Time = 78423 us
> > Total Cost Time = 27339656 us
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > ---
> >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > index c4c3fc31e..87ae7a4f1 100644
> > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_lock(rte_spinlock_t *sl)
> >  {
> > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > -		while(sl->locked)
> > +	int exp = 0;
> > +
> > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> {
> 
> How about remove explict exp = 0 and change to
> __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> 
> i.e
> while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> 
Will do it in v4.
> 
> > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> >  			rte_pause();
> > +		exp = 0;
> 
> We can remove exp = 0 with above scheme.
> 
> > +	}
> >  }
> >  #endif
> >
> > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_unlock (rte_spinlock_t *sl)
> >  {
> > -	__sync_lock_release(&sl->locked);
> > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
>  }
> >  #endif
> >
> > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> >  static inline int
> >  rte_spinlock_trylock (rte_spinlock_t *sl)
> >  {
> > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > +	int exp = 0;
> > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > +				0, /* disallow spurious failure */
> > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> 
> Here to remove explicit exp.
> 
> return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)

Will do it in v4.

> >  }
> >  #endif
> >
> > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> >   */
> >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
> >  {
> > -	return sl->locked;
> > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> 
> __ATOMIC_RELAXED would be enough here. Right ?
Yes, it is enough for current DPDK uses, where it is used only for testing and assertions.

For general applicability, we chose acquire out of concern that the API might be used to read protected data while the lock is not taken by anybody.
In that use case, acquire will properly see all updates from before the lock was released, but it is still dangerous, as someone else might take the lock and change the data in the meantime.

Anyway, I will set relaxed in v4, as the above scenario is neither recommended nor present in DPDK.

> 
> >  }
> >
> >  /**
  
Honnappa Nagarahalli Jan. 3, 2019, 8:35 p.m. UTC | #3
> >
> > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > -------------------------------------------------------------------
> > > ---
> > > The __sync builtin based implementation generates full memory
> > > barriers ('dmb ish') on Arm platforms. Using C11 atomic builtins to
> > > generate one way barriers.
> > >
> > > Here is the assembly code of __sync_compare_and_swap builtin.
> > > __sync_bool_compare_and_swap(dst, exp, src);
> > >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> > >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> > >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> > >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> > >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> > >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> > >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > > <rte_atomic16_cmpset+52>  // b.any
> > >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> > >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > > <rte_atomic16_cmpset+32>
> > >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> > >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > > none
> > >
> > > The benchmarking results showed 3X performance gain on Cavium
> > > ThunderX2 and
> > > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
                                            ^^^^^^
Typo, should be Falkor

> > > Here is the example test result on TX2:
> > >
> > > *** spinlock_autotest without this patch *** Core [123] Cost Time =
> > > 639822 us Core [124] Cost Time = 633253 us Core [125] Cost Time =
> > > 646030 us Core [126] Cost Time = 643189 us Core [127] Cost Time =
> > > 647039 us Total Cost Time = 95433298 us
> > >
> > > *** spinlock_autotest with this patch *** Core [123] Cost Time =
> > > 163615 us Core [124] Cost Time = 166471 us Core [125] Cost Time =
> > > 189044 us Core [126] Cost Time = 195745 us Core [127] Cost Time =
> > > 78423 us Total Cost Time = 27339656 us
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > > ---
> > >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > > +++++++++++++-----
> > >  1 file changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > index c4c3fc31e..87ae7a4f1 100644
> > > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);  static
> > > inline void  rte_spinlock_lock(rte_spinlock_t *sl)  {
> > > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > > -		while(sl->locked)
> > > +	int exp = 0;
> > > +
> > > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> > {
> >
> > How about remove explict exp = 0 and change to
> > __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> >
> > i.e
> > while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> >
> Will do it in v4.
The operand for '__atomic_test_and_set' needs to be a bool or char. This means sl->locked needs to be changed to bool or char. But the API 'rte_spinlock_lock_tm' also uses sl->locked, so its requirements need to be considered as well.

> >
> > > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> > >  			rte_pause();
> > > +		exp = 0;
> >
> > We can remove exp = 0 with above scheme.
> >
> > > +	}
> > >  }
> > >  #endif
> > >
> > > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);  static
> > > inline void  rte_spinlock_unlock (rte_spinlock_t *sl)  {
> > > -	__sync_lock_release(&sl->locked);
> > > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> >  }
> > >  #endif
> > >
> > > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> > > static inline int  rte_spinlock_trylock (rte_spinlock_t *sl)  {
> > > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > > +	int exp = 0;
> > > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > > +				0, /* disallow spurious failure */
> > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> >
> > Here to remove explicit exp.
> >
> > return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)
> 
> Will do it in v4.
> 
> > >  }
> > >  #endif
> > >
> > > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> > >   */
> > >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)  {
> > > -	return sl->locked;
> > > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> >
> > __ATOMIC_RELAXED would be enough here. Right ?
> Yes, it is enough for current DPDK uses, used for testing and assertions only.
> 
> For general applicability, we set acquire as concerned about it is used for
> reading protected data while the lock is not taken by anybody.
> In this use case, Acquire will properly see all updates from before the lock
> was released, but this is still dangerous, as during the course, someone else
> might have taken the lock and changed the data.
> 
> Anyway, I will set Relaxed in v4 as the above use scenario was not
> recommended and not present in DPDK.
IMO, once the API is provided, the application can make use of it in any way it wants. One use case I can think of is as follows:

if (rte_spinlock_is_locked(&sl) == true) {
	access_shared_data();
}

access_shared_data() can get hoisted/executed speculatively before the 'if' statement. This needs to be prevented, hence we need 'acquire' semantics.
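
A sketch of the difference (illustrative; access_shared_data() and sl are
the placeholders from the example above):

   /* Relaxed load: the compiler or CPU may hoist later memory
    * accesses above it, so shared data can be read before the
    * lock state is actually observed. */
   if (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED) == 1)
           access_shared_data();   /* may execute speculatively */

   /* Acquire load: a one-way barrier; no subsequent load/store can
    * be reordered before it, so the shared-data accesses are
    * ordered after the lock check. */
   if (__atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE) == 1)
           access_shared_data();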

> 
> >
> > >  }
> > >
> > >  /**
  
Gavin Hu Jan. 11, 2019, 1:52 p.m. UTC | #4
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Thursday, December 27, 2018 3:42 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> dev@dpdk.org
> Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-
> way barrier builtins
> 
> On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> -------------------------------------------------------------------
> > ---
> > The __sync builtin based implementation generates full memory
> > barriers
> > ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> > one way
> > barriers.
> >
> > Here is the assembly code of __sync_compare_and_swap builtin.
> > __sync_bool_compare_and_swap(dst, exp, src);
> >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > <rte_atomic16_cmpset+52>  // b.any
> >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > <rte_atomic16_cmpset+32>
> >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > none
> >
> > The benchmarking results showed 3X performance gain on Cavium
> > ThunderX2 and
> > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > Here is the example test result on TX2:
> >
> > *** spinlock_autotest without this patch ***
> > Core [123] Cost Time = 639822 us
> > Core [124] Cost Time = 633253 us
> > Core [125] Cost Time = 646030 us
> > Core [126] Cost Time = 643189 us
> > Core [127] Cost Time = 647039 us
> > Total Cost Time = 95433298 us
> >
> > *** spinlock_autotest with this patch ***
> > Core [123] Cost Time = 163615 us
> > Core [124] Cost Time = 166471 us
> > Core [125] Cost Time = 189044 us
> > Core [126] Cost Time = 195745 us
> > Core [127] Cost Time = 78423 us
> > Total Cost Time = 27339656 us
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > ---
> >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > index c4c3fc31e..87ae7a4f1 100644
> > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_lock(rte_spinlock_t *sl)
> >  {
> > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > -		while(sl->locked)
> > +	int exp = 0;
> > +
> > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> {
> 
> How about remove explict exp = 0 and change to
> __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);

Yes, __atomic_test_and_set means simpler and better code, but it takes as its first argument a pointer to type bool or char, while in our case sl->locked is of type uint32.
We can force it to uint8, or just pass in the 32-bit pointer; only one byte/bit is really used in this case. Is that OK?

"It should be only used for operands of type bool or char. For other types only part of the value may be set."
https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-Builtins.html

From a performance perspective, in our testing __atomic_test_and_set performed very close to the __atomic compare-exchange version.
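
To illustrate the concern (illustrative sketch; the exact value written
for non-bool/char operands is implementation-defined, per the GCC manual
quoted above):

   uint32_t locked = 0;

   /* Well-defined only for bool/char operands. On a uint32_t the
    * builtin may write only part of the value (e.g. the low byte),
    * so 'locked' becomes nonzero but not necessarily 1. */
   __atomic_test_and_set(&locked, __ATOMIC_ACQUIRE);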

> 
> i.e
> while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> 
> 
> 
> > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> >  			rte_pause();
> > +		exp = 0;
> 
> We can remove exp = 0 with above scheme.
> 
> > +	}
> >  }
> >  #endif
> >
> > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_unlock (rte_spinlock_t *sl)
> >  {
> > -	__sync_lock_release(&sl->locked);
> > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
>  }
> >  #endif
> >
> > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> >  static inline int
> >  rte_spinlock_trylock (rte_spinlock_t *sl)
> >  {
> > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > +	int exp = 0;
> > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > +				0, /* disallow spurious failure */
> > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> 
> Here to remove explicit exp.
> 
> return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)
> 
> 
> >  }
> >  #endif
> >
> > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> >   */
> >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
> >  {
> > -	return sl->locked;
> > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> 
> __ATOMIC_RELAXED would be enough here. Right ?
> 
> 
> >  }
> >
> >  /**
  
Honnappa Nagarahalli Jan. 14, 2019, 5:54 a.m. UTC | #5
> >
> > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > -------------------------------------------------------------------
> > > ---
> > > The __sync builtin based implementation generates full memory
> > > barriers ('dmb ish') on Arm platforms. Using C11 atomic builtins to
> > > generate one way barriers.
> > >
> > > Here is the assembly code of __sync_compare_and_swap builtin.
> > > __sync_bool_compare_and_swap(dst, exp, src);
> > >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> > >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> > >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> > >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> > >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> > >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> > >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > > <rte_atomic16_cmpset+52>  // b.any
> > >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> > >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > > <rte_atomic16_cmpset+32>
> > >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> > >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > > none
> > >
> > > The benchmarking results showed 3X performance gain on Cavium
> > > ThunderX2 and
> > > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > > Here is the example test result on TX2:
> > >
> > > *** spinlock_autotest without this patch *** Core [123] Cost Time =
> > > 639822 us Core [124] Cost Time = 633253 us Core [125] Cost Time =
> > > 646030 us Core [126] Cost Time = 643189 us Core [127] Cost Time =
> > > 647039 us Total Cost Time = 95433298 us
> > >
> > > *** spinlock_autotest with this patch *** Core [123] Cost Time =
> > > 163615 us Core [124] Cost Time = 166471 us Core [125] Cost Time =
> > > 189044 us Core [126] Cost Time = 195745 us Core [127] Cost Time =
> > > 78423 us Total Cost Time = 27339656 us
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > > ---
> > >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > > +++++++++++++-----
> > >  1 file changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > index c4c3fc31e..87ae7a4f1 100644
> > > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);  static
> > > inline void  rte_spinlock_lock(rte_spinlock_t *sl)  {
> > > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > > -		while(sl->locked)
> > > +	int exp = 0;
> > > +
> > > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> > {
> >
> > How about remove explict exp = 0 and change to
> > __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> 
> Yes, __atomic_test_and_set means simpler code and better, but
> __atomic_test_and_set takes the first argument as a pointer to type bool or
> char, in our case, sl->locked is of type uint32.
> We can force it to uint8, or just pass in the 32bit pointer, only one byte/bit is
> really used in this case, is that ok?
> 
> "It should be only used for operands of type bool or char. For other types only
> part of the value may be set."
> https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-
> Builtins.html
> 
> From performance perspective, in our testing, the performance was very close,
> compared to __atomic.
If performance is close, I suggest we go with the existing patch. Changing sl->locked to bool/char would be an ABI change and would affect the x86 TM-based implementation as well.
Jerin, what do you think?

> 
> >
> > i.e
> > while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> >
> >
> >
> > > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> > >  			rte_pause();
> > > +		exp = 0;
> >
> > We can remove exp = 0 with above scheme.
> >
> > > +	}
> > >  }
> > >  #endif
> > >
> > > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);  static
> > > inline void  rte_spinlock_unlock (rte_spinlock_t *sl)  {
> > > -	__sync_lock_release(&sl->locked);
> > > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> >  }
> > >  #endif
> > >
> > > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> > > static inline int  rte_spinlock_trylock (rte_spinlock_t *sl)  {
> > > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > > +	int exp = 0;
> > > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > > +				0, /* disallow spurious failure */
> > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> >
> > Here to remove explicit exp.
> >
> > return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)
> >
> >
> > >  }
> > >  #endif
> > >
> > > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> > >   */
> > >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)  {
> > > -	return sl->locked;
> > > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> >
> > __ATOMIC_RELAXED would be enough here. Right ?
> >
> >
> > >  }
> > >
> > >  /**
  
Jerin Jacob Kollanukkaran Jan. 14, 2019, 7:39 a.m. UTC | #6
On Mon, 2019-01-14 at 05:54 +0000, Honnappa Nagarahalli wrote:
> > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > ---------------------------------------------------------------
> > > ----
> > > > ---
> > > > The __sync builtin based implementation generates full memory
> > > > barriers ('dmb ish') on Arm platforms. Using C11 atomic
> > > > builtins to
> > > > generate one way barriers.
> > > > 
> > > > Here is the assembly code of __sync_compare_and_swap builtin.
> > > > __sync_bool_compare_and_swap(dst, exp, src);
> > > >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> > > >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp,
> > > > #6]
> > > >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp,
> > > > #4]
> > > >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> > > >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> > > >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> > > >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > > > <rte_atomic16_cmpset+52>  // b.any
> > > >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2,
> > > > [x0]
> > > >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4,
> > > > 0x90f1c0
> > > > <rte_atomic16_cmpset+32>
> > > >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> > > >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  //
> > > > eq =
> > > > none
> > > > 
> > > > The benchmarking results showed 3X performance gain on Cavium
> > > > ThunderX2 and
> > > > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > > > Here is the example test result on TX2:
> > > > 
> > > > *** spinlock_autotest without this patch *** Core [123] Cost
> > > > Time =
> > > > 639822 us Core [124] Cost Time = 633253 us Core [125] Cost Time
> > > > =
> > > > 646030 us Core [126] Cost Time = 643189 us Core [127] Cost Time
> > > > =
> > > > 647039 us Total Cost Time = 95433298 us
> > > > 
> > > > *** spinlock_autotest with this patch *** Core [123] Cost Time
> > > > =
> > > > 163615 us Core [124] Cost Time = 166471 us Core [125] Cost Time
> > > > =
> > > > 189044 us Core [126] Cost Time = 195745 us Core [127] Cost Time
> > > > =
> > > > 78423 us Total Cost Time = 27339656 us
> > > > 
> > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com
> > > > >
> > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > > > ---
> > > >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > > > +++++++++++++-----
> > > >  1 file changed, 13 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git
> > > > a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > index c4c3fc31e..87ae7a4f1 100644
> > > > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t
> > > > *sl);  static
> > > > inline void  rte_spinlock_lock(rte_spinlock_t *sl)  {
> > > > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > > > -		while(sl->locked)
> > > > +	int exp = 0;
> > > > +
> > > > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp,
> > > > 1, 0,
> > > > +				__ATOMIC_ACQUIRE,
> > > > __ATOMIC_RELAXED))
> > > {
> > > 
> > > How about remove explict exp = 0 and change to
> > > __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> > 
> > Yes, __atomic_test_and_set means simpler code and better, but
> > __atomic_test_and_set takes the first argument as a pointer to type
> > bool or
> > char, in our case, sl->locked is of type uint32.
> > We can force it to uint8, or just pass in the 32bit pointer, only
> > one byte/bit is
> > really used in this case, is that ok?
> > 
> > "It should be only used for operands of type bool or char. For
> > other types only
> > part of the value may be set."
> > https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-
> > Builtins.html
> > 
> > From performance perspective, in our testing, the performance was
> > very close,
> > compared to __atomic.
> If performance is close, I suggest we go with the existing patch.
> Changing sl->locked to bool/char would be an ABI change and will
> affect x86 TM based implementation as well.
> Jerin, what do you think?

Looks good to me.
  
Gavin Hu Jan. 14, 2019, 7:57 a.m. UTC | #7
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Monday, January 14, 2019 1:55 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> jerinj@marvell.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> hemant.agrawal@nxp.com; stephen@networkplumber.org; Joyce Kong (Arm
> Technology China) <Joyce.Kong@arm.com>; nd <nd@arm.com>
> Subject: RE: [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-
> way barrier builtins
> 
> > >
> > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > -------------------------------------------------------------------
> > > > ---
> > > > The __sync builtin based implementation generates full memory
> > > > barriers ('dmb ish') on Arm platforms. Using C11 atomic builtins to
> > > > generate one way barriers.
> > > >
> > > > Here is the assembly code of __sync_compare_and_swap builtin.
> > > > __sync_bool_compare_and_swap(dst, exp, src);
> > > >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> > > >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> > > >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> > > >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> > > >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> > > >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> > > >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > > > <rte_atomic16_cmpset+52>  // b.any
> > > >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> > > >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > > > <rte_atomic16_cmpset+32>
> > > >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> > > >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > > > none
> > > >
> > > > The benchmarking results showed 3X performance gain on Cavium
> > > > ThunderX2 and
> > > > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > > > Here is the example test result on TX2:
> > > >
> > > > *** spinlock_autotest without this patch *** Core [123] Cost Time =
> > > > 639822 us Core [124] Cost Time = 633253 us Core [125] Cost Time =
> > > > 646030 us Core [126] Cost Time = 643189 us Core [127] Cost Time =
> > > > 647039 us Total Cost Time = 95433298 us
> > > >
> > > > *** spinlock_autotest with this patch *** Core [123] Cost Time =
> > > > 163615 us Core [124] Cost Time = 166471 us Core [125] Cost Time =
> > > > 189044 us Core [126] Cost Time = 195745 us Core [127] Cost Time =
> > > > 78423 us Total Cost Time = 27339656 us
> > > >
> > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > > Reviewed-by: Steve Capper <Steve.Capper@arm.com>
> > > > ---
> > > >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > > > +++++++++++++-----
> > > >  1 file changed, 13 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > index c4c3fc31e..87ae7a4f1 100644
> > > > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > > > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);  static
> > > > inline void  rte_spinlock_lock(rte_spinlock_t *sl)  {
> > > > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > > > -		while(sl->locked)
> > > > +	int exp = 0;
> > > > +
> > > > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> > > {
> > >
> > > How about remove explict exp = 0 and change to
> > > __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> >
> > Yes, __atomic_test_and_set means simpler code and better, but
> > __atomic_test_and_set takes the first argument as a pointer to type bool
> or
> > char, in our case, sl->locked is of type uint32.
> > We can force it to uint8, or just pass in the 32bit pointer, only one byte/bit
> is
> > really used in this case, is that ok?
> >
> > "It should be only used for operands of type bool or char. For other types
> only
> > part of the value may be set."
> > https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-
> > Builtins.html
> >
> > From performance perspective, in our testing, the performance was very
> close,
> > compared to __atomic.
> If performance is close, I suggest we go with the existing patch. Changing sl-
> >locked to bool/char would be an ABI change and will affect x86 TM based
> implementation as well.
> Jerin, what do you think?

I have benchmarked on Qualcomm Falkor, ThunderX2 and Cortex-A72.
Compared to the existing patch, the new patch using __atomic_test_and_set gained 60% performance on Qualcomm Falkor, but degraded 13% on the 4-core A72 and, even worse, degraded 10x on ThunderX2.
I am not sure why ThunderX2 degraded so much; maybe it was caused by too many cores (128 cores) with high contention?

> 
> >
> > >
> > > i.e
> > > while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> > >
> > >
> > >
> > > > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> > > >  			rte_pause();
> > > > +		exp = 0;
> > >
> > > We can remove exp = 0 with above scheme.
> > >
> > > > +	}
> > > >  }
> > > >  #endif
> > > >
> > > > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);  static
> > > > inline void  rte_spinlock_unlock (rte_spinlock_t *sl)  {
> > > > -	__sync_lock_release(&sl->locked);
> > > > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> > >  }
> > > >  #endif
> > > >
> > > > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> > > > static inline int  rte_spinlock_trylock (rte_spinlock_t *sl)  {
> > > > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > > > +	int exp = 0;
> > > > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > > > +				0, /* disallow spurious failure */
> > > > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> > >
> > > Here to remove explicit exp.
> > >
> > > return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)
> > >
> > >
> > > >  }
> > > >  #endif
> > > >
> > > > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> > > >   */
> > > >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)  {
> > > > -	return sl->locked;
> > > > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> > >
> > > __ATOMIC_RELAXED would be enough here. Right ?
> > >
> > >
> > > >  }
> > > >
> > > >  /**
  
Gavin Hu Jan. 14, 2019, 5:08 p.m. UTC | #8
> > > > > *sl);  static
> > > > > inline void  rte_spinlock_lock(rte_spinlock_t *sl)  {
> > > > > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > > > > -		while(sl->locked)
> > > > > +	int exp = 0;
> > > > > +
> > > > > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp,
> > > > > 1, 0,
> > > > > +				__ATOMIC_ACQUIRE,
> > > > > __ATOMIC_RELAXED))
> > > > {
> > > >
> > > > How about remove explict exp = 0 and change to
> > > > __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> > >
> > > Yes, __atomic_test_and_set means simpler code and better, but
> > > __atomic_test_and_set takes the first argument as a pointer to type
> > > bool or
> > > char, in our case, sl->locked is of type uint32.
> > > We can force it to uint8, or just pass in the 32bit pointer, only
> > > one byte/bit is
> > > really used in this case, is that ok?
> > >
> > > "It should be only used for operands of type bool or char. For
> > > other types only
> > > part of the value may be set."
> > > https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/_005f_005fatomic-
> > > Builtins.html
> > >
> > > From performance perspective, in our testing, the performance was
> > > very close,
> > > compared to __atomic.
> > If performance is close, I suggest we go with the existing patch.
> > Changing sl->locked to bool/char would be an ABI change and will
> > affect x86 TM based implementation as well.
> > Jerin, what do you think?
> 
> Looks good to me.
> 
I tested an __atomic_test_and_set based patch (I did not push this; it is in internal review); it caused a 10x performance degradation on ThunderX2.
In the internal review, Ola Liljedahl's comment explained the degradation well:
"Unlike the old code, the new code will write the lock location (and thus get exclusive access to the whole cache line) repeatedly while it is busy waiting. This is bad for the lock owner if it is accessing other data located in the same cache line as the lock (which is often the case)."
The real difference is that __atomic_test_and_set keeps writing the lock location (whether it is locked or not), monopolizing the cache line; this is bad for the lock owner and the other contenders.
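
To make the difference concrete, here are the two spin patterns side by
side (illustrative sketch only):

   /* Pure test-and-set spin: every iteration is a store, so each
    * waiter repeatedly claims the cache line in exclusive state,
    * stealing it from the lock holder and the other waiters. */
   while (__atomic_test_and_set(&sl->locked, __ATOMIC_ACQUIRE))
           rte_pause();

   /* Test-and-test-and-set (this patch): after a failed CAS, spin
    * on a plain load, which keeps the cache line in shared state
    * until the holder releases the lock. */
   int exp = 0;
   while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
                           __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
           while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
                   rte_pause();
           exp = 0;
   }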
  

Patch

diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index c4c3fc31e..87ae7a4f1 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -61,9 +61,14 @@  rte_spinlock_lock(rte_spinlock_t *sl);
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-	while (__sync_lock_test_and_set(&sl->locked, 1))
-		while(sl->locked)
+	int exp = 0;
+
+	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
+				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
 			rte_pause();
+		exp = 0;
+	}
 }
 #endif
 
@@ -80,7 +85,7 @@  rte_spinlock_unlock (rte_spinlock_t *sl);
 static inline void
 rte_spinlock_unlock (rte_spinlock_t *sl)
 {
-	__sync_lock_release(&sl->locked);
+	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
 }
 #endif
 
@@ -99,7 +104,10 @@  rte_spinlock_trylock (rte_spinlock_t *sl);
 static inline int
 rte_spinlock_trylock (rte_spinlock_t *sl)
 {
-	return __sync_lock_test_and_set(&sl->locked,1) == 0;
+	int exp = 0;
+	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
+				0, /* disallow spurious failure */
+				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
 }
 #endif
 
@@ -113,7 +121,7 @@  rte_spinlock_trylock (rte_spinlock_t *sl)
  */
 static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
 {
-	return sl->locked;
+	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
 }
 
 /**