[v3,6/6] spinlock: ticket based to improve fairness

Message ID 20181227041349.3058-7-gavin.hu@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series spinlock optimization and test case enhancements

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Gavin Hu Dec. 27, 2018, 4:13 a.m. UTC
  From: Joyce Kong <joyce.kong@arm.com>

The old implementation is unfair: some threads may take locks aggressively
while leaving other threads starving for a long time. As shown in the
following test, within the same period of time some threads take locks
many more times than others.

The new implementation gives each waiting thread a ticket, and the threads
take the lock one by one, first come, first served. This avoids prolonged
starvation and makes lock acquisition more predictable.

*** spinlock_autotest without this patch ***
Core [0] count = 89
Core [1] count = 84
Core [2] count = 94
...
Core [208] count = 171
Core [209] count = 152
Core [210] count = 161
Core [211] count = 187

*** spinlock_autotest with this patch ***
Core [0] count = 534
Core [1] count = 533
Core [2] count = 542
...
Core [208] count = 554
Core [209] count = 556
Core [210] count = 555
Core [211] count = 551

The overall spinlock fairness increased on ThunderX2.
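
The core of the mechanism, taken from the generic implementation in the
diff below:

static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
	/* Take the next ticket... */
	uint16_t me = __atomic_fetch_add(&sl->s.next, 1, __ATOMIC_RELAXED);

	/* ...and spin until that ticket is the one being served. */
	while (__atomic_load_n(&sl->s.current, __ATOMIC_ACQUIRE) != me)
		rte_pause();
}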

Signed-off-by: Joyce Kong <joyce.kong@arm.com>
---
 .../common/include/arch/ppc_64/rte_spinlock.h      |  5 ++
 .../common/include/arch/x86/rte_spinlock.h         |  6 +++
 .../common/include/generic/rte_spinlock.h          | 53 +++++++++++++---------
 3 files changed, 42 insertions(+), 22 deletions(-)
  

Comments

Jerin Jacob Kollanukkaran Dec. 27, 2018, 6:58 a.m. UTC | #1
On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> ----------------------------------------------------------------------
> From: Joyce Kong <joyce.kong@arm.com>
> 
> The old implementation is unfair, some threads may take locks
> aggressively

I think one issue here is that x86 and ppc follow the traditional spinlock
while arm64 would follow the ticket lock for its spinlock implementation.
This would change application behaviour on arm64 compared to x86 and ppc.

How about having a separate API for ticket lock? That would give:
# the application a choice of locking strategy
# the same application behaviour across all arch.

The initial ticket lock implementation can be generic with C11 memory
primitives; later, each arch can optimize it if required.

> while leaving the other threads starving for long time. As shown in
> the
> following test, within same period of time, there are threads taking
> locks
> much more times than the others.
> 
>  
>  #ifdef RTE_FORCE_INTRINSICS
>  static inline void
> -rte_spinlock_unlock (rte_spinlock_t *sl)
> +rte_spinlock_unlock(rte_spinlock_t *sl)
>  {
> -	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> +	uint16_t i = __atomic_load_n(&sl->s.current, __ATOMIC_RELAXED);
> +	i++;
> +	__atomic_store_n(&sl->s.current, i, __ATOMIC_RELAXED);

Shouldn't we use __ATOMIC_RELEASE here to pair with lock() ?


>  }
>  #endif
>  
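
A release store there would pair with the acquire load that lock() spins
on, making the lock holder's critical-section writes visible to the next
owner. A minimal sketch of such an unlock, assuming the s.current/s.next
layout from this patch:

static inline void
rte_spinlock_unlock(rte_spinlock_t *sl)
{
	uint16_t i = __atomic_load_n(&sl->s.current, __ATOMIC_RELAXED);

	/* The release store pairs with the __ATOMIC_ACQUIRE load in lock(). */
	__atomic_store_n(&sl->s.current, i + 1, __ATOMIC_RELEASE);
}
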
> @@ -98,16 +100,19 @@ rte_spinlock_unlock (rte_spinlock_t *sl)
>   *   1 if the lock is successfully taken; 0 otherwise.
>   */
>  static inline int
> -rte_spinlock_trylock (rte_spinlock_t *sl);
> +rte_spinlock_trylock(rte_spinlock_t *sl);
>  
>  #ifdef RTE_FORCE_INTRINSICS
>  static inline int
> -rte_spinlock_trylock (rte_spinlock_t *sl)
> +rte_spinlock_trylock(rte_spinlock_t *sl)
>  {
> -	int exp = 0;
> -	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> -				0, /* disallow spurious failure */
> -				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> +	uint16_t me = __atomic_fetch_add(&sl->s.next, 1, __ATOMIC_RELAXED);
> +	while (__atomic_load_n(&sl->s.current, __ATOMIC_RELAXED) != me) {
> +		__atomic_sub_fetch(&sl->s.next, 1, __ATOMIC_RELAXED);
> +		return 0;
> +	}
> +

Shouldn't we need CAS here?
Similar implementation here:
https://git.linaro.org/lng/odp.git/tree/platform/linux-generic/include/odp/api/plat/ticketlock_inlines.h


> +	return 1;
>  }
>  #endif
>  
>
  
Gavin Hu Dec. 27, 2018, 10:05 a.m. UTC | #2
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Thursday, December 27, 2018 2:58 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net; Joyce
> Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
> 
> On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > ----------------------------------------------------------------------
> > From: Joyce Kong <joyce.kong@arm.com>
> >
> > The old implementation is unfair, some threads may take locks
> > aggressively
> 
> I think, one issue here is x86 and ppc follows traditional spinlock and
> arm64 will be following ticket lock for spinlock implementation.
> This would change application behaviour on arm64 compared to x86 and
> ppc.
> 
> How about having a separate API for ticket lock? That would give,
> # application choice to use the locking strategy
> # application behaviour will be same across all arch.

Ok, will do in v4 and add a new API named rte_ticket_spinlock.

> Initial ticket lock implementation can be generic with C11 memory
> primitive, latter arch can optimize it, if required.
Yes, later we might optimize with new instructions. 

> > while leaving the other threads starving for long time. As shown in
> > the
> > following test, within same period of time, there are threads taking
> > locks
> > much more times than the others.
> >
> >
> >  #ifdef RTE_FORCE_INTRINSICS
> >  static inline void
> > -rte_spinlock_unlock (rte_spinlock_t *sl)
> > +rte_spinlock_unlock(rte_spinlock_t *sl)
> >  {
> > -	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
> > +	uint16_t i = __atomic_load_n(&sl->s.current, __ATOMIC_RELAXED);
> > +	i++;
> > +	__atomic_store_n(&sl->s.current, i, __ATOMIC_RELAXED);
> 
> Shouldn't we use __ATOMIC_RELEASE here to pair with lock() ?
> 
> 
> >  }
> >  #endif
> >
> > @@ -98,16 +100,19 @@ rte_spinlock_unlock (rte_spinlock_t *sl)
> >   *   1 if the lock is successfully taken; 0 otherwise.
> >   */
> >  static inline int
> > -rte_spinlock_trylock (rte_spinlock_t *sl);
> > +rte_spinlock_trylock(rte_spinlock_t *sl);
> >
> >  #ifdef RTE_FORCE_INTRINSICS
> >  static inline int
> > -rte_spinlock_trylock (rte_spinlock_t *sl)
> > +rte_spinlock_trylock(rte_spinlock_t *sl)
> >  {
> > -	int exp = 0;
> > -	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > -				0, /* disallow spurious failure */
> > -				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> > +	uint16_t me = __atomic_fetch_add(&sl->s.next, 1, __ATOMIC_RELAXED);
> > +	while (__atomic_load_n(&sl->s.current, __ATOMIC_RELAXED) != me) {
> > +		__atomic_sub_fetch(&sl->s.next, 1, __ATOMIC_RELAXED);
> > +		return 0;
> > +	}
> > +
> 
> Shouldn't we need CAS here?
> Similar implementation here:
> https://git.linaro.org/lng/odp.git/tree/platform/linux-generic/include/odp/api/plat/ticketlock_inlines.h

That is correct, CAS is required.
Assume T2 takes precedence in requesting a ticket, e.g. T2.next = 2, but has
not taken the lock yet, e.g. T2.current = 1. Then T1 calls trylock: T1.next
= 3. Then T2 takes the lock: T2.next = T2.current = 2. T1 falls back:
T1.next = 2. Both next values end up at 2, which is not correct: the next
current will be 3, and T1 will never get the lock.
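
For illustration, a CAS-based trylock along the lines of the ODP code
linked above; a sketch only, not the final patch:

static inline int
rte_spinlock_trylock(rte_spinlock_t *sl)
{
	/* Only try to take a ticket when the lock looks free,
	 * i.e. next == current. */
	uint16_t me = __atomic_load_n(&sl->s.next, __ATOMIC_RELAXED);

	if (__atomic_load_n(&sl->s.current, __ATOMIC_RELAXED) != me)
		return 0;

	/* The CAS claims ticket 'me' atomically; if another thread took a
	 * ticket in between, fail without disturbing the sequence. */
	return __atomic_compare_exchange_n(&sl->s.next, &me,
			(uint16_t)(me + 1), 0 /* strong */,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}
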
> 
> > +	return 1;
> >  }
> >  #endif
> >
> >
  
Jerin Jacob Kollanukkaran Dec. 27, 2018, 12:08 p.m. UTC | #3
On Thu, 2018-12-27 at 10:05 +0000, Gavin Hu (Arm Technology China)
wrote:
> > -----Original Message-----
> > From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> > Sent: Thursday, December 27, 2018 2:58 PM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; 
> > dev@dpdk.org
> > Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> > <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> > Joyce
> > Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> > hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to improve
> > fairness
> > 
> > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > ----------------------------------------------------------------------
> > > From: Joyce Kong <joyce.kong@arm.com>
> > > 
> > > The old implementation is unfair, some threads may take locks
> > > aggressively
> > 
> > I think, one issue here is x86 and ppc follows traditional spinlock
> > and
> > arm64 will be following ticket lock for spinlock implementation.
> > This would change application behaviour on arm64 compared to x86
> > and
> > ppc.
> > 
> > How about having a separate API for ticket lock? That would give,
> > # application choice to use the locking strategy
> > # application behaviour will be same across all arch.
> 
> Ok, will do in v4 to have a new named rte_ticket_spinlock API.

I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] names
instead of rte_ticket_spinlock_lock etc., to reduce the length of the
API names.
  
Stephen Hemminger Dec. 27, 2018, 11:41 p.m. UTC | #4
On Thu, 27 Dec 2018 12:08:26 +0000
Jerin Jacob Kollanukkaran <jerinj@marvell.com> wrote:

> On Thu, 2018-12-27 at 10:05 +0000, Gavin Hu (Arm Technology China)
> wrote:
> > > -----Original Message-----
> > > From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> > > Sent: Thursday, December 27, 2018 2:58 PM
> > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; 
> > > dev@dpdk.org
> > > Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> > > <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> > > Joyce
> > > Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> > > hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> > > Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to improve
> > > fairness
> > > 
> > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:  
> > > > ----------------------------------------------------------------------
> > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > 
> > > > The old implementation is unfair, some threads may take locks
> > > > aggressively  
> > > 
> > > I think, one issue here is x86 and ppc follows traditional spinlock
> > > and
> > > arm64 will be following ticket lock for spinlock implementation.
> > > This would change application behaviour on arm64 compared to x86
> > > and
> > > ppc.
> > > 
> > > How about having a separate API for ticket lock? That would give,
> > > # application choice to use the locking strategy
> > > # application behaviour will be same across all arch.  
> > 
> > Ok, will do in v4 to have a new named rte_ticket_spinlock API.  
> 
> I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
> instead of rte_ticket_spinlock_lock etc to reduce the length of the
> API.


NAK to adding new API for this.

I want the best possible locks for all applications and all architectures.
These should be called spinlock so there is no requirement for applications
to change to get better performance. Why not just implement the best
algorithm across the board? Yes, this means collaboration, or working on
the other guy's architecture.
  
Jerin Jacob Kollanukkaran Dec. 28, 2018, 4:39 a.m. UTC | #5
On Thu, 2018-12-27 at 15:41 -0800, Stephen Hemminger wrote:
> On Thu, 27 Dec 2018 12:08:26 +0000
> Jerin Jacob Kollanukkaran <jerinj@marvell.com> wrote:
> 
> > On Thu, 2018-12-27 at 10:05 +0000, Gavin Hu (Arm Technology China)
> > wrote:
> > > > -----Original Message-----
> > > > From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> > > > Sent: Thursday, December 27, 2018 2:58 PM
> > > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; 
> > > > dev@dpdk.org
> > > > Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> > > > <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> > > > Joyce
> > > > Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> > > > hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> > > > Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to
> > > > improve
> > > > fairness
> > > > 
> > > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:  
> > > > > ----------------------------------------------------------------------
> > > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > > 
> > > > > The old implementation is unfair, some threads may take locks
> > > > > aggressively  
> > > > 
> > > > I think, one issue here is x86 and ppc follows traditional
> > > > spinlock
> > > > and
> > > > arm64 will be following ticket lock for spinlock
> > > > implementation.
> > > > This would change application behaviour on arm64 compared to
> > > > x86
> > > > and
> > > > ppc.
> > > > 
> > > > How about having a separate API for ticket lock? That would
> > > > give,
> > > > # application choice to use the locking strategy
> > > > # application behaviour will be same across all arch.  
> > > 
> > > Ok, will do in v4 to have a new named rte_ticket_spinlock API.  
> > 
> > I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
> > instead of rte_ticket_spinlock_lock etc to reduce the length of the
> > API.
> 
> NAK to adding new API for this.
> 
> I want the best possible locks for all applications and all
> architectures.
> These should be called spinlock so there is no requirement for
> application
> to change to get better performance. Why not just implement the best
> algorithm
> across the board. Yes, this means collaboration or working on the
> other guys
> architecture.

Then the 6/6 patch needs to be put on hold if every arch needs to make the
ticket lock the default spinlock strategy.

How about the following to make forward progress:
1) Introduce the rte_ticketlock_[lock/unlock/trylock/is_locked] API now as
experimental with a default implementation.
2) Provide a timeline to switch every arch to an optimized ticketlock
implementation if needed.
3) Switch rte_ticketlock_ to the rte_spinlock_ API.
4) Keep the old version of the spinlock as a new API for applications that
do not need fairness between threads and want the lightweight spinlock
implementation.

I don't want arm64 to behave differently than other arch(s).
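
As a rough sketch of what step 1 could look like (a generic C11 ticketlock
under a separate name; illustrative only, not the final code):

typedef union {
	uint32_t tickets;
	struct {
		uint16_t current; /* ticket now being served */
		uint16_t next;    /* next ticket to hand out */
	} s;
} rte_ticketlock_t;

static inline void
rte_ticketlock_lock(rte_ticketlock_t *tl)
{
	/* Take a ticket and wait until it is served. */
	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);

	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
		rte_pause();
}

static inline void
rte_ticketlock_unlock(rte_ticketlock_t *tl)
{
	uint16_t i = __atomic_load_n(&tl->s.current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->s.current, i + 1, __ATOMIC_RELEASE);
}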
  
Gavin Hu Dec. 28, 2018, 10:04 a.m. UTC | #6
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Friday, December 28, 2018 12:39 PM
> To: stephen@networkplumber.org
> Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> dev@dpdk.org; Joyce Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> hemant.agrawal@nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>
> Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to improve fairness
> 
> On Thu, 2018-12-27 at 15:41 -0800, Stephen Hemminger wrote:
> > On Thu, 27 Dec 2018 12:08:26 +0000
> > Jerin Jacob Kollanukkaran <jerinj@marvell.com> wrote:
> >
> > > On Thu, 2018-12-27 at 10:05 +0000, Gavin Hu (Arm Technology China)
> > > wrote:
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> > > > > Sent: Thursday, December 27, 2018 2:58 PM
> > > > > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> > > > > dev@dpdk.org
> > > > > Cc: david.marchand@redhat.com; chaozhu@linux.vnet.ibm.com; nd
> > > > > <nd@arm.com>; bruce.richardson@intel.com; thomas@monjalon.net;
> > > > > Joyce
> > > > > Kong (Arm Technology China) <Joyce.Kong@arm.com>;
> > > > > hemant.agrawal@nxp.com; stephen@networkplumber.org; Honnappa
> > > > > Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > > Subject: Re: [EXT] [PATCH v3 6/6] spinlock: ticket based to
> > > > > improve
> > > > > fairness
> > > > >
> > > > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > > > > ----------------------------------------------------------------------
> > > > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > > >
> > > > > > The old implementation is unfair, some threads may take locks
> > > > > > aggressively
> > > > >
> > > > > I think, one issue here is x86 and ppc follows traditional
> > > > > spinlock
> > > > > and
> > > > > arm64 will be following ticket lock for spinlock
> > > > > implementation.
> > > > > This would change application behaviour on arm64 compared to
> > > > > x86
> > > > > and
> > > > > ppc.
> > > > >
> > > > > How about having a separate API for ticket lock? That would
> > > > > give,
> > > > > # application choice to use the locking strategy
> > > > > # application behaviour will be same across all arch.
> > > >
> > > > Ok, will do in v4 to have a new named rte_ticket_spinlock API.
> > >
> > > I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
> > > instead of rte_ticket_spinlock_lock etc to reduce the length of the
> > > API.
> >
> > NAK to adding new API for this.
> >
> > I want the best possible locks for all applications and all
> > architectures.
> > These should be called spinlock so there is no requirement for
> > application
> > to change to get better performance. Why not just implement the best
> > algorithm
> > across the board. Yes, this means collaboration or working on the
> > other guys
> > architecture.
> 
> Then 6/6 patch needs to put on hold if every arch needs to make ticket
> lock as default spinlock lock strategy.
> 
> How about following to make forward progress:
> 1) Introduce rte_ticketlock_[lock/unlock/trylock/is_locked] API now as
> experimental with default implementation
> 2) Provide a time line to switch every arch for optimized ticketlock
> implementation if needed.
> 3) Switch rte_ticketlock_ as rte_spinlock_ API.
> 4) Keep old version of spinlock as new API if some application does not
> need fairness between threads at the cost of light weight spinlock
> implementation.

We will rework the patches following this proposal; let's switch over at some later point.
 
> I don't want arm64 to behave differently than other arch(s).

This is the generic implementation; x86 and ppc can also choose it by setting CONFIG_RTE_FORCE_INTRINSICS=y in the config file.
The arch-specific implementations are instruction-level based; arm will implement one as well, and those are not ticket based.
>
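
For example, in the target's defconfig (the option is an existing
build-time setting; the file name here is just an example):

# config/defconfig_arm64-armv8a-linuxapp-gcc
CONFIG_RTE_FORCE_INTRINSICS=y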
  
Honnappa Nagarahalli Jan. 3, 2019, 6:35 p.m. UTC | #7
> > > > >
> > > > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > > > > ----------------------------------------------------------------------
> > > > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > > >
> > > > > > The old implementation is unfair, some threads may take locks
> > > > > > aggressively
> > > > >
> > > > > I think, one issue here is x86 and ppc follows traditional
> > > > > spinlock and
> > > > > arm64 will be following ticket lock for spinlock implementation.
> > > > > This would change application behaviour on arm64 compared to
> > > > > x86
> > > > > and
> > > > > ppc.
> > > > >
> > > > > How about having a separate API for ticket lock? That would
> > > > > give, # application choice to use the locking strategy #
> > > > > application behaviour will be same across all arch.
> > > >
> > > > Ok, will do in v4 to have a new named rte_ticket_spinlock API.
> > >
> > > I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
> > > instead of rte_ticket_spinlock_lock etc to reduce the length of the
> > > API.
> >
> > NAK to adding new API for this.
> >
> > I want the best possible locks for all applications and all
> > architectures.
> > These should be called spinlock so there is no requirement for
> > application to change to get better performance. Why not just
> > implement the best algorithm across the board. Yes, this means
> > collaboration or working on the other guys architecture.
IMO, the ticket lock should be a separate API. Spin lock (as implemented today) and ticket lock have different characteristics in terms of behavior as well as the resources required. For example, a spin lock might be sufficient for a low-contention use case, and it requires fewer cache lines. A ticket lock needs more cache lines and might be better for use cases with more contention. The decision to use spin lock vs ticket lock should be left to the application.

> 
> Then 6/6 patch needs to put on hold if every arch needs to make ticket lock
> as default spinlock lock strategy.
+1 for doing 6/6 as a separate patch.

> 
> How about following to make forward progress:
> 1) Introduce rte_ticketlock_[lock/unlock/trylock/is_locked] API now as
> experimental with default implementation
> 2) Provide a time line to switch every arch for optimized ticketlock
> implementation if needed.
> 3) Switch rte_ticketlock_ as rte_spinlock_ API.
> 4) Keep old version of spinlock as new API if some application does not need
> fairness between threads at the cost of light weight spinlock
> implementation.
> 
> I don't want arm64 to behave differently than other arch(s).
  
Stephen Hemminger Jan. 3, 2019, 7:53 p.m. UTC | #8
On Thu, 3 Jan 2019 18:35:31 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> > > > > >
> > > > > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:  
> > > > > > > ----------------------------------------------------------------------
> > > > > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > > > >
> > > > > > > The old implementation is unfair, some threads may take locks
> > > > > > > aggressively  
> > > > > >
> > > > > > I think, one issue here is x86 and ppc follows traditional
> > > > > > spinlock and
> > > > > > arm64 will be following ticket lock for spinlock implementation.
> > > > > > This would change application behaviour on arm64 compared to
> > > > > > x86
> > > > > > and
> > > > > > ppc.
> > > > > >
> > > > > > How about having a separate API for ticket lock? That would
> > > > > > give, # application choice to use the locking strategy #
> > > > > > application behaviour will be same across all arch.  
> > > > >
> > > > > Ok, will do in v4 to have a new named rte_ticket_spinlock API.  
> > > >
> > > > I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked] name
> > > > instead of rte_ticket_spinlock_lock etc to reduce the length of the
> > > > API.  
> > >
> > > NAK to adding new API for this.
> > >
> > > I want the best possible locks for all applications and all
> > > architectures.
> > > These should be called spinlock so there is no requirement for
> > > application to change to get better performance. Why not just
> > > implement the best algorithm across the board. Yes, this means
> > > collaboration or working on the other guys architecture.  
> IMO, the ticket lock should be a separate API. Spin lock (as implemented today) and ticket lock have different characteristics in terms of behavior as well as resources required. For ex: spin lock might be sufficient for a low contention use case and it requires less number of cache lines. Ticket lock needs more number of cache lines and might be good for use cases with more contention. The decision to use spin lock vs ticket lock should be left to the application.

The problem is that changing applications is hard. Real-world applications are non-trivial
and large. That is why doing things globally or at the library level is easier.
Do you want to go back and evaluate every lock in VPP, NFF-go, OVS, Tungsten Fabric, ....

Either a config or runtime option seems best to me.
  
Honnappa Nagarahalli Jan. 4, 2019, 7:06 a.m. UTC | #9
> 
> On Thu, 3 Jan 2019 18:35:31 +0000
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> 
> > > > > > >
> > > > > > > On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> > > > > > > > ----------------------------------------------------------------------
> > > > > > > > From: Joyce Kong <joyce.kong@arm.com>
> > > > > > > >
> > > > > > > > The old implementation is unfair, some threads may take
> > > > > > > > locks aggressively
> > > > > > >
> > > > > > > I think, one issue here is x86 and ppc follows traditional
> > > > > > > spinlock and
> > > > > > > arm64 will be following ticket lock for spinlock implementation.
> > > > > > > This would change application behaviour on arm64 compared to
> > > > > > > x86
> > > > > > > and
> > > > > > > ppc.
> > > > > > >
> > > > > > > How about having a separate API for ticket lock? That would
> > > > > > > give, # application choice to use the locking strategy #
> > > > > > > application behaviour will be same across all arch.
> > > > > >
> > > > > > Ok, will do in v4 to have a new named rte_ticket_spinlock API.
> > > > >
> > > > > I would prefer rte_ticketlock_[lock/unlock/trylock/is_locked]
> > > > > name instead of rte_ticket_spinlock_lock etc to reduce the
> > > > > length of the API.
> > > >
> > > > NAK to adding new API for this.
> > > >
> > > > I want the best possible locks for all applications and all
> > > > architectures.
> > > > These should be called spinlock so there is no requirement for
> > > > application to change to get better performance. Why not just
> > > > implement the best algorithm across the board. Yes, this means
> > > > collaboration or working on the other guys architecture.
> > IMO, the ticket lock should be a separate API. Spin lock (as implemented
> today) and ticket lock have different characteristics in terms of behavior as
> well as resources required. For ex: spin lock might be sufficient for a low
> contention use case and it requires less number of cache lines. Ticket lock
> needs more number of cache lines and might be good for use cases with more
> contention. The decision to use spin lock vs ticket lock should be left to the
> application.
> 
> The problem is that changing applications is hard. Real world applications are
> non-trivial and large. That is while doing things global or at the library level
> are easier.
> Do you want to go back and evaluate every lock in VPP, NFF-go, OVS,
> Tungsten Fabric, ....
> 
> Either a config or runtime option seems best to me.
IMO, we should provide both APIs and a config to use ticket lock as the implementation for spin lock. (similar to Jerin's proposal?)
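
A hypothetical sketch of that combination: both APIs exist, and a
build-time option (RTE_USE_TICKETLOCK is an invented name, not an existing
option) maps the spinlock API onto the ticketlock:

#ifdef RTE_USE_TICKETLOCK
typedef rte_ticketlock_t rte_spinlock_t;
#define rte_spinlock_init(sl)    rte_ticketlock_init(sl)
#define rte_spinlock_lock(sl)    rte_ticketlock_lock(sl)
#define rte_spinlock_unlock(sl)  rte_ticketlock_unlock(sl)
#define rte_spinlock_trylock(sl) rte_ticketlock_trylock(sl)
#endif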
  

Patch

diff --git a/lib/librte_eal/common/include/arch/ppc_64/rte_spinlock.h b/lib/librte_eal/common/include/arch/ppc_64/rte_spinlock.h
index 39815d9ee..9fa904f92 100644
--- a/lib/librte_eal/common/include/arch/ppc_64/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/ppc_64/rte_spinlock.h
@@ -65,6 +65,11 @@  rte_spinlock_trylock(rte_spinlock_t *sl)
 	return __sync_lock_test_and_set(&sl->locked, 1) == 0;
 }
 
+static inline int
+rte_spinlock_is_locked(rte_spinlock_t *sl)
+{
+	return sl->locked;
+}
 #endif
 
 static inline int rte_tm_supported(void)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_spinlock.h b/lib/librte_eal/common/include/arch/x86/rte_spinlock.h
index e2e2b2643..db80fa420 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_spinlock.h
@@ -65,6 +65,12 @@  rte_spinlock_trylock (rte_spinlock_t *sl)
 
 	return lockval == 0;
 }
+
+static inline int
+rte_spinlock_is_locked(rte_spinlock_t *sl)
+{
+	return sl->locked;
+}
 #endif
 
 extern uint8_t rte_rtm_supported;
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index 87ae7a4f1..607abd400 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -27,8 +27,12 @@ 
 /**
  * The rte_spinlock_t type.
  */
-typedef struct {
-	volatile int locked; /**< lock status 0 = unlocked, 1 = locked */
+typedef union {
+	volatile int locked; /* lock status 0 = unlocked, 1 = locked */
+	struct {
+		uint16_t current;
+		uint16_t next;
+	} s;
 } rte_spinlock_t;
 
 /**
@@ -45,7 +49,8 @@  typedef struct {
 static inline void
 rte_spinlock_init(rte_spinlock_t *sl)
 {
-	sl->locked = 0;
+	__atomic_store_n(&sl->s.current, 0, __ATOMIC_RELAXED);
+	__atomic_store_n(&sl->s.next, 0, __ATOMIC_RELAXED);
 }
 
 /**
@@ -61,14 +66,9 @@  rte_spinlock_lock(rte_spinlock_t *sl);
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-	int exp = 0;
-
-	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
-				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
-		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
-			rte_pause();
-		exp = 0;
-	}
+	uint16_t me = __atomic_fetch_add(&sl->s.next, 1, __ATOMIC_RELAXED);
+	while (__atomic_load_n(&sl->s.current, __ATOMIC_ACQUIRE) != me)
+		rte_pause();
 }
 #endif
 
@@ -79,13 +79,15 @@  rte_spinlock_lock(rte_spinlock_t *sl)
  *   A pointer to the spinlock.
  */
 static inline void
-rte_spinlock_unlock (rte_spinlock_t *sl);
+rte_spinlock_unlock(rte_spinlock_t *sl);
 
 #ifdef RTE_FORCE_INTRINSICS
 static inline void
-rte_spinlock_unlock (rte_spinlock_t *sl)
+rte_spinlock_unlock(rte_spinlock_t *sl)
 {
-	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
+	uint16_t i = __atomic_load_n(&sl->s.current, __ATOMIC_RELAXED);
+	i++;
+	__atomic_store_n(&sl->s.current, i, __ATOMIC_RELAXED);
 }
 #endif
 
@@ -98,16 +100,19 @@  rte_spinlock_unlock (rte_spinlock_t *sl)
  *   1 if the lock is successfully taken; 0 otherwise.
  */
 static inline int
-rte_spinlock_trylock (rte_spinlock_t *sl);
+rte_spinlock_trylock(rte_spinlock_t *sl);
 
 #ifdef RTE_FORCE_INTRINSICS
 static inline int
-rte_spinlock_trylock (rte_spinlock_t *sl)
+rte_spinlock_trylock(rte_spinlock_t *sl)
 {
-	int exp = 0;
-	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
-				0, /* disallow spurious failure */
-				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
+	uint16_t me = __atomic_fetch_add(&sl->s.next, 1, __ATOMIC_RELAXED);
+	while (__atomic_load_n(&sl->s.current, __ATOMIC_RELAXED) != me) {
+		__atomic_sub_fetch(&sl->s.next, 1, __ATOMIC_RELAXED);
+		return 0;
+	}
+
+	return 1;
 }
 #endif
 
@@ -119,10 +124,14 @@  rte_spinlock_trylock (rte_spinlock_t *sl)
  * @return
  *   1 if the lock is currently taken; 0 otherwise.
  */
-static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
+#ifdef RTE_FORCE_INTRINSICS
+static inline int
+rte_spinlock_is_locked(rte_spinlock_t *sl)
 {
-	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
+	return (__atomic_load_n(&sl->s.current, __ATOMIC_RELAXED) !=
+			__atomic_load_n(&sl->s.next, __ATOMIC_RELAXED));
 }
+#endif
 
 /**
  * Test if hardware transactional memory (lock elision) is supported