librte_eal: fix mcslock hang on weak memory

Message ID 20200826092002.19395-1-diogo.behrens@huawei.com
State Accepted, archived
Delegated to: Thomas Monjalon
Series
  • librte_eal: fix mcslock hang on weak memory

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-testing success Testing PASS
ci/travis-robot success Travis build: passed
ci/Intel-compilation success Compilation OK

Commit Message

Diogo Behrens Aug. 26, 2020, 9:20 a.m. UTC
The initialization me->locked=1 in lock() must happen before
    next->locked=0 in unlock(), otherwise a thread may hang forever,
    waiting for me->locked to become 0. On weak memory systems (such as
    ARMv8), the current implementation allows me->locked=1 to be
    reordered with announcing the node (pred->next=me) and, consequently,
    to be reordered with next->locked=0 in unlock().

    This fix adds a release barrier to pred->next=me, forcing
    me->locked=1 to happen before this operation.

Signed-off-by: Diogo Behrens <diogo.behrens@huawei.com>
---
 lib/librte_eal/include/generic/rte_mcslock.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
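The ordering requirement described in the commit message can be sketched with C11 atomics. This is an illustrative model of the MCS lock in rte_mcslock.h, not the DPDK source; names and structure are simplified:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Simplified MCS queue lock; one node per waiting thread. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_int locked;
} mcs_node_t;

static _Atomic(mcs_node_t *) tail;

static void mcs_lock(mcs_node_t *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, 1, memory_order_relaxed);

    mcs_node_t *pred =
        atomic_exchange_explicit(&tail, me, memory_order_acq_rel);
    if (pred == NULL)
        return; /* lock was free, we own it */

    /* RELEASE publishes the node: it orders the me->locked = 1
     * initialization before the node becomes visible to the
     * predecessor, which is the point of the patch under discussion. */
    atomic_store_explicit(&pred->next, me, memory_order_release);

    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ; /* spin until the predecessor hands the lock over */
}

static void mcs_unlock(mcs_node_t *me)
{
    mcs_node_t *next =
        atomic_load_explicit(&me->next, memory_order_acquire);
    if (next == NULL) {
        mcs_node_t *expected = me;
        /* No known successor: try to swing tail back to NULL. */
        if (atomic_compare_exchange_strong_explicit(&tail, &expected,
                NULL, memory_order_release, memory_order_relaxed))
            return;
        /* A successor is arriving: wait until it announces itself. */
        while ((next = atomic_load_explicit(&me->next,
                memory_order_acquire)) == NULL)
            ;
    }
    atomic_store_explicit(&next->locked, 0, memory_order_release);
}
```

If the pred->next store were relaxed instead of release, the me->locked = 1 initialization could (on a weakly ordered machine) become visible after the predecessor's next->locked = 0 store, and the spin loop would hang forever, which is the bug this patch fixes.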

Comments

Phil Yang Aug. 26, 2020, 10:17 a.m. UTC | #1
Diogo Behrens <diogo.behrens@huawei.com> writes:

> Subject: [PATCH] librte_eal: fix mcslock hang on weak memory
> 
>     The initialization me->locked=1 in lock() must happen before
>     next->locked=0 in unlock(), otherwise a thread may hang forever,
>     waiting for me->locked to become 0. On weak memory systems (such as
>     ARMv8), the current implementation allows me->locked=1 to be
>     reordered with announcing the node (pred->next=me) and, consequently,
>     to be reordered with next->locked=0 in unlock().
> 
>     This fix adds a release barrier to pred->next=me, forcing
>     me->locked=1 to happen before this operation.
> 
> Signed-off-by: Diogo Behrens <diogo.behrens@huawei.com>
> ---
>  lib/librte_eal/include/generic/rte_mcslock.h | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_eal/include/generic/rte_mcslock.h b/lib/librte_eal/include/generic/rte_mcslock.h
> index 2bef28351..ce553f547 100644
> --- a/lib/librte_eal/include/generic/rte_mcslock.h
> +++ b/lib/librte_eal/include/generic/rte_mcslock.h
> @@ -68,7 +68,14 @@ rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t *me)
>  		 */
>  		return;
>  	}
> -	__atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);
> +	/* The store to me->next above should also complete before the node
> +	 * is visible to the predecessor thread releasing the lock. Hence,
> +	 * the store to prev->next also requires release semantics. Note
> +	 * that, for example, on ARM, the release semantics in the exchange
> +	 * operation is not as strong as a release fence and is not
> +	 * sufficient to enforce the desired order here.


Hi Diogo,

I didn't quite understand why the exchange operation with ACQ_REL memory ordering is not sufficient.
It emits 'stlxr' on armv8.0-a and 'swpal' on armv8.1-a (with LSE support).
Both of these instructions have release semantics.

Please see the following link for details:
https://gcc.godbolt.org/z/Mc4373

BTW, if you captured a deadlock issue on your testbed, could you please share your test case here?

Thanks,
Phil


> +	 */
> +	__atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
> 
>  	/* The while-load of me->locked should not move above the previous
>  	 * store to prev->next. Otherwise it will cause a deadlock. Need a
> --
> 2.28.0
>
Diogo Behrens Aug. 27, 2020, 8:56 a.m. UTC | #2
Hi Phil,

thanks for taking a look at this. Indeed, the cmpxchg will have release semantics, but its implicit barrier does not provide the same ordering guarantees as DMB ISH, i.e., atomic_thread_fence(__ATOMIC_ACQ_REL).

Let me try to explain the scenario. Here is a pseudo-ARM assembly with the instructions involved:

STR me->locked = 1    // 1: initialize the node
LDAXR pred = tail     // 2: cmpxchg
STLXR tail = me       // 3: cmpxchg
STR pred->next = me   // 4: the problematic store

Now, ARM allows instructions 1 and 2 to be reordered, and also instructions 3 and 4 to be reordered:

LDAXR pred = tail     // 2: cmpxchg (acquires can be moved up)
STR me->locked = 1    // 1: initialize the node
STR pred->next = me   // 4: the problematic store
STLXR tail = me       // 3: cmpxchg (releases can be moved down)

And since stores 1 and 4 can be reordered too, the following execution is possible:

LDAXR pred = tail     // 2: cmpxchg
STR pred->next = me   // 4: the problematic store
STR me->locked = 1    // 1: initialize the node
STLXR tail = me       // 3: cmpxchg

In this subtle situation, the locking thread can overwrite the unlocking thread's store to next->locked (if it occurs between 4 and 1), and subsequently hang forever waiting for me->locked to become 0.

We couldn't reproduce this with our available hardware, but we "reproduced" it with the RMEM model checker, which precisely models the ARM memory model (https://github.com/rems-project/rmem). The GenMC model checker (https://github.com/MPI-SWS/genmc) also finds the issue, but its intermediate memory model is slightly more general than the ARM model.

The release barrier attached to store (4) prevents (1) from being reordered past it.

Please, let me know if that makes sense or if I am missing something.

Thanks,
-Diogo
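The scenario above can also be expressed as a small two-thread C11 program. This is a sketch of the litmus shape only, not DPDK code, and the variable names are made up: P1 plays the locking thread (instructions 1-4), P0 plays the unlocking thread. With the release store on the publishing side, the outcome "P0's store of 2 is overwritten by the initialization" is forbidden:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Shared locations standing in for the lock fields. */
static atomic_int locked;   /* me->locked */
static atomic_int next_set; /* the pred->next announcement */
static atomic_int tail;     /* the lock tail */
static int p0_stored;       /* did P0 (the unlocker) store 2? */

/* P1: the locking thread, instructions 1-4 from the explanation. */
static void *p1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&locked, 1, memory_order_relaxed);  /* 1 */
    atomic_exchange_explicit(&tail, 1, memory_order_acq_rel); /* 2+3 */
    /* 4: the release keeps (1) ordered before the announcement. */
    atomic_store_explicit(&next_set, 1, memory_order_release);
    return NULL;
}

/* P0: the unlocking thread. */
static void *p0(void *arg)
{
    (void)arg;
    if (atomic_load_explicit(&next_set, memory_order_acquire)) {
        atomic_store_explicit(&locked, 2, memory_order_relaxed);
        p0_stored = 1;
    }
    return NULL;
}
```

Because P0's acquire load of next_set synchronizes with P1's release store, the locked = 1 initialization happens-before P0's store of 2, so 2 must win whenever P0 observes the announcement.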

-----Original Message-----
From: Phil Yang [mailto:Phil.Yang@arm.com] 
Sent: Wednesday, August 26, 2020 12:17 PM
To: Diogo Behrens <diogo.behrens@huawei.com>
Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
Subject: RE: [PATCH] librte_eal: fix mcslock hang on weak memory

> <snip>
Phil Yang Aug. 28, 2020, 9:19 a.m. UTC | #3
Hi Diogo,

Thanks for your explanation.

As documented in the Arm ARM <https://developer.arm.com/documentation/ddi0487/fc>, B2.9.5 "Load-Exclusive and Store-Exclusive instruction usage restrictions":

"Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory accesses, preloads, direct or indirect System register writes, address translation instructions, cache or TLB maintenance instructions, exception generating instructions, exception returns, or indirect branches."

So it is not allowed to insert (1) & (4) between (2, 3). The cmpxchg operation is atomic.

Thanks,
Phil Yang



> -----Original Message-----
> From: Diogo Behrens <diogo.behrens@huawei.com>
> Sent: Thursday, August 27, 2020 4:57 PM
> To: Phil Yang <Phil.Yang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [PATCH] librte_eal: fix mcslock hang on weak memory
>
> <snip>
Honnappa Nagarahalli Aug. 31, 2020, 6:45 p.m. UTC | #4
Hi Diogo,

Thanks for your explanation.

As documented in https://developer.arm.com/documentation/ddi0487/fc  B2.9.5 Load-Exclusive and Store-Exclusive instruction usage restrictions:
" Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory accesses, preloads, 
direct or indirect System register writes, address translation instructions, cache or TLB maintenance
instructions, exception generating instructions, exception returns, or indirect branches."
[Honnappa] This is a requirement on the software, not on the micro-architecture.
We are having a few discussions internally, and will get back soon.

So it is not allowed to insert (1) & (4) between (2, 3). The cmpxchg operation is atomic.

Thanks,
Phil Yang

> -----Original Message-----
> From: Diogo Behrens <diogo.behrens@huawei.com>
> Sent: Thursday, August 27, 2020 4:57 PM
> To: Phil Yang <Phil.Yang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [PATCH] librte_eal: fix mcslock hang on weak memory
>
> <snip>
Thomas Monjalon Oct. 6, 2020, 9:49 p.m. UTC | #5
31/08/2020 20:45, Honnappa Nagarahalli:
> 
> Hi Diogo,
> 
> Thanks for your explanation.
> 
> As documented in https://developer.arm.com/documentation/ddi0487/fc  B2.9.5 Load-Exclusive and Store-Exclusive instruction usage restrictions:
> " Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory accesses, preloads, 
> direct or indirect System register writes, address translation instructions, cache or TLB maintenance
> instructions, exception generating instructions, exception returns, or indirect branches."
> [Honnappa] This is a requirement on the software, not on the micro-architecture.
> We are having few discussions internally, will get back soon.
> 
> So it is not allowed to insert (1) & (4) between (2, 3). The cmpxchg operation is atomic.


Please, what is the conclusion?
Diogo Behrens Oct. 7, 2020, 9:55 a.m. UTC | #6
Hi Thomas,

we are still waiting for comments from Honnappa. In our understanding, the missing barrier is a bug according to the model. We reproduced the scenario in herd7, which implements the authoritative Arm memory model: https://developer.arm.com/architectures/cpu-architecture/a-profile/memory-model-tool

Here is a litmus test showing that the XCHG (when compiled to LDAXR and STLXR) is not atomic with respect to memory updates to other locations:
-----
AArch64 XCHG-nonatomic
{
0:X1=locked; 0:X3=next;
1:X1=locked; 1:X3=next; 1:X5=tail;
}
 P0		| P1;
 LDR W0, [X3]	| MOV W0, #1;
 CBZ W0, end	| STR W0, [X1]; (* init locked *) 
 MOV W2, #2	| MOV W2, #0;
 STR W2, [X1]	| xchg:;
 end:		| LDAXR W6, [X5];
 NOP		| STLXR W4, W0, [X5];
 NOP		| CBNZ W4, xchg;
 NOP		| STR W0, [X3]; (* set next *) 
exists
(0:X2=2 /\ locked=1)
-----
(web version of herd7: http://diy.inria.fr/www/?record=aarch64)

P1 is trying to acquire the lock:
- initializes locked
- does the xchg on the tail of the mcslock
- sets the next

P0 is releasing the lock:
- if next is not set, just terminates
- if next is set, stores 2 in locked

The initialization of locked should never overwrite the store of 2 to locked, but it does.
To prevent that reordering, one should give the last store of P1 release semantics, i.e., use STLR.

This is equivalent to the reordering occurring in the mcslock of librte_eal.

Best regards,
-Diogo

-----Original Message-----
From: Thomas Monjalon [mailto:thomas@monjalon.net] 
Sent: Tuesday, October 6, 2020 11:50 PM
To: Phil Yang <Phil.Yang@arm.com>; Diogo Behrens <diogo.behrens@huawei.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Cc: dev@dpdk.org; nd <nd@arm.com>
Subject: Re: [dpdk-dev] [PATCH] librte_eal: fix mcslock hang on weak memory

31/08/2020 20:45, Honnappa Nagarahalli:
> <snip>

Please, what is the conclusion?
Thomas Monjalon Oct. 20, 2020, 11:56 a.m. UTC | #7
Honnappa?

07/10/2020 11:55, Diogo Behrens:
> <snip>
Honnappa Nagarahalli Oct. 20, 2020, 9:49 p.m. UTC | #8
<snip>
Apologies for not updating on this sooner.

Unfortunately, memory ordering questions are hard topics. I have been discussing this internally with a few experts; the discussion is still ongoing, and I hope to conclude soon.

My focus has been on replacing __atomic_exchange_n(msl, me, __ATOMIC_ACQ_REL) with __atomic_exchange_n(msl, me, __ATOMIC_SEQ_CST). However, the generated code (for load/store exclusives) is the same in both cases, and I am not sure that is correct.

I think we have 2 choices here:
1) Accept the patch - when my internal discussion concludes, I can make the change and backport according to the conclusion.
2) Wait till the discussion is over - it might take another couple of weeks
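For what it's worth, the ordering argument changes only the fences the compiler emits, not the return-value semantics of the builtin. A minimal single-threaded sketch (made-up values, not the DPDK code):

```c
#include <assert.h>

/* __atomic_exchange_n atomically stores the new value and returns the
 * previous one; ACQ_REL vs SEQ_CST differ only in ordering strength.
 * Returns 1 if both exchanges behave as described. */
static int exchange_demo(void)
{
    int tail = 0;

    int old1 = __atomic_exchange_n(&tail, 42, __ATOMIC_ACQ_REL);
    int old2 = __atomic_exchange_n(&tail, 7, __ATOMIC_SEQ_CST);

    return old1 == 0 && old2 == 42 && tail == 7;
}
```

Whether the identical AArch64 code generation for the two orderings is sufficient here is exactly the open question in this thread.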

Thomas Monjalon Nov. 22, 2020, 6:07 p.m. UTC | #9
20/10/2020 23:49, Honnappa Nagarahalli:
> <snip>
> Apologies for not updating on this sooner.
> 
> Unfortunately, memory ordering questions are hard topics. I have been discussing this internally with few experts and it is still ongoing, hope to conclude soon.
> 
> My focus has been to replace __atomic_exchange_n(msl, me, __ATOMIC_ACQ_REL) with __atomic_exchange_n(msl, me, __ATOMIC_SEQ_CST). However, the generated code is the same in the second case as well (for load-store exclusives), which I am not sure if it is correct.
> 
> I think we have 2 choices here:
> 1) Accept the patch - when my internal discussion concludes, I can make the change and backport according to the conclusion.
> 2) Wait till the discussion is over - it might take another couple of weeks

One month passed since this last update.
We are keeping this issue in DPDK 20.11.0 I guess.
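For readers without a herd7 setup, the quoted litmus test can be approximated with C11 atomics. This is an illustrative sketch only: the names (`locked`, `next_node`, `tail`) follow the litmus code, and a run on a TSO machine such as x86 will not exhibit the forbidden outcome; even on Armv8 it would be unlikely without many iterations.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Shared locations from the litmus test. */
static atomic_int locked, next_node, tail;
static int p0_saw_next; /* P0's register W0 */

/* P0: the thread releasing the lock. */
static void *p0(void *arg)
{
	(void)arg;
	p0_saw_next = atomic_load_explicit(&next_node, memory_order_relaxed);
	if (p0_saw_next != 0) /* successor announced itself: hand over */
		atomic_store_explicit(&locked, 2, memory_order_relaxed);
	return NULL;
}

/* P1: the thread acquiring the lock. */
static void *p1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&locked, 1, memory_order_relaxed);    /* init locked */
	atomic_exchange_explicit(&tail, 1, memory_order_acq_rel);   /* LDAXR/STLXR */
	atomic_store_explicit(&next_node, 1, memory_order_relaxed); /* set next */
	return NULL;
}

/* Returns 1 iff the forbidden (hang-causing) outcome was observed:
 * P0 saw next set and stored 2, yet P1's init of locked=1 ended up
 * overwriting it. */
int run_litmus(void)
{
	pthread_t t0, t1;

	atomic_store(&locked, 0);
	atomic_store(&next_node, 0);
	atomic_store(&tail, 0);
	p0_saw_next = 0;
	pthread_create(&t1, NULL, p1, NULL);
	pthread_create(&t0, NULL, p0, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return p0_saw_next != 0 && atomic_load(&locked) == 1;
}
```

Note that C11 does not even require the forbidden outcome to be impossible here, since P0's read of `next_node` is relaxed; the sketch only mirrors the structure of the AArch64 test.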
Honnappa Nagarahalli Nov. 23, 2020, 3:06 p.m. UTC | #10
<snip>

> > >
> > > 07/10/2020 11:55, Diogo Behrens:
> > > > Hi Thomas,
> > > >
> > > > we are still waiting for the comments from Honnappa. In our
> > > > understanding, the missing barrier is a bug according to the
> > > > model. We reproduced the scenario in herd7, which represents the
> > > > authoritative memory model:
> > > > https://developer.arm.com/architectures/cpu-architecture/a-profile
> > > > /mem
> > > > ory-model-tool
> > > >
> > > > Here is a litmus code that shows that the XCHG (when compiled to
> > > > LDAXR
> > > and STLR) is not atomic wrt memory updates to other locations:
> > > > -----
> > > > AArch64 XCHG-nonatomic
> > > > {
> > > > 0:X1=locked; 0:X3=next;
> > > > 1:X1=locked; 1:X3=next; 1:X5=tail; }
> > > >  P0		| P1;
> > > >  LDR W0, [X3]	| MOV W0, #1;
> > > >  CBZ W0, end	| STR W0, [X1]; (* init locked *)
> > > >  MOV W2, #2	| MOV W2, #0;
> > > >  STR W2, [X1]	| xchg:;
> > > >  end:		| LDAXR W6, [X5];
> > > >  NOP		| STLXR W4, W0, [X5];
> > > >  NOP		| CBNZ W4, xchg;
> > > >  NOP		| STR W0, [X3]; (* set next *)
> > > > exists
> > > > (0:X2=2 /\ locked=1)
> > > > -----
> > > > (web version of herd7: http://diy.inria.fr/www/?record=aarch64)
> > > >
> > > > P1 is trying to acquire the lock:
> > > > - initializes locked
> > > > - does the xchg on the tail of the mcslock
> > > > - sets the next
> > > >
> > > > P0 is releasing the lock:
> > > > - if next is not set, just terminates
> > > > - if next is set, stores 2 in locked
> > > >
> > > > The initialization of locked should never overwrite the store 2 to
> > > > locked, but
> > > it does.
> > > > To avoid that reordering to happen, one should make the last store
> > > > of P1 to
> > > have a "release" barrier, ie, STLR.
> > > >
> > > > This is equivalent to the reordering occurring in the mcslock of librte_eal.
> > > >
> > > > Best regards,
> > > > -Diogo
> > > >
> > > > -----Original Message-----
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > Sent: Tuesday, October 6, 2020 11:50 PM
> > > > To: Phil Yang <Phil.Yang@arm.com>; Diogo Behrens
> > > > <diogo.behrens@huawei.com>; Honnappa Nagarahalli
> > > > <Honnappa.Nagarahalli@arm.com>
> > > > Cc: dev@dpdk.org; nd <nd@arm.com>
> > > > Subject: Re: [dpdk-dev] [PATCH] librte_eal: fix mcslock hang on
> > > > weak memory
> > > >
> > > > 31/08/2020 20:45, Honnappa Nagarahalli:
> > > > >
> > > > > Hi Diogo,
> > > > >
> > > > > Thanks for your explanation.
> > > > >
> > > > > As documented in
> > > https://developer.arm.com/documentation/ddi0487/fc  B2.9.5 Load-
> > > Exclusive and Store-Exclusive instruction usage restrictions:
> > > > > " Between the Load-Exclusive and the Store-Exclusive, there are
> > > > > no explicit memory accesses, preloads, direct or indirect System
> > > > > register writes, address translation instructions, cache or TLB
> > > maintenance instructions, exception generating instructions,
> > > exception returns, or indirect branches."
> > > > > [Honnappa] This is a requirement on the software, not on the
> > > > > micro-
> > > architecture.
> > > > > We are having few discussions internally, will get back soon.
> > > > >
> > > > > So it is not allowed to insert (1) & (4) between (2, 3). The
> > > > > cmpxchg
> > > operation is atomic.
> > > >
> > > >
> > > > Please what is the conclusion?
> > Apologies for not updating on this sooner.
> >
> > Unfortunately, memory ordering questions are hard topics. I have been
> discussing this internally with few experts and it is still ongoing, hope to
> conclude soon.
> >
> > My focus has been to replace __atomic_exchange_n(msl, me,
> __ATOMIC_ACQ_REL) with __atomic_exchange_n(msl, me,
> __ATOMIC_SEQ_CST). However, the generated code is the same in the second
> case as well (for load-store exclusives), which I am not sure if it is correct.
> >
> > I think we have 2 choices here:
> > 1) Accept the patch - when my internal discussion concludes, I can make the
> change and backport according to the conclusion.
> > 2) Wait till the discussion is over - it might take another couple of
> > weeks
> 
> One month passed since this last update.
> We are keeping this issue in DPDK 20.11.0 I guess.
> 
I can accept this patch and move forward for 20.11. It is a stronger barrier and I do not see any issues from the code perspective. I will run tests on a few platforms and provide my ACK.

It is a work in progress with a few changes on my side to make sure we have an optimal solution for all platforms. Those changes can go into 21.02.
Stephen Hemminger Nov. 23, 2020, 3:44 p.m. UTC | #11
On Mon, 23 Nov 2020 15:06:06 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> <snip>
> 
> > > >
> > > > 07/10/2020 11:55, Diogo Behrens:  
> > > > > Hi Thomas,
> > > > >
> > > > > we are still waiting for the comments from Honnappa. In our
> > > > > understanding, the missing barrier is a bug according to the
> > > > > model. We reproduced the scenario in herd7, which represents
> > > > > the authoritative memory model:
> > > > > https://developer.arm.com/architectures/cpu-architecture/a-profile
> > > > > /mem
> > > > > ory-model-tool
> > > > >
> > > > > Here is a litmus code that shows that the XCHG (when compiled
> > > > > to LDAXR  
> > > > and STLR) is not atomic wrt memory updates to other locations:  
> > > > > -----
> > > > > AArch64 XCHG-nonatomic
> > > > > {
> > > > > 0:X1=locked; 0:X3=next;
> > > > > 1:X1=locked; 1:X3=next; 1:X5=tail; }
> > > > >  P0		| P1;
> > > > >  LDR W0, [X3]	| MOV W0, #1;
> > > > >  CBZ W0, end	| STR W0, [X1]; (* init locked *)
> > > > >  MOV W2, #2	| MOV W2, #0;
> > > > >  STR W2, [X1]	| xchg:;
> > > > >  end:		| LDAXR W6, [X5];
> > > > >  NOP		| STLXR W4, W0, [X5];
> > > > >  NOP		| CBNZ W4, xchg;
> > > > >  NOP		| STR W0, [X3]; (* set next *)
> > > > > exists
> > > > > (0:X2=2 /\ locked=1)
> > > > > -----
> > > > > (web version of herd7:
> > > > > http://diy.inria.fr/www/?record=aarch64)
> > > > >
> > > > > P1 is trying to acquire the lock:
> > > > > - initializes locked
> > > > > - does the xchg on the tail of the mcslock
> > > > > - sets the next
> > > > >
> > > > > P0 is releasing the lock:
> > > > > - if next is not set, just terminates
> > > > > - if next is set, stores 2 in locked
> > > > >
> > > > > The initialization of locked should never overwrite the store
> > > > > 2 to locked, but  
> > > > it does.  
> > > > > To avoid that reordering to happen, one should make the last
> > > > > store of P1 to  
> > > > have a "release" barrier, ie, STLR.  
> > > > >
> > > > > This is equivalent to the reordering occurring in the mcslock
> > > > > of librte_eal.
> > > > >
> > > > > Best regards,
> > > > > -Diogo
> > > > >
> > > > > -----Original Message-----
> > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > Sent: Tuesday, October 6, 2020 11:50 PM
> > > > > To: Phil Yang <Phil.Yang@arm.com>; Diogo Behrens
> > > > > <diogo.behrens@huawei.com>; Honnappa Nagarahalli
> > > > > <Honnappa.Nagarahalli@arm.com>
> > > > > Cc: dev@dpdk.org; nd <nd@arm.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH] librte_eal: fix mcslock hang
> > > > > on weak memory
> > > > >
> > > > > 31/08/2020 20:45, Honnappa Nagarahalli:  
> > > > > >
> > > > > > Hi Diogo,
> > > > > >
> > > > > > Thanks for your explanation.
> > > > > >
> > > > > > As documented in  
> > > > https://developer.arm.com/documentation/ddi0487/fc  B2.9.5 Load-
> > > > Exclusive and Store-Exclusive instruction usage restrictions:  
> > > > > > " Between the Load-Exclusive and the Store-Exclusive, there
> > > > > > are no explicit memory accesses, preloads, direct or
> > > > > > indirect System register writes, address translation
> > > > > > instructions, cache or TLB  
> > > > maintenance instructions, exception generating instructions,
> > > > exception returns, or indirect branches."  
> > > > > > [Honnappa] This is a requirement on the software, not on the
> > > > > > micro-  
> > > > architecture.  
> > > > > > We are having few discussions internally, will get back
> > > > > > soon.
> > > > > >
> > > > > > So it is not allowed to insert (1) & (4) between (2, 3). The
> > > > > > cmpxchg  
> > > > operation is atomic.  
> > > > >
> > > > >
> > > > > Please what is the conclusion?  
> > > Apologies for not updating on this sooner.
> > >
> > > Unfortunately, memory ordering questions are hard topics. I have
> > > been  
> > discussing this internally with few experts and it is still
> > ongoing, hope to conclude soon.  
> > >
> > > My focus has been to replace __atomic_exchange_n(msl, me,  
> > __ATOMIC_ACQ_REL) with __atomic_exchange_n(msl, me,
> > __ATOMIC_SEQ_CST). However, the generated code is the same in the
> > second case as well (for load-store exclusives), which I am not
> > sure if it is correct.  
> > >
> > > I think we have 2 choices here:
> > > 1) Accept the patch - when my internal discussion concludes, I
> > > can make the  
> > change and backport according to the conclusion.  
> > > 2) Wait till the discussion is over - it might take another
> > > couple of weeks  
> > 
> > One month passed since this last update.
> > We are keeping this issue in DPDK 20.11.0 I guess.
> >   
> I can accept this patch and move forward for 20.11. It is a stronger
> barrier and I do not see any issues from the code perspective. I will
> run tests on few platforms and provide my ACK.
> 
> It is work in progress with few changes for me to make sure we have
> an optimal solution for all platforms. Those changes can go into
> 21.02.

Has anyone investigated later developments in concurrency?
While researching MCS locks, I discovered this quote:
https://mfukar.github.io/2017/09/26/mcs.html
	Luckily, we don’t have to worry about this very much. MCS locks
	right now are mostly a teaching tool, and have mostly been superseded by:

	CLH locks: Craig, Landin, and Hagersten locks replace the explicit
		queue for a logical queue 
	K42 locks: On-stack information is used instead of keeping a thread-local
		queue node around.
	A similar idea is used by the stack-lock algorithm.

Note: K42 locks are patented by IBM.
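For comparison, a minimal CLH lock can be sketched with the same GCC atomic builtins the DPDK code uses. This is an illustrative sketch, not a DPDK API: each thread spins on its *predecessor's* node, so the queue stays implicit and no `next` pointer (and hence no store like the `prev->next = me` being fixed here) is needed.

```c
#include <stdbool.h>

typedef struct clh_node {
	bool locked;
} clh_node_t;

typedef struct {
	clh_node_t *tail;   /* last node in the implicit queue */
	clh_node_t initial; /* dummy node representing "unlocked" */
} clh_lock_t;

static inline void
clh_init(clh_lock_t *l)
{
	l->initial.locked = false;
	l->tail = &l->initial;
}

/* Returns the predecessor node; the caller hands it back (or keeps it
 * for reuse) after clh_unlock(), so nodes can be recycled. */
static inline clh_node_t *
clh_lock(clh_lock_t *l, clh_node_t *me)
{
	clh_node_t *pred;

	__atomic_store_n(&me->locked, true, __ATOMIC_RELAXED);
	/* Release: the store above must be visible before we are
	 * enqueued. Acquire: the critical section must not float
	 * above the spin below. */
	pred = __atomic_exchange_n(&l->tail, me, __ATOMIC_ACQ_REL);
	while (__atomic_load_n(&pred->locked, __ATOMIC_ACQUIRE))
		; /* spin on the predecessor's flag */
	return pred;
}

static inline void
clh_unlock(clh_node_t *me)
{
	__atomic_store_n(&me->locked, false, __ATOMIC_RELEASE);
}
```

In steady state a thread recycles the returned predecessor node as its own node for the next acquisition; that recycling is the "logical queue" the quote refers to.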
Honnappa Nagarahalli Nov. 23, 2020, 6:16 p.m. UTC | #12
<snip>

> 
> Has anyone investigated later developments in concurrency?
> While researching MCS Lock discovered this quote:
> https://mfukar.github.io/2017/09/26/mcs.html
> 	Luckily, we don’t have to worry about this very much. MCS locks
> 	right now are mostly a teaching tool, and have mostly been superseded
> by:
> 
> 	CLH locks: Craig, Landin, and Hagersten locks replace the explicit
> 		queue for a logical queue
> 	K42 locks: On-stack information is used instead of keeping a thread-
> local
> 		queue node around.
> 	A similar idea is used by the stack-lock algorithm.
I have not looked at these. I have looked briefly at NUMA-aware locks.

> 
> Note: K42 locks are patented by IBM.
Honnappa Nagarahalli Nov. 23, 2020, 6:29 p.m. UTC | #13
<snip>

> 
>     The initialization me->locked=1 in lock() must happen before
>     next->locked=0 in unlock(), otherwise a thread may hang forever,
>     waiting me->locked become 0. On weak memory systems (such as ARMv8),
>     the current implementation allows me->locked=1 to be reordered with
>     announcing the node (pred->next=me) and, consequently, to be
>     reordered with next->locked=0 in unlock().
> 
>     This fix adds a release barrier to pred->next=me, forcing
>     me->locked=1 to happen before this operation.
> 
> Signed-off-by: Diogo Behrens <diogo.behrens@huawei.com>
The change looks fine to me. I have tested this on a few x86 and Arm machines.
Acked-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> ---
>  lib/librte_eal/include/generic/rte_mcslock.h | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_eal/include/generic/rte_mcslock.h
> b/lib/librte_eal/include/generic/rte_mcslock.h
> index 2bef28351..ce553f547 100644
> --- a/lib/librte_eal/include/generic/rte_mcslock.h
> +++ b/lib/librte_eal/include/generic/rte_mcslock.h
> @@ -68,7 +68,14 @@ rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t
> *me)
>  		 */
>  		return;
>  	}
> -	__atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);
> +	/* The store to me->next above should also complete before the node is
> +	 * visible to the predecessor thread releasing the lock. Hence, the
> +	 * store to prev->next also requires release semantics. Note that, for
> +	 * example, on ARM, the release semantics in the exchange operation is
> +	 * not as strong as a release fence and is not sufficient to enforce
> +	 * the desired order here.
> +	 */
> +	__atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
> 
>  	/* The while-load of me->locked should not move above the previous
>  	 * store to prev->next. Otherwise it will cause a deadlock. Need a
> --
> 2.28.0
>
Stephen Hemminger Nov. 23, 2020, 7:36 p.m. UTC | #14
On Mon, 23 Nov 2020 18:29:32 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> <snip>
> 
> > 
> >     The initialization me->locked=1 in lock() must happen before
> >     next->locked=0 in unlock(), otherwise a thread may hang forever,
> >     waiting me->locked become 0. On weak memory systems (such as ARMv8),
> >     the current implementation allows me->locked=1 to be reordered with
> >     announcing the node (pred->next=me) and, consequently, to be
> >     reordered with next->locked=0 in unlock().
> > 
> >     This fix adds a release barrier to pred->next=me, forcing
> >     me->locked=1 to happen before this operation.
> > 
> > Signed-off-by: Diogo Behrens <diogo.behrens@huawei.com>  
> The change looks fine to me.  I have tested this on few x86 and Arm machines.
> Acked-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Maybe a simpler alternative would be as fast and safer.
By using compare_exchange you can get the same effect in one operation.
Like the following UNTESTED.

diff --git a/lib/librte_eal/include/generic/rte_mcslock.h b/lib/librte_eal/include/generic/rte_mcslock.h
index 78b0df295e2d..9c537ce577e6 100644
--- a/lib/librte_eal/include/generic/rte_mcslock.h
+++ b/lib/librte_eal/include/generic/rte_mcslock.h
@@ -48,23 +48,23 @@ rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t *me)
 	rte_mcslock_t *prev;
 
 	/* Init me node */
-	__atomic_store_n(&me->locked, 1, __ATOMIC_RELAXED);
-	__atomic_store_n(&me->next, NULL, __ATOMIC_RELAXED);
+	me->locked = 1;
 
-	/* If the queue is empty, the exchange operation is enough to acquire
-	 * the lock. Hence, the exchange operation requires acquire semantics.
-	 * The store to me->next above should complete before the node is
-	 * visible to other CPUs/threads. Hence, the exchange operation requires
-	 * release semantics as well.
+	/*
+	 * Atomic insert into single linked list
 	 */
-	prev = __atomic_exchange_n(msl, me, __ATOMIC_ACQ_REL);
+	do {
+		prev = __atomic_load_n(msl, __ATOMIC_RELAXED);
+		me->next = prev;
+	} while (!__atomic_compare_exchange_n(&msl, me, prev,
+					    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
+
 	if (likely(prev == NULL)) {
 		/* Queue was empty, no further action required,
 		 * proceed with lock taken.
 		 */
 		return;
 	}
-	__atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);
 
 	/* The while-load of me->locked should not move above the previous
 	 * store to prev->next. Otherwise it will cause a deadlock. Need a
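A side note on the sketch above: as posted it would not compile, since `__atomic_compare_exchange_n` takes a *pointer* to the expected value plus a weak/strong flag, and it links each new node to its predecessor (a Treiber-style push) rather than building the MCS `next` chain. A hypothetical compile-clean form of the CAS loop, using a local stand-in for `rte_mcslock_t`, might look like this (illustrative only, with the ordering concerns raised later in the thread still applying):

```c
#include <stddef.h>

/* Local stand-in for rte_mcslock_t (illustrative only). */
typedef struct mcs_node {
	struct mcs_node *next;
	int locked;
} mcs_node_t;

/* Atomically push 'me' onto the list headed by *head and return the
 * previous head. Note the link direction: me->next points at the old
 * head, i.e. at the predecessor -- the opposite of the MCS queue. */
static inline mcs_node_t *
mcs_push(mcs_node_t **head, mcs_node_t *me)
{
	mcs_node_t *prev = __atomic_load_n(head, __ATOMIC_RELAXED);

	do {
		me->next = prev;
		/* ACQ_REL on success so the link store above cannot
		 * sink below the CAS. On failure the builtin reloads
		 * *head into prev, so the loop re-links and retries. */
	} while (!__atomic_compare_exchange_n(head, &prev, me,
			0 /* strong */, __ATOMIC_ACQ_REL, __ATOMIC_RELAXED));
	return prev;
}
```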
Honnappa Nagarahalli Nov. 25, 2020, 4:50 a.m. UTC | #15
<snip>

> >
> > >
> > >     The initialization me->locked=1 in lock() must happen before
> > >     next->locked=0 in unlock(), otherwise a thread may hang forever,
> > >     waiting me->locked become 0. On weak memory systems (such as
> ARMv8),
> > >     the current implementation allows me->locked=1 to be reordered with
> > >     announcing the node (pred->next=me) and, consequently, to be
> > >     reordered with next->locked=0 in unlock().
> > >
> > >     This fix adds a release barrier to pred->next=me, forcing
> > >     me->locked=1 to happen before this operation.
> > >
> > > Signed-off-by: Diogo Behrens <diogo.behrens@huawei.com>
> > The change looks fine to me.  I have tested this on few x86 and Arm machines.
> > Acked-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Maybe a simpler alternative would be as fast and safer.
Why is this safer?

> By using compare_exchange you can get same effect in one operation.
> Like the following UNTESTED.
> 
> diff --git a/lib/librte_eal/include/generic/rte_mcslock.h
> b/lib/librte_eal/include/generic/rte_mcslock.h
> index 78b0df295e2d..9c537ce577e6 100644
> --- a/lib/librte_eal/include/generic/rte_mcslock.h
> +++ b/lib/librte_eal/include/generic/rte_mcslock.h
> @@ -48,23 +48,23 @@ rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t
> *me)
>  	rte_mcslock_t *prev;
> 
>  	/* Init me node */
> -	__atomic_store_n(&me->locked, 1, __ATOMIC_RELAXED);
> -	__atomic_store_n(&me->next, NULL, __ATOMIC_RELAXED);
> +	me->locked = 1;
> 
> -	/* If the queue is empty, the exchange operation is enough to acquire
> -	 * the lock. Hence, the exchange operation requires acquire semantics.
> -	 * The store to me->next above should complete before the node is
> -	 * visible to other CPUs/threads. Hence, the exchange operation
> requires
> -	 * release semantics as well.
> +	/*
> +	 * Atomic insert into single linked list
>  	 */
> -	prev = __atomic_exchange_n(msl, me, __ATOMIC_ACQ_REL);
> +	do {
> +		prev = __atomic_load_n(msl, __ATOMIC_RELAXED);
> +		me->next = prev;
This needs to be __atomic_store_n(__ATOMIC_RELEASE) as it can sink below the following line.

> +	} while (!__atomic_compare_exchange_n(&msl, me, prev,
> +					    __ATOMIC_ACQUIRE,
> __ATOMIC_RELAXED));
> +
>  	if (likely(prev == NULL)) {
>  		/* Queue was empty, no further action required,
>  		 * proceed with lock taken.
>  		 */
>  		return;
>  	}
> -	__atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);
> 
>  	/* The while-load of me->locked should not move above the previous
>  	 * store to prev->next. Otherwise it will cause a deadlock. Need a
Diogo Behrens Nov. 25, 2020, 8:41 a.m. UTC | #16
Hi Stephen, 

the patch we submitted is safe; it has been verified to indeed fix the bug:
besides the rmem and genmc model checkers, we verified the bugfix with the
herd7 tool, which is the official tool ARM provides to check such pieces of
code against their memory model.

Actually, I think there are other barriers in this MCS lock implementation that
could be weakened without causing problems, but that would be an optimization.
This patch is just to fix the bug.

Best regards,
-Diogo

-----Original Message-----
From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com] 
Sent: Wednesday, November 25, 2020 5:51 AM
To: Stephen Hemminger <stephen@networkplumber.org>
Cc: Diogo Behrens <diogo.behrens@huawei.com>; thomas@monjalon.net; david.marchand@redhat.com; dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
Subject: RE: [dpdk-dev] [PATCH] librte_eal: fix mcslock hang on weak memory

<snip>

> >
> > >
> > >     The initialization me->locked=1 in lock() must happen before
> > >     next->locked=0 in unlock(), otherwise a thread may hang forever,
> > >     waiting me->locked become 0. On weak memory systems (such as
> ARMv8),
> > >     the current implementation allows me->locked=1 to be reordered with
> > >     announcing the node (pred->next=me) and, consequently, to be
> > >     reordered with next->locked=0 in unlock().
> > >
> > >     This fix adds a release barrier to pred->next=me, forcing
> > >     me->locked=1 to happen before this operation.
> > >
> > > Signed-off-by: Diogo Behrens <diogo.behrens@huawei.com>
> > The change looks fine to me.  I have tested this on few x86 and Arm machines.
> > Acked-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Maybe a simpler alternative would be as fast and safer.
Why is this safer?

> By using compare_exchange you can get same effect in one operation.
> Like the following UNTESTED.
> 
> diff --git a/lib/librte_eal/include/generic/rte_mcslock.h
> b/lib/librte_eal/include/generic/rte_mcslock.h
> index 78b0df295e2d..9c537ce577e6 100644
> --- a/lib/librte_eal/include/generic/rte_mcslock.h
> +++ b/lib/librte_eal/include/generic/rte_mcslock.h
> @@ -48,23 +48,23 @@ rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t
> *me)
>  	rte_mcslock_t *prev;
> 
>  	/* Init me node */
> -	__atomic_store_n(&me->locked, 1, __ATOMIC_RELAXED);
> -	__atomic_store_n(&me->next, NULL, __ATOMIC_RELAXED);
> +	me->locked = 1;
> 
> -	/* If the queue is empty, the exchange operation is enough to acquire
> -	 * the lock. Hence, the exchange operation requires acquire semantics.
> -	 * The store to me->next above should complete before the node is
> -	 * visible to other CPUs/threads. Hence, the exchange operation
> requires
> -	 * release semantics as well.
> +	/*
> +	 * Atomic insert into single linked list
>  	 */
> -	prev = __atomic_exchange_n(msl, me, __ATOMIC_ACQ_REL);
> +	do {
> +		prev = __atomic_load_n(msl, __ATOMIC_RELAXED);
> +		me->next = prev;
This needs to be __atomic_store_n(__ATOMIC_RELEASE) as it can sink below the following line.

> +	} while (!__atomic_compare_exchange_n(&msl, me, prev,
> +					    __ATOMIC_ACQUIRE,
> __ATOMIC_RELAXED));
> +
>  	if (likely(prev == NULL)) {
>  		/* Queue was empty, no further action required,
>  		 * proceed with lock taken.
>  		 */
>  		return;
>  	}
> -	__atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);
> 
>  	/* The while-load of me->locked should not move above the previous
>  	 * store to prev->next. Otherwise it will cause a deadlock. Need a
Thomas Monjalon Nov. 25, 2020, 2:16 p.m. UTC | #17
> >     The initialization me->locked=1 in lock() must happen before
> >     next->locked=0 in unlock(), otherwise a thread may hang forever,
> >     waiting me->locked become 0. On weak memory systems (such as ARMv8),
> >     the current implementation allows me->locked=1 to be reordered with
> >     announcing the node (pred->next=me) and, consequently, to be
> >     reordered with next->locked=0 in unlock().
> > 
> >     This fix adds a release barrier to pred->next=me, forcing
> >     me->locked=1 to happen before this operation.
> > 
> > Signed-off-by: Diogo Behrens <diogo.behrens@huawei.com>
> The change looks fine to me.  I have tested this on few x86 and Arm machines.
> Acked-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Applied, thanks

Patch
diff mbox series

diff --git a/lib/librte_eal/include/generic/rte_mcslock.h b/lib/librte_eal/include/generic/rte_mcslock.h
index 2bef28351..ce553f547 100644
--- a/lib/librte_eal/include/generic/rte_mcslock.h
+++ b/lib/librte_eal/include/generic/rte_mcslock.h
@@ -68,7 +68,14 @@  rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t *me)
 		 */
 		return;
 	}
-	__atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);
+	/* The store to me->next above should also complete before the node is
+	 * visible to the predecessor thread releasing the lock. Hence, the
+	 * store to prev->next also requires release semantics. Note that, for
+	 * example, on ARM, the release semantics in the exchange operation is
+	 * not as strong as a release fence and is not sufficient to enforce
+	 * the desired order here.
+	 */
+	__atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
 
 	/* The while-load of me->locked should not move above the previous
 	 * store to prev->next. Otherwise it will cause a deadlock. Need a
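To close the loop, here is a self-contained sketch of the whole lock/unlock pair with the accepted fix in place. It is modeled on lib/librte_eal/include/generic/rte_mcslock.h but uses local stand-in names rather than the DPDK API, requires a GCC/Clang compiler for the `__atomic` builtins, and elides DPDK details such as the pause/wait helpers in the spin loops.

```c
#include <stddef.h>

typedef struct mcslock {
	struct mcslock *next;
	int locked;
} mcslock_t;

static inline void
mcs_lock(mcslock_t **msl, mcslock_t *me)
{
	mcslock_t *prev;

	__atomic_store_n(&me->locked, 1, __ATOMIC_RELAXED);
	__atomic_store_n(&me->next, NULL, __ATOMIC_RELAXED);

	prev = __atomic_exchange_n(msl, me, __ATOMIC_ACQ_REL);
	if (prev == NULL)
		return; /* queue was empty, lock taken */

	/* The fix: release ordering ensures me->locked = 1 (and
	 * me->next = NULL) are visible before the predecessor can
	 * see this node and hand the lock over. */
	__atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);

	while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
		; /* spin until the predecessor clears our flag */
}

static inline void
mcs_unlock(mcslock_t **msl, mcslock_t *me)
{
	mcslock_t *expected = me;

	if (__atomic_load_n(&me->next, __ATOMIC_RELAXED) == NULL) {
		/* No known successor: try to swing the tail back to empty. */
		if (__atomic_compare_exchange_n(msl, &expected, NULL, 0,
				__ATOMIC_RELEASE, __ATOMIC_RELAXED))
			return;
		/* A successor is enqueueing; wait for it to link itself. */
		while (__atomic_load_n(&me->next, __ATOMIC_ACQUIRE) == NULL)
			;
	}
	__atomic_store_n(&me->next->locked, 0, __ATOMIC_RELEASE);
}
```

A single-threaded acquire/release round-trip already exercises the empty-queue fast paths; the contended path (the release store this patch fixes) only triggers with a real successor.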