[v1,3/4] ring: fix potential sync issue between head and tail values
Commit Message
This patch serves several purposes:
- provide an alternative (and, I think, better) way to fix the
issue discussed in the previous patch:
"ring/soring: fix synchronization issue between head and tail values"
- make sure that such a problem wouldn’t happen within other usages of
__rte_ring_headtail_move_head() – both the current rte_ring
implementation and possible future use-cases.
- take a step towards unifying the move_head() implementations and
removing rte_ring_generic_pvt.h
It uses Acquire-Release memory ordering for the CAS operation in move_head().
That guarantees that the corresponding ‘tail’ updates will be visible
before the current ‘head’ is updated.
As I said before: I think that in theory the problem described in the
previous patch might happen with our conventional rte_ring too
(when RTE_USE_C11_MEM_MODEL is enabled).
But so far I didn’t manage to reproduce it in reality.
For that reason, and also because it touches a critical rte_ring code-path,
I put these changes into a separate patch. I expect all interested
stakeholders to come up with their comments and observations.
Regarding performance impact – on my boxes both ring_perf_autotest and
ring_stress_autotest show a mixed set of results: some become a
few cycles faster, others a few cycles slower.
But so far, I didn’t notice any real degradation with this patch.
Fixes: b5458e2cc483 ("ring: introduce staged ordered ring")
Fixes: 1cc363b8ce06 ("ring: introduce HTS ring mode")
Fixes: e6ba4731c0f3 ("ring: introduce RTS ring mode")
Fixes: 49594a63147a ("ring/c11: relax ordering for load and store of the head")
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
lib/ring/rte_ring_c11_pvt.h | 27 +++++++++++++++++----------
lib/ring/rte_ring_hts_elem_pvt.h | 6 ++++--
lib/ring/rte_ring_rts_elem_pvt.h | 6 ++++--
lib/ring/soring.c | 5 -----
4 files changed, 25 insertions(+), 19 deletions(-)
Comments
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Wednesday, 21 May 2025 13.15
>
> This patch aims several purposes:
> - provide an alternative (and I think a better) way to fix the
> issue discussed in previous patch:
> "ring/soring: fix synchronization issue between head and tail values"
> - make sure that such problem wouldn’t happen within other usages of
> __rte_ring_headtail_move_head() – both current rte_ring
> implementation and possible future use-cases.
> - step towards unification of move_head() implementations and
> removing rte_ring_generic_pvt.h
> It uses Acquire-Release memory ordering for CAS operation in
> move_head().
> That guarantees that corresponding ‘tail’ updates will be visible
> before current ‘head’ is updated.
> As I said before: I think that in theory the problem described in
> previous patch might happen with our conventional rte_ring too
> (when RTE_USE_C11_MEM_MODEL enabled).
> But, so far I didn’t manage to reproduce it in reality.
Overall, I think the code becomes more elegant and much easier to understand with this patch: the atomic operations are performed explicitly on the head/tail, eliminating the need for the broad-reaching rte_atomic_thread_fence().
The detailed inline code comments are also a good improvement.
> For that reason and also because it touches a critical rte_ring code-
> path,
> I put these changes into a separate patch. Expect all interested
> stakeholders to come-up with their comments and observations.
> Regarding performance impact – on my boxes both ring_perf_autotest and
> ring_stress_autotest – show a mixed set of results: some of them become
> few cycles faster, another few cycles slower.
> But so far, I didn’t notice any real degradations with that patch.
Maybe it was the broad-reaching rte_atomic_thread_fence() that made the C11 variant slow on other architectures.
This makes me curious about performance results on other architectures with this patch.
>
> Fixes: b5458e2cc483 ("ring: introduce staged ordered ring")
> Fixes: 1cc363b8ce06 ("ring: introduce HTS ring mode")
> Fixes: e6ba4731c0f3 ("ring: introduce RTS ring mode")
> Fixes: 49594a63147a ("ring/c11: relax ordering for load and store of
> the head")
>
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> ---
> lib/ring/rte_ring_c11_pvt.h | 27 +++++++++++++++++----------
> lib/ring/rte_ring_hts_elem_pvt.h | 6 ++++--
> lib/ring/rte_ring_rts_elem_pvt.h | 6 ++++--
> lib/ring/soring.c | 5 -----
> 4 files changed, 25 insertions(+), 19 deletions(-)
>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index 0845cd6dcf..6d1c46df9a 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -77,20 +77,19 @@ __rte_ring_headtail_move_head(struct
> rte_ring_headtail *d,
> int success;
> unsigned int max = n;
>
> + /* Ensure the head is read before tail */
Maybe "d->head" and "s->tail" instead of "head" and "tail".
> *old_head = rte_atomic_load_explicit(&d->head,
> - rte_memory_order_relaxed);
> + rte_memory_order_acquire);
> do {
> /* Reset n to the initial burst count */
> n = max;
>
> - /* Ensure the head is read before tail */
> - rte_atomic_thread_fence(rte_memory_order_acquire);
> -
> - /* load-acquire synchronize with store-release of ht->tail
> - * in update_tail.
> + /*
> + * Read s->tail value. Note that it will be loaded after
> + * d->head load, but before CAS operation for the d->head.
> */
> stail = rte_atomic_load_explicit(&s->tail,
> - rte_memory_order_acquire);
> + rte_memory_order_relaxed);
>
> /* The subtraction is done between two unsigned 32bits
> value
> * (the result is always modulo 32 bits even if we have
> @@ -112,11 +111,19 @@ __rte_ring_headtail_move_head(struct
> rte_ring_headtail *d,
> d->head = *new_head;
> success = 1;
> } else
> - /* on failure, *old_head is updated */
> + /*
> + * on failure, *old_head is updated.
> + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist
> + * barrier to prevent:
> + * - OOO reads of cons tail value
> + * - OOO copy of elems from the ring
It's not really the ACQ_REL that does this; it's the ACQUIRE.
So this comment needs some adjustment.
Also maybe "s->tail" instead of "cons tail".
> + * Also RELEASE guarantees that latest tail value
Maybe "latest s->tail" instead of "latest tail".
> + * will become visible before the new head value.
Maybe "new d->head value" instead of "new head value".
> + */
> success =
> rte_atomic_compare_exchange_strong_explicit(
> &d->head, old_head, *new_head,
> - rte_memory_order_relaxed,
> - rte_memory_order_relaxed);
> + rte_memory_order_acq_rel,
> + rte_memory_order_acquire);
> } while (unlikely(success == 0));
> return n;
> }
I haven't reviewed the remaining changes in detail, but I think my feedback that the comments should mention ACQUIRE instead of ACQ_REL also applies to the other files.
> diff --git a/lib/ring/rte_ring_hts_elem_pvt.h
> b/lib/ring/rte_ring_hts_elem_pvt.h
> diff --git a/lib/ring/rte_ring_rts_elem_pvt.h
> b/lib/ring/rte_ring_rts_elem_pvt.h
> diff --git a/lib/ring/soring.c b/lib/ring/soring.c
Hi Konstantin,
In rte_ring the store-release on tail update guarantees that CAS
won't get reordered with the store-released of the tail update.
So, the sequence of events would look like this (combined view
of head and tail update)
Relaxed-load(new_head, N) ----------------> (A)
Relaxed-CAS(d->head, new_head, old_head) ----------------> (B)
Store-release-store(d->tail, new_head) ----------------> (C)
If we look at address dependencies, then...
(B) depends on (A) due to new_head address dependency.
(C) depends on (A) due to new_head address dependency.
So, dependency graph looks like this
(A)
/ \
v v
(B) (C)
There is no implicit dependence between (B) and (C); I think
this is the issue you are bringing up.
Even though there is no dependence between the two,
the store-release of (C) ensures that (B) won't drop below it.
Therefore, the above graph can be turned into an ordered
sequence as shown below..
(A) -> (B) -> (C)
I haven't looked at the so-ring yet. Could it be possible that the
issue is due to something else introduced in that code?
Thanks,
--wathsala
> This patch aims several purposes:
> - provide an alternative (and I think a better) way to fix the
> issue discussed in previous patch:
> "ring/soring: fix synchronization issue between head and tail values"
> - make sure that such problem wouldn’t happen within other usages of
> __rte_ring_headtail_move_head() – both current rte_ring
> implementation and possible future use-cases.
> - step towards unification of move_head() implementations and
> removing rte_ring_generic_pvt.h
> It uses Acquire-Release memory ordering for CAS operation in move_head().
> That guarantees that corresponding ‘tail’ updates will be visible before current
> ‘head’ is updated.
> As I said before: I think that in theory the problem described in previous patch
> might happen with our conventional rte_ring too (when
> RTE_USE_C11_MEM_MODEL enabled).
> But, so far I didn’t manage to reproduce it in reality.
> For that reason and also because it touches a critical rte_ring code-path, I put
> these changes into a separate patch. Expect all interested stakeholders to come-
> up with their comments and observations.
> Regarding performance impact – on my boxes both ring_perf_autotest and
> ring_stress_autotest – show a mixed set of results: some of them become few
> cycles faster, another few cycles slower.
> But so far, I didn’t notice any real degradations with that patch.
>
> Fixes: b5458e2cc483 ("ring: introduce staged ordered ring")
> Fixes: 1cc363b8ce06 ("ring: introduce HTS ring mode")
> Fixes: e6ba4731c0f3 ("ring: introduce RTS ring mode")
> Fixes: 49594a63147a ("ring/c11: relax ordering for load and store of the head")
>
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> ---
> lib/ring/rte_ring_c11_pvt.h | 27 +++++++++++++++++----------
> lib/ring/rte_ring_hts_elem_pvt.h | 6 ++++--
> lib/ring/rte_ring_rts_elem_pvt.h | 6 ++++--
> lib/ring/soring.c | 5 -----
> 4 files changed, 25 insertions(+), 19 deletions(-)
>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index
> 0845cd6dcf..6d1c46df9a 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -77,20 +77,19 @@ __rte_ring_headtail_move_head(struct
> rte_ring_headtail *d,
> int success;
> unsigned int max = n;
>
> + /* Ensure the head is read before tail */
> *old_head = rte_atomic_load_explicit(&d->head,
> - rte_memory_order_relaxed);
> + rte_memory_order_acquire);
> do {
> /* Reset n to the initial burst count */
> n = max;
>
> - /* Ensure the head is read before tail */
> - rte_atomic_thread_fence(rte_memory_order_acquire);
> -
> - /* load-acquire synchronize with store-release of ht->tail
> - * in update_tail.
> + /*
> + * Read s->tail value. Note that it will be loaded after
> + * d->head load, but before CAS operation for the d->head.
> */
> stail = rte_atomic_load_explicit(&s->tail,
> - rte_memory_order_acquire);
> + rte_memory_order_relaxed);
>
> /* The subtraction is done between two unsigned 32bits value
> * (the result is always modulo 32 bits even if we have @@ -
> 112,11 +111,19 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail
> *d,
> d->head = *new_head;
> success = 1;
> } else
> - /* on failure, *old_head is updated */
> + /*
> + * on failure, *old_head is updated.
> + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist
> + * barrier to prevent:
> + * - OOO reads of cons tail value
> + * - OOO copy of elems from the ring
> + * Also RELEASE guarantees that latest tail value
> + * will become visible before the new head value.
> + */
> success =
> rte_atomic_compare_exchange_strong_explicit(
> &d->head, old_head, *new_head,
> - rte_memory_order_relaxed,
> - rte_memory_order_relaxed);
> + rte_memory_order_acq_rel,
> + rte_memory_order_acquire);
> } while (unlikely(success == 0));
> return n;
> }
> diff --git a/lib/ring/rte_ring_hts_elem_pvt.h b/lib/ring/rte_ring_hts_elem_pvt.h
> index c59e5f6420..cc593433b9 100644
> --- a/lib/ring/rte_ring_hts_elem_pvt.h
> +++ b/lib/ring/rte_ring_hts_elem_pvt.h
> @@ -116,13 +116,15 @@ __rte_ring_hts_move_head(struct
> rte_ring_hts_headtail *d,
> np.pos.head = op.pos.head + n;
>
> /*
> - * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> * - OOO reads of cons tail value
> * - OOO copy of elems from the ring
> + * Also RELEASE guarantees that latest tail value
> + * will become visible before the new head value.
> */
> } while (rte_atomic_compare_exchange_strong_explicit(&d->ht.raw,
> (uint64_t *)(uintptr_t)&op.raw, np.raw,
> - rte_memory_order_acquire,
> + rte_memory_order_acq_rel,
> rte_memory_order_acquire) == 0);
>
> *old_head = op.pos.head;
> diff --git a/lib/ring/rte_ring_rts_elem_pvt.h b/lib/ring/rte_ring_rts_elem_pvt.h
> index 509fa674fb..860b13cc61 100644
> --- a/lib/ring/rte_ring_rts_elem_pvt.h
> +++ b/lib/ring/rte_ring_rts_elem_pvt.h
> @@ -131,13 +131,15 @@ __rte_ring_rts_move_head(struct
> rte_ring_rts_headtail *d,
> nh.val.cnt = oh.val.cnt + 1;
>
> /*
> - * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> * - OOO reads of cons tail value
> * - OOO copy of elems to the ring
> + * Also RELEASE guarantees that latest tail value
> + * will become visible before the new head value.
> */
> } while (rte_atomic_compare_exchange_strong_explicit(&d->head.raw,
> (uint64_t *)(uintptr_t)&oh.raw, nh.raw,
> - rte_memory_order_acquire,
> + rte_memory_order_acq_rel,
> rte_memory_order_acquire) == 0);
>
> *old_head = oh.val.pos;
> diff --git a/lib/ring/soring.c b/lib/ring/soring.c index 7bcbf35516..21a1a27e24
> 100644
> --- a/lib/ring/soring.c
> +++ b/lib/ring/soring.c
> @@ -123,8 +123,6 @@ __rte_soring_stage_finalize(struct
> soring_stage_headtail *sht, uint32_t stage,
> rte_atomic_store_explicit(&sht->tail.raw, ot.raw,
> rte_memory_order_release);
>
> - /* make sure that new tail value is visible */
> - rte_atomic_thread_fence(rte_memory_order_release);
> return i;
> }
>
> @@ -219,9 +217,6 @@ __rte_soring_update_tail(struct __rte_ring_headtail
> *rht,
> /* unsupported mode, shouldn't be here */
> RTE_ASSERT(0);
> }
> -
> - /* make sure that new tail value is visible */
> - rte_atomic_thread_fence(rte_memory_order_release);
> }
>
> static __rte_always_inline uint32_t
> --
> 2.43.0
Hi Wathsala,
>
> Hi Konstanin,
>
> In rte_ring the store-release on tail update guarantees that CAS
> won't get reordered with the store-released of the tail update.
>
> So, the sequence of events would look like this (combined view
> of head and tail update)
>
> Releaxed-load(new_head, N) ----------------> (A)
> Relaxed-CAS(d->head, new_head, old_head) ----------------> (B)
> Store-release-store(d->tail, new_head) ----------------> (C)
>
> If we look at address dependencies, then...
>
> (B) depends on (A) due to new_head address dependency.
> (C) depends on (A) due to new_head address dependency.
>
> So, dependency graph looks like this
> (A)
> / \
> v v
> (B) (C)
>
> There is no implicit dependence between (B) and (C), I think
> this is the issue you are brining up.
> Even though there is no dependence between the two,
> the store-release of (C) ensures that (B) won't drop below it.
> Therefore, the above graph can be turned into an ordered
> sequence as shown below..
>
> (A) -> (B) -> (C)
I do agree that with the current implementation of __rte_ring_headtail_move_head()
in lib/ring/rte_ring_c11_pvt.h the order of these 3 operations (A->B->C) should be sequential.
The problem I am talking about is a different one:
a thread can see the 'latest' 'cons.head' value with a 'previous' value for 'prod.tail', or vice versa.
In other words: the 'cons.head' value depends on 'prod.tail', so before making the latest 'cons.head'
value visible to other threads, we need to ensure that the latest 'prod.tail' is also visible.
Let me try to explain it on the example:
Suppose at some moment we have:
prod={.head=2,.tail=1};
cons={.head=0,.tail=0};
I.e. thread #1 is in the process of enqueueing one more element into the ring.
Thread #1 Thread #2
T0:
store(&prod.tail, 2, release);
/* AFAIU this is somewhat equivalent to: wmb(); prod.tail = 2;
 * i.e. it guarantees that all stores initiated before that operation will be visible
 * to other threads at the same moment as, or before, the new value of prod.tail becomes
 * visible; but it doesn't guarantee that the new prod.tail value will be visible to other
 * threads immediately.
 */
...
move_cons_head(...,n=2) move_cons_head(...,n=1)
... ...
T1:
*old_head = load(&cons.head, relaxed);
fence(acquire);
/*old_head == 0, no surprises */
stail = load(&prod.tail, acquire);
/*stail == 2, no surprises */
*entries = (capacity + stail - *old_head);
*new_head = *old_head + n;
/* *entries == (2 - 0) == 2; *new_head==2; all good */
...
T2:
*old_head = load(&cons.head, relaxed);
fence(acquire);
/*old_head == 0, no surprises */
stail = load(&prod.tail, acquire);
/* !!!!! stail == 1 !!!!! for Thread 2
* Even though we do use acquire here - there was no *release* after store(&prod.tail).
* So, still no guarantee that Thread 2 will see latest prod.tail value.
*/
*entries = (capacity + stail - *old_head);
/* *entries == (1 - 0) == 1, still ok */
*new_head = *old_head + n;
/* *new_head == 1 */
T3:
success = CAS(&cons.head,
old_head /*==0*/, *new_head /*==2*/,
relaxed, relaxed);
/*success==1, cons.head==2*/
...
T4:
success = CAS(&cons.head,
old_head /*==0*/, *new_head /*==1*/,
relaxed, relaxed);
/*success==0, *old_head==2*/
/* CAS() failed and provided Thread 2 with the latest value for cons.head (==2).
 * Thread 2 repeats the attempt, starts second iteration
 */
fence(acquire);
stail = load(&prod.tail, acquire);
/* !!!!! stail == 1 !!!!! for Thread 2
 * Still no *release* has happened after store(&prod.tail) at T0.
 * So, still no guarantee that Thread 2 will see the latest prod.tail value.
 */
*entries = (capacity + stail - *old_head);
*new_head = *old_head + n;
/* !!!!! *entries == (1 - 2) == -1(UINT32_MAX); *new_head==(2+1)==3; !!!!!
* we are about to corrupt our ring !!!!!
*/
>
> I haven't looked at the so-ring yet. Could it be possible that the
> issue is due to something else introduced in that code?
Well, as I said, so far I wasn't able to reproduce this problem with the conventional
ring (ring_stress_autotest); only soring_stress_autotest is failing, and
for now only on one specific ARM platform.
Regarding a soring-specific fix (without touching common code) -
sure, it is also possible, pls see patch #2.
There I just added 'fence(release);' straight after 'store(&tail);'.
That seems enough to fix the problem within the soring only.
Though, from my understanding, rte_ring might also be affected;
that's why I went ahead and prepared this patch.
If you feel that I am missing something here - pls shout.
Konstantin
> Thanks,
>
> --wathsala
Hi Konstantin,
Thanks for the clarification, I now see your concern.
So, in summary, what you are saying is that the tail update from
Thread #1 that happened at T0 is not observed by Thread #2
at T2, when it computes new_head and the entries calculation.
That cannot happen on Arm v8/v9, because the tail update is a
store-release, and a load-acquire that follows it in program order
can only be issued after all the cores have observed the
store-release (there is a synchronizes-with relationship between
store-release and load-acquire pairs).
In the example you have provided, Thread #1's
store(&prod.tail, 2, release) is observed by all the cores in the
system by the time the load(&prod.tail, acquire) at T2 is performed.
As per the Arm v8/v9 memory model, Thread #2 should observe
prod.tail=2 (not prod.tail=1).
Arm Architecture Reference Manual section B2.10.11 states…
"Where a Load-Acquire appears in program order after a Store-Release,
the memory access generated by the Store-Release instruction is
observed by each PE to the extent that PE is required to observe the
access coherently, before the memory access generated by the
Load-Acquire instruction is observed by that PE, to the extent that the
PE is required to observe the access coherently."
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Monday, May 26, 2025 6:54 AM
> To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> jerinj@marvell.com; hemant.agrawal@nxp.com; drc@linux.ibm.com; nd
> <nd@arm.com>
> Subject: RE: [PATCH v1 3/4] ring: fix potential sync issue between head and tail
> values
>
> Hi Wathsala,
>
> >
> > Hi Konstanin,
> >
> > In rte_ring the store-release on tail update guarantees that CAS
> > won't get reordered with the store-released of the tail update.
> >
> > So, the sequence of events would look like this (combined view
> > of head and tail update)
> >
> > Releaxed-load(new_head, N) ----------------> (A)
> > Relaxed-CAS(d->head, new_head, old_head) ----------------> (B)
> > Store-release-store(d->tail, new_head) ----------------> (C)
> >
> > If we look at address dependencies, then...
> >
> > (B) depends on (A) due to new_head address dependency.
> > (C) depends on (A) due to new_head address dependency.
> >
> > So, dependency graph looks like this
> > (A)
> > / \
> > v v
> > (B) (C)
> >
> > There is no implicit dependence between (B) and (C), I think
> > this is the issue you are brining up.
> > Even though there is no dependence between the two,
> > the store-release of (C) ensures that (B) won't drop below it.
> > Therefore, the above graph can be turned into an ordered
> > sequence as shown below..
> >
> > (A) -> (B) -> (C)
>
> I do agree that with current implementation of
> __rte_ring_headtail_move_head()
> in lib/ring/rte_ring_c11_pvt.h the order of these 3 operations (A->B->C)
> should be sequential.
> The problem I am talking about is a different one:
> thread can see 'latest' 'cons.head' value, with 'previous' value for 'prod.tail' or
> visa-versa.
> In other words: 'cons.head' value depends on 'prod.tail', so before making
> latest 'cons.head'
> value visible to other threads, we need to ensure that latest 'prod.tail' is also
> visible.
> Let me try to explain it on the example:
>
> Suppose at some moment we have:
> prod={.head=2,.tail=1};
> cons={.head=0,.tail=0};
> I.e. thead #1 is in process to enqueue one more element into the ring.
>
> Thread #1 Thread #2
> T0:
> store(&prod.tail, 2, release);
> /*AFAIU: this is somewhat equivalent to: wmb(); prod.tail=2;
> * I.E. - it guarantees that all stores initiated before that operation will be
> visible
> * by other threads at the same moment or before new value of prod.tail wlll
> become
> * visible, but it doesn't guarantee that new prod.tail value will be visible to
> other
> * threads immediately.
> */
> ...
> move_cons_head(...,n=2) move_cons_head(...,n=1)
> ... ...
> T1:
> *old_head = load(&cons.head, relaxed);
> fence(acquire);
> /*old_head == 0, no surprises */
> stail = load(&prod.tail, acquire);
> /*stail == 2, no surprises */
> *entries = (capacity + stail - *old_head);
> *new_head = *old_head + n;
> /* *entries == (2 - 0) == 2; *new_head==2; all good */
> ...
> T2:
> *old_head =
> load(&cons.head, relaxed);
> fence(acquire);
> /*old_head == 0, no surprises
> */
> stail = load(&prod.tail,
> acquire);
> /* !!!!! stail == 1 !!!!! for Thread 2
> * Even though we do use acquire here - there was no *release* after
> store(&prod.tail).
> * So, still no guarantee that Thread 2 will see latest prod.tail value.
> */
> *entries = (capacity + stail -
> *old_head);
> /* *entries == (1 - 0) == 1, still
> ok */
> *new_head = *old_head + n;
> /* *new_head == 1 */
> T3:
> success = CAS(&cons.head,
> old_head /*==0/, *new_head /*==2*/,
> relaxed, relaxed);
> /*success==1, cons.head==2*/
> ...
> T4:
> success = CAS(&cons.head,
> old_head /*==0/, *new_head
> /*==1*/,
> relaxed, relaxed);
> /*success==0, *old_head==2*/
> /* CAS() failed and provided Thread 2 with latest valued for cons.head(==2)
> * Thread 2 repeats attempt, starts second iteration
> */
> fence(acquire);
> stail = load(&prod.tail, acquire);
> /* !!!!! stail == 1 !!!!! for Thread 2
> * Still no *release* had happened after store(&prod.tail) at T0.
> * So, still no guarantee that Thread 2 will see latest prod.tail value.
> */
> *entries = (capacity + stail -
> *old_head);
> *new_head = *old_head + n;
>
> /* !!!!! *entries == (1 - 2) == -1(UINT32_MAX); *new_head==(2+1)==3; !!!!!
> * we are about to corrupt our ring !!!!!
> */
>
> >
> > I haven't looked at the so-ring yet. Could it be possible that the
> > issue is due to something else introduced in that code?
>
> Well, as I said, so far I wasn't able to re-produce this problem with
> conventional
> ring (ring_stress_autotest), only soring_stress_autotest is failing and
> for now - only on one specific ARM platform.
> Regarding soring specific fix (without touching common code) -
> sure it is also possible, pls see patch #2.
> There I just added 'fence(release);' straight after 'store(&tail);'
> That's seems enough to fix that problem within the soring only.
> Though, from my understanding rte_ring might also be affected,
> that's why I went ahead and prepared that patch.
> If you feel, that I a missing something here - pls shout.
> Konstantin
>
>
> > Thanks,
> >
> > --wathsala
> >
> > > This patch aims several purposes:
> > > - provide an alternative (and I think a better) way to fix the
> > > issue discussed in previous patch:
> > > "ring/soring: fix synchronization issue between head and tail values"
> > > - make sure that such problem wouldn’t happen within other usages of
> > > __rte_ring_headtail_move_head() – both current rte_ring
> > > implementation and possible future use-cases.
> > > - step towards unification of move_head() implementations and
> > > removing rte_ring_generic_pvt.h
> > > It uses Acquire-Release memory ordering for CAS operation in
> move_head().
> > > That guarantees that corresponding ‘tail’ updates will be visible before
> current
> > > ‘head’ is updated.
> > > As I said before: I think that in theory the problem described in previous
> patch
> > > might happen with our conventional rte_ring too (when
> > > RTE_USE_C11_MEM_MODEL enabled).
> > > But, so far I didn’t manage to reproduce it in reality.
> > > For that reason and also because it touches a critical rte_ring code-path, I
> put
> > > these changes into a separate patch. Expect all interested stakeholders to
> come-
> > > up with their comments and observations.
> > > Regarding performance impact – on my boxes both ring_perf_autotest and
> > > ring_stress_autotest – show a mixed set of results: some of them become
> few
> > > cycles faster, another few cycles slower.
> > > But so far, I didn’t notice any real degradations with that patch.
> > >
> > > Fixes: b5458e2cc483 ("ring: introduce staged ordered ring")
> > > Fixes: 1cc363b8ce06 ("ring: introduce HTS ring mode")
> > > Fixes: e6ba4731c0f3 ("ring: introduce RTS ring mode")
> > > Fixes: 49594a63147a ("ring/c11: relax ordering for load and store of the
> head")
> > >
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > ---
> > > lib/ring/rte_ring_c11_pvt.h | 27 +++++++++++++++++----------
> > > lib/ring/rte_ring_hts_elem_pvt.h | 6 ++++--
> lib/ring/rte_ring_rts_elem_pvt.h
> > > | 6 ++++--
> > > lib/ring/soring.c | 5 -----
> > > 4 files changed, 25 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index
> > > 0845cd6dcf..6d1c46df9a 100644
> > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > @@ -77,20 +77,19 @@ __rte_ring_headtail_move_head(struct
> > > rte_ring_headtail *d,
> > > int success;
> > > unsigned int max = n;
> > >
> > > + /* Ensure the head is read before tail */
> > > *old_head = rte_atomic_load_explicit(&d->head,
> > > - rte_memory_order_relaxed);
> > > + rte_memory_order_acquire);
> > > do {
> > > /* Reset n to the initial burst count */
> > > n = max;
> > >
> > > - /* Ensure the head is read before tail */
> > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > -
> > > - /* load-acquire synchronize with store-release of ht->tail
> > > - * in update_tail.
> > > + /*
> > > + * Read s->tail value. Note that it will be loaded after
> > > + * d->head load, but before CAS operation for the d->head.
> > > */
> > > stail = rte_atomic_load_explicit(&s->tail,
> > > - rte_memory_order_acquire);
> > > + rte_memory_order_relaxed);
> > >
> > > /* The subtraction is done between two unsigned 32bits value
> > > * (the result is always modulo 32 bits even if we have @@ -
> > > 112,11 +111,19 @@ __rte_ring_headtail_move_head(struct
> rte_ring_headtail
> > > *d,
> > > d->head = *new_head;
> > > success = 1;
> > > } else
> > > - /* on failure, *old_head is updated */
> > > + /*
> > > + * on failure, *old_head is updated.
> > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist
> > > + * barrier to prevent:
> > > + * - OOO reads of cons tail value
> > > + * - OOO copy of elems from the ring
> > > + * Also RELEASE guarantees that latest tail value
> > > + * will become visible before the new head value.
> > > + */
> > > success =
> > > rte_atomic_compare_exchange_strong_explicit(
> > > &d->head, old_head, *new_head,
> > > - rte_memory_order_relaxed,
> > > - rte_memory_order_relaxed);
> > > + rte_memory_order_acq_rel,
> > > + rte_memory_order_acquire);
> > > } while (unlikely(success == 0));
> > > return n;
> > > }
> > > diff --git a/lib/ring/rte_ring_hts_elem_pvt.h
> b/lib/ring/rte_ring_hts_elem_pvt.h
> > > index c59e5f6420..cc593433b9 100644
> > > --- a/lib/ring/rte_ring_hts_elem_pvt.h
> > > +++ b/lib/ring/rte_ring_hts_elem_pvt.h
> > > @@ -116,13 +116,15 @@ __rte_ring_hts_move_head(struct
> > > rte_ring_hts_headtail *d,
> > > np.pos.head = op.pos.head + n;
> > >
> > > /*
> > > - * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> > > * - OOO reads of cons tail value
> > > * - OOO copy of elems from the ring
> > > + * Also RELEASE guarantees that latest tail value
> > > + * will become visible before the new head value.
> > > */
> > > } while (rte_atomic_compare_exchange_strong_explicit(&d->ht.raw,
> > > (uint64_t *)(uintptr_t)&op.raw, np.raw,
> > > - rte_memory_order_acquire,
> > > + rte_memory_order_acq_rel,
> > > rte_memory_order_acquire) == 0);
> > >
> > > *old_head = op.pos.head;
> > > diff --git a/lib/ring/rte_ring_rts_elem_pvt.h
> b/lib/ring/rte_ring_rts_elem_pvt.h
> > > index 509fa674fb..860b13cc61 100644
> > > --- a/lib/ring/rte_ring_rts_elem_pvt.h
> > > +++ b/lib/ring/rte_ring_rts_elem_pvt.h
> > > @@ -131,13 +131,15 @@ __rte_ring_rts_move_head(struct
> > > rte_ring_rts_headtail *d,
> > > nh.val.cnt = oh.val.cnt + 1;
> > >
> > > /*
> > > - * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> > > * - OOO reads of cons tail value
> > > * - OOO copy of elems to the ring
> > > + * Also RELEASE guarantees that latest tail value
> > > + * will become visible before the new head value.
> > > */
> > > } while (rte_atomic_compare_exchange_strong_explicit(&d-
> >head.raw,
> > > (uint64_t *)(uintptr_t)&oh.raw, nh.raw,
> > > - rte_memory_order_acquire,
> > > + rte_memory_order_acq_rel,
> > > rte_memory_order_acquire) == 0);
> > >
> > > *old_head = oh.val.pos;
> > > diff --git a/lib/ring/soring.c b/lib/ring/soring.c index
> 7bcbf35516..21a1a27e24
> > > 100644
> > > --- a/lib/ring/soring.c
> > > +++ b/lib/ring/soring.c
> > > @@ -123,8 +123,6 @@ __rte_soring_stage_finalize(struct
> > > soring_stage_headtail *sht, uint32_t stage,
> > > rte_atomic_store_explicit(&sht->tail.raw, ot.raw,
> > > rte_memory_order_release);
> > >
> > > - /* make sure that new tail value is visible */
> > > - rte_atomic_thread_fence(rte_memory_order_release);
> > > return i;
> > > }
> > >
> > > @@ -219,9 +217,6 @@ __rte_soring_update_tail(struct
> __rte_ring_headtail
> > > *rht,
> > > /* unsupported mode, shouldn't be here */
> > > RTE_ASSERT(0);
> > > }
> > > -
> > > - /* make sure that new tail value is visible */
> > > - rte_atomic_thread_fence(rte_memory_order_release);
> > > }
> > >
> > > static __rte_always_inline uint32_t
> > > --
> > > 2.43.0
> Hi Konstantin,
>
> Thanks for the clarification, I now see your concern.
>
> So, in summary, what you are saying is that the tail update from
> Thread #1 that happened at T0 is not observed by Thread #2
> at T2 when it computes the new_head and entries values.
> That cannot happen in Arm v8/v9 because the tail update is a
> store-release, and a load-acquire that follows it in program order
> can only be issued after all the cores have observed the
> store-release (there is a synchronizes-with relationship between
> store-release and load-acquire pairs).
>
> In the example you have provided, Thread #1's
> store(&prod.tail, 2, release) is observed by all the cores in the
> system by the time Thread #2 performs load(&prod.tail, acquire)
> at T2. As per the Arm v8/v9 memory model, Thread #2 should observe
> prod.tail=2 (not prod.tail=1).
>
> Arm Architecture Reference Manual section B2.10.11 states…
>
> "Where a Load-Acquire appears in program order after a Store-Release,
> the memory access generated by the Store-Release instruction is
> observed by each PE to the extent that PE is required to observe the
> access coherently, before the memory access generated by the
> Load-Acquire instruction is observed by that PE, to the extent that the
> PE is required to observe the access coherently."
>
In soring, is this the pair that updates the tail and moves the head?
__rte_soring_update_tail:
__rte_ring_update_tail(&rht->ht, head, next, st, enq);
__rte_soring_move_cons_head:
__rte_ring_headtail_move_head(&r->cons.ht, &r->stage[stage].ht, 0, ...);
If so, are &rht->ht and &r->stage[stage].ht the same address? If they are
not, then you will run into the issue you have seen (a.k.a. "other-multi-copy
atomicity", which is legit in Arm v8 and above).
Thanks.
--wathsala
> > -----Original Message-----
> > From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > Sent: Monday, May 26, 2025 6:54 AM
> > To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> > dev@dpdk.org
> > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > jerinj@marvell.com; hemant.agrawal@nxp.com; drc@linux.ibm.com; nd
> > <nd@arm.com>
> > Subject: RE: [PATCH v1 3/4] ring: fix potential sync issue between head and
> tail
> > values
> >
> > Hi Wathsala,
> >
> > >
> > > Hi Konstantin,
> > >
> > > In rte_ring the store-release on tail update guarantees that CAS
> > > won't get reordered with the store-released of the tail update.
> > >
> > > So, the sequence of events would look like this (combined view
> > > of head and tail update)
> > >
> > > Relaxed-load(new_head, N) ----------------> (A)
> > > Relaxed-CAS(d->head, new_head, old_head) ----------------> (B)
> > > Store-release-store(d->tail, new_head) ----------------> (C)
> > >
> > > If we look at address dependencies, then...
> > >
> > > (B) depends on (A) due to new_head address dependency.
> > > (C) depends on (A) due to new_head address dependency.
> > >
> > > So, dependency graph looks like this
> > > (A)
> > > / \
> > > v v
> > > (B) (C)
> > >
> > > There is no implicit dependence between (B) and (C), I think
> > > this is the issue you are bringing up.
> > > Even though there is no dependence between the two,
> > > the store-release of (C) ensures that (B) won't drop below it.
> > > Therefore, the above graph can be turned into an ordered
> > > sequence as shown below..
> > >
> > > (A) -> (B) -> (C)
> >
> > I do agree that with current implementation of
> > __rte_ring_headtail_move_head()
> > in lib/ring/rte_ring_c11_pvt.h the order of these 3 operations (A->B->C)
> > should be sequential.
> > The problem I am talking about is a different one:
> > thread can see 'latest' 'cons.head' value, with 'previous' value for 'prod.tail' or
> > visa-versa.
> > In other words: 'cons.head' value depends on 'prod.tail', so before making
> > latest 'cons.head'
> > value visible to other threads, we need to ensure that latest 'prod.tail' is also
> > visible.
> > Let me try to explain it on the example:
> >
> > Suppose at some moment we have:
> > prod={.head=2,.tail=1};
> > cons={.head=0,.tail=0};
> > I.e. thread #1 is in the process of enqueueing one more element into the ring.
> >
> > Thread #1 Thread #2
> > T0:
> > store(&prod.tail, 2, release);
> > /*AFAIU: this is somewhat equivalent to: wmb(); prod.tail=2;
> > * I.E. - it guarantees that all stores initiated before that operation will be
> > visible
> > * by other threads at the same moment or before new value of prod.tail will
> > become
> > * visible, but it doesn't guarantee that new prod.tail value will be visible to
> > other
> > * threads immediately.
> > */
> > ...
> > move_cons_head(...,n=2)
> move_cons_head(...,n=1)
> > ... ...
> > T1:
> > *old_head = load(&cons.head, relaxed);
> > fence(acquire);
> > /*old_head == 0, no surprises */
> > stail = load(&prod.tail, acquire);
> > /*stail == 2, no surprises */
> > *entries = (capacity + stail - *old_head);
> > *new_head = *old_head + n;
> > /* *entries == (2 - 0) == 2; *new_head==2; all good */
> > ...
> > T2:
> > *old_head =
> > load(&cons.head, relaxed);
> > fence(acquire);
> > /*old_head == 0, no
> surprises
> > */
> > stail = load(&prod.tail,
> > acquire);
> > /* !!!!! stail == 1 !!!!! for Thread 2
> > * Even though we do use acquire here - there was no *release* after
> > store(&prod.tail).
> > * So, still no guarantee that Thread 2 will see latest prod.tail value.
> > */
> > *entries = (capacity + stail -
> > *old_head);
> > /* *entries == (1 - 0) == 1,
> still
> > ok */
> > *new_head = *old_head + n;
> > /* *new_head == 1 */
> > T3:
> > success = CAS(&cons.head,
> > old_head /*==0/, *new_head /*==2*/,
> > relaxed, relaxed);
> > /*success==1, cons.head==2*/
> > ...
> > T4:
> > success = CAS(&cons.head,
> > old_head /*==0/, *new_head
> > /*==1*/,
> > relaxed, relaxed);
> > /*success==0, *old_head==2*/
> > /* CAS() failed and provided Thread 2 with latest valued for cons.head(==2)
> > * Thread 2 repeats attempt, starts second iteration
> > */
> > fence(acquire);
> > stail = load(&prod.tail, acquire);
> > /* !!!!! stail == 1 !!!!! for Thread 2
> > * Still no *release* had happened after store(&prod.tail) at T0.
> > * So, still no guarantee that Thread 2 will see latest prod.tail value.
> > */
> > *entries = (capacity + stail -
> > *old_head);
> > *new_head = *old_head + n;
> >
> > /* !!!!! *entries == (1 - 2) == -1(UINT32_MAX); *new_head==(2+1)==3; !!!!!
> > * we are about to corrupt our ring !!!!!
> > */
> >
> > >
> > > I haven't looked at the so-ring yet. Could it be possible that the
> > > issue is due to something else introduced in that code?
> >
> > Well, as I said, so far I wasn't able to re-produce this problem with
> > conventional
> > ring (ring_stress_autotest), only soring_stress_autotest is failing and
> > for now - only on one specific ARM platform.
> > Regarding soring specific fix (without touching common code) -
> > sure it is also possible, pls see patch #2.
> > There I just added 'fence(release);' straight after 'store(&tail);'
> > That's seems enough to fix that problem within the soring only.
> > Though, from my understanding rte_ring might also be affected,
> > that's why I went ahead and prepared that patch.
> > If you feel, that I a missing something here - pls shout.
> > Konstantin
> >
> >
> > > Thanks,
> > >
> > > --wathsala
> > >
> > > > This patch aims several purposes:
> > > > - provide an alternative (and I think a better) way to fix the
> > > > issue discussed in previous patch:
> > > > "ring/soring: fix synchronization issue between head and tail values"
> > > > - make sure that such problem wouldn’t happen within other usages of
> > > > __rte_ring_headtail_move_head() – both current rte_ring
> > > > implementation and possible future use-cases.
> > > > - step towards unification of move_head() implementations and
> > > > removing rte_ring_generic_pvt.h
> > > > It uses Acquire-Release memory ordering for CAS operation in
> > move_head().
> > > > That guarantees that corresponding ‘tail’ updates will be visible before
> > current
> > > > ‘head’ is updated.
> > > > As I said before: I think that in theory the problem described in previous
> > patch
> > > > might happen with our conventional rte_ring too (when
> > > > RTE_USE_C11_MEM_MODEL enabled).
> > > > But, so far I didn’t manage to reproduce it in reality.
> > > > For that reason and also because it touches a critical rte_ring code-path, I
> > put
> > > > these changes into a separate patch. Expect all interested stakeholders to
> > come-
> > > > up with their comments and observations.
> > > > Regarding performance impact – on my boxes both ring_perf_autotest
> and
> > > > ring_stress_autotest – show a mixed set of results: some of them
> become
> > few
> > > > cycles faster, another few cycles slower.
> > > > But so far, I didn’t notice any real degradations with that patch.
> > > >
> > > > Fixes: b5458e2cc483 ("ring: introduce staged ordered ring")
> > > > Fixes: 1cc363b8ce06 ("ring: introduce HTS ring mode")
> > > > Fixes: e6ba4731c0f3 ("ring: introduce RTS ring mode")
> > > > Fixes: 49594a63147a ("ring/c11: relax ordering for load and store of the
> > head")
> > > >
> > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > > ---
> > > > lib/ring/rte_ring_c11_pvt.h | 27 +++++++++++++++++----------
> > > > lib/ring/rte_ring_hts_elem_pvt.h | 6 ++++--
> > lib/ring/rte_ring_rts_elem_pvt.h
> > > > | 6 ++++--
> > > > lib/ring/soring.c | 5 -----
> > > > 4 files changed, 25 insertions(+), 19 deletions(-)
> > > >
> > > > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index
> > > > 0845cd6dcf..6d1c46df9a 100644
> > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > @@ -77,20 +77,19 @@ __rte_ring_headtail_move_head(struct
> > > > rte_ring_headtail *d,
> > > > int success;
> > > > unsigned int max = n;
> > > >
> > > > + /* Ensure the head is read before tail */
> > > > *old_head = rte_atomic_load_explicit(&d->head,
> > > > - rte_memory_order_relaxed);
> > > > + rte_memory_order_acquire);
> > > > do {
> > > > /* Reset n to the initial burst count */
> > > > n = max;
> > > >
> > > > - /* Ensure the head is read before tail */
> > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > -
> > > > - /* load-acquire synchronize with store-release of ht->tail
> > > > - * in update_tail.
> > > > + /*
> > > > + * Read s->tail value. Note that it will be loaded after
> > > > + * d->head load, but before CAS operation for the d->head.
> > > > */
> > > > stail = rte_atomic_load_explicit(&s->tail,
> > > > - rte_memory_order_acquire);
> > > > + rte_memory_order_relaxed);
> > > >
> > > > /* The subtraction is done between two unsigned 32bits value
> > > > * (the result is always modulo 32 bits even if we have @@ -
> > > > 112,11 +111,19 @@ __rte_ring_headtail_move_head(struct
> > rte_ring_headtail
> > > > *d,
> > > > d->head = *new_head;
> > > > success = 1;
> > > > } else
> > > > - /* on failure, *old_head is updated */
> > > > + /*
> > > > + * on failure, *old_head is updated.
> > > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist
> > > > + * barrier to prevent:
> > > > + * - OOO reads of cons tail value
> > > > + * - OOO copy of elems from the ring
> > > > + * Also RELEASE guarantees that latest tail value
> > > > + * will become visible before the new head value.
> > > > + */
> > > > success =
> > > > rte_atomic_compare_exchange_strong_explicit(
> > > > &d->head, old_head, *new_head,
> > > > - rte_memory_order_relaxed,
> > > > - rte_memory_order_relaxed);
> > > > + rte_memory_order_acq_rel,
> > > > + rte_memory_order_acquire);
> > > > } while (unlikely(success == 0));
> > > > return n;
> > > > }
> > > > diff --git a/lib/ring/rte_ring_hts_elem_pvt.h
> > b/lib/ring/rte_ring_hts_elem_pvt.h
> > > > index c59e5f6420..cc593433b9 100644
> > > > --- a/lib/ring/rte_ring_hts_elem_pvt.h
> > > > +++ b/lib/ring/rte_ring_hts_elem_pvt.h
> > > > @@ -116,13 +116,15 @@ __rte_ring_hts_move_head(struct
> > > > rte_ring_hts_headtail *d,
> > > > np.pos.head = op.pos.head + n;
> > > >
> > > > /*
> > > > - * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> > > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> > > > * - OOO reads of cons tail value
> > > > * - OOO copy of elems from the ring
> > > > + * Also RELEASE guarantees that latest tail value
> > > > + * will become visible before the new head value.
> > > > */
> > > > } while (rte_atomic_compare_exchange_strong_explicit(&d->ht.raw,
> > > > (uint64_t *)(uintptr_t)&op.raw, np.raw,
> > > > - rte_memory_order_acquire,
> > > > + rte_memory_order_acq_rel,
> > > > rte_memory_order_acquire) == 0);
> > > >
> > > > *old_head = op.pos.head;
> > > > diff --git a/lib/ring/rte_ring_rts_elem_pvt.h
> > b/lib/ring/rte_ring_rts_elem_pvt.h
> > > > index 509fa674fb..860b13cc61 100644
> > > > --- a/lib/ring/rte_ring_rts_elem_pvt.h
> > > > +++ b/lib/ring/rte_ring_rts_elem_pvt.h
> > > > @@ -131,13 +131,15 @@ __rte_ring_rts_move_head(struct
> > > > rte_ring_rts_headtail *d,
> > > > nh.val.cnt = oh.val.cnt + 1;
> > > >
> > > > /*
> > > > - * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> > > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> > > > * - OOO reads of cons tail value
> > > > * - OOO copy of elems to the ring
> > > > + * Also RELEASE guarantees that latest tail value
> > > > + * will become visible before the new head value.
> > > > */
> > > > } while (rte_atomic_compare_exchange_strong_explicit(&d-
> > >head.raw,
> > > > (uint64_t *)(uintptr_t)&oh.raw, nh.raw,
> > > > - rte_memory_order_acquire,
> > > > + rte_memory_order_acq_rel,
> > > > rte_memory_order_acquire) == 0);
> > > >
> > > > *old_head = oh.val.pos;
> > > > diff --git a/lib/ring/soring.c b/lib/ring/soring.c index
> > 7bcbf35516..21a1a27e24
> > > > 100644
> > > > --- a/lib/ring/soring.c
> > > > +++ b/lib/ring/soring.c
> > > > @@ -123,8 +123,6 @@ __rte_soring_stage_finalize(struct
> > > > soring_stage_headtail *sht, uint32_t stage,
> > > > rte_atomic_store_explicit(&sht->tail.raw, ot.raw,
> > > > rte_memory_order_release);
> > > >
> > > > - /* make sure that new tail value is visible */
> > > > - rte_atomic_thread_fence(rte_memory_order_release);
> > > > return i;
> > > > }
> > > >
> > > > @@ -219,9 +217,6 @@ __rte_soring_update_tail(struct
> > __rte_ring_headtail
> > > > *rht,
> > > > /* unsupported mode, shouldn't be here */
> > > > RTE_ASSERT(0);
> > > > }
> > > > -
> > > > - /* make sure that new tail value is visible */
> > > > - rte_atomic_thread_fence(rte_memory_order_release);
> > > > }
> > > >
> > > > static __rte_always_inline uint32_t
> > > > --
> > > > 2.43.0
Hi Wathsala,
> >
> > Thanks for the clarification, I now see your concern.
> >
> > So, in summary what you are saying is that tail update from
> > Thread #1 that happened at T0 is not observed by the thread #2
> > at T2 when it computed new_head and entries calculation.
Yes
> > That cannot happen in Arm v8/v9 because tail update is a
> > store-release and a load-acquire that program order follows it
> > can only be issued after all the cores have observed the
> > store-release (there is a synchronization with relationship to
> > store-release and load-acquire pairs).
> >
> > In the example you have provided Thread #1's
> > store(&prod.tail, 2, release) is observed by all the cores in the
> > system by the time same thread performs load(&prod.tail, acquire)
> > at T2. As per Arm v8/v9 memory model Thread #2 should observe
> > prod.tail=2 (not prod.tail=1).
> >
> > Arm Architecture Reference Manual section B2.10.11 states…
> >
> > "Where a Load-Acquire appears in program order after a Store-Release,
> > the memory access generated by the Store-Release instruction is
> > observed by each PE to the extent that PE is required to observe the
> > access coherently, before the memory access generated by the
> > Load-Acquire instruction is observed by that PE, to the extent that the
> > PE is required to observe the access coherently."
Interesting, thanks for pointing that out.
Indeed, according to the docs, acquire/release operations follow a sequentially consistent (RCsc) model.
But we are not always getting a pure LDAR instruction here; it depends on the compiler -march/-mcpu flags.
For some HW archs gcc (and clang) generate the LDAR instruction, for others - LDAPR.
And according to the doc:
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
for Load-AcquirePC the requirement that Load-Acquires are observed in order with preceding Store-Releases is dropped.
To be more specific here, on my box:
1) meson with -Dmachine=default (gcc options: "-march=armv8-a+crc") generates code with LDAR instruction
https://godbolt.org/z/5cEscdxrb
2) meson with -Dmachine=native (gcc options: " -mcpu=neoverse-n1") generates code with LDAPR instruction
https://godbolt.org/z/8KjcvYxch
soring_stress_test for 1) (LDAR) passes
soring_stress_test for 2) (LDAPR) fails with symptoms I described earlier
>
> In soring is this the pair that update the tail and move head?
>
> __rte_soring_update_tail:
> __rte_ring_update_tail(&rht->ht, head, next, st, enq);
That function is used to update prod.tail and cons.tail
> __rte_soring_move_cons_head:
> __rte_ring_headtail_move_head(&r->cons.ht, &r->stage[stage].ht, 0, ...);
>
__rte_ring_headtail_move_head is invoked to update prod.head (by __rte_soring_move_prod_head)
and cons.head (by __rte_soring_move_cons_head).
> If so, &rht->ht and &r->stage[stage].ht are the same address?
No.
In the conventional ring we have just two stages: 'prod' and 'cons'.
So moving prod.head depends on cons.tail, and vice versa: moving cons.head depends on prod.tail.
With soring we have multiple (N) extra stages in between.
So the dependency chain is a bit longer:
prod.head depends on cons.tail,
stage[0].head depends on prod.tail,
stage[1].head depends on stage[0].tail,
...
stage[N-1].head depends on stage[N-2].tail,
cons.head depends on stage[N-1].tail.
> If they are
> not, then you will run into the issue you have seen (a.k.a "Other-multi-copy
> atomic" which is legit in Arm v8 and above).
Could you elaborate a bit more for me here?
Thanks
Konstantin
> Thanks.
>
> --wathsala
>
>
> > > -----Original Message-----
> > > From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > Sent: Monday, May 26, 2025 6:54 AM
> > > To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> > > dev@dpdk.org
> > > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > > jerinj@marvell.com; hemant.agrawal@nxp.com; drc@linux.ibm.com; nd
> > > <nd@arm.com>
> > > Subject: RE: [PATCH v1 3/4] ring: fix potential sync issue between head and
> > tail
> > > values
> > >
> > > Hi Wathsala,
> > >
> > > >
> > > > Hi Konstanin,
> > > >
> > > > In rte_ring the store-release on tail update guarantees that CAS
> > > > won't get reordered with the store-released of the tail update.
> > > >
> > > > So, the sequence of events would look like this (combined view
> > > > of head and tail update)
> > > >
> > > > Releaxed-load(new_head, N) ----------------> (A)
> > > > Relaxed-CAS(d->head, new_head, old_head) ----------------> (B)
> > > > Store-release-store(d->tail, new_head) ----------------> (C)
> > > >
> > > > If we look at address dependencies, then...
> > > >
> > > > (B) depends on (A) due to new_head address dependency.
> > > > (C) depends on (A) due to new_head address dependency.
> > > >
> > > > So, dependency graph looks like this
> > > > (A)
> > > > / \
> > > > v v
> > > > (B) (C)
> > > >
> > > > There is no implicit dependence between (B) and (C), I think
> > > > this is the issue you are brining up.
> > > > Even though there is no dependence between the two,
> > > > the store-release of (C) ensures that (B) won't drop below it.
> > > > Therefore, the above graph can be turned into an ordered
> > > > sequence as shown below..
> > > >
> > > > (A) -> (B) -> (C)
> > >
> > > I do agree that with current implementation of
> > > __rte_ring_headtail_move_head()
> > > in lib/ring/rte_ring_c11_pvt.h the order of these 3 operations (A->B->C)
> > > should be sequential.
> > > The problem I am talking about is a different one:
> > > thread can see 'latest' 'cons.head' value, with 'previous' value for 'prod.tail' or
> > > visa-versa.
> > > In other words: 'cons.head' value depends on 'prod.tail', so before making
> > > latest 'cons.head'
> > > value visible to other threads, we need to ensure that latest 'prod.tail' is also
> > > visible.
> > > Let me try to explain it on the example:
> > >
> > > Suppose at some moment we have:
> > > prod={.head=2,.tail=1};
> > > cons={.head=0,.tail=0};
> > > I.e. thead #1 is in process to enqueue one more element into the ring.
> > >
> > > Thread #1 Thread #2
> > > T0:
> > > store(&prod.tail, 2, release);
> > > /*AFAIU: this is somewhat equivalent to: wmb(); prod.tail=2;
> > > * I.E. - it guarantees that all stores initiated before that operation will be
> > > visible
> > > * by other threads at the same moment or before new value of prod.tail wlll
> > > become
> > > * visible, but it doesn't guarantee that new prod.tail value will be visible to
> > > other
> > > * threads immediately.
> > > */
> > > ...
> > > move_cons_head(...,n=2)
> > move_cons_head(...,n=1)
> > > ... ...
> > > T1:
> > > *old_head = load(&cons.head, relaxed);
> > > fence(acquire);
> > > /*old_head == 0, no surprises */
> > > stail = load(&prod.tail, acquire);
> > > /*stail == 2, no surprises */
> > > *entries = (capacity + stail - *old_head);
> > > *new_head = *old_head + n;
> > > /* *entries == (2 - 0) == 2; *new_head==2; all good */
> > > ...
> > > T2:
> > > *old_head =
> > > load(&cons.head, relaxed);
> > > fence(acquire);
> > > /*old_head == 0, no
> > surprises
> > > */
> > > stail = load(&prod.tail,
> > > acquire);
> > > /* !!!!! stail == 1 !!!!! for Thread 2
> > > * Even though we do use acquire here - there was no *release* after
> > > store(&prod.tail).
> > > * So, still no guarantee that Thread 2 will see latest prod.tail value.
> > > */
> > > *entries = (capacity + stail -
> > > *old_head);
> > > /* *entries == (1 - 0) == 1,
> > still
> > > ok */
> > > *new_head = *old_head + n;
> > > /* *new_head == 1 */
> > > T3:
> > > success = CAS(&cons.head,
> > > old_head /*==0/, *new_head /*==2*/,
> > > relaxed, relaxed);
> > > /*success==1, cons.head==2*/
> > > ...
> > > T4:
> > > success = CAS(&cons.head,
> > > old_head /*==0/, *new_head
> > > /*==1*/,
> > > relaxed, relaxed);
> > > /*success==0, *old_head==2*/
> > > /* CAS() failed and provided Thread 2 with latest valued for cons.head(==2)
> > > * Thread 2 repeats attempt, starts second iteration
> > > */
> > > fence(acquire);
> > > stail = load(&prod.tail, acquire);
> > > /* !!!!! stail == 1 !!!!! for Thread 2
> > > * Still no *release* had happened after store(&prod.tail) at T0.
> > > * So, still no guarantee that Thread 2 will see latest prod.tail value.
> > > */
> > > *entries = (capacity + stail -
> > > *old_head);
> > > *new_head = *old_head + n;
> > >
> > > /* !!!!! *entries == (1 - 2) == -1(UINT32_MAX); *new_head==(2+1)==3; !!!!!
> > > * we are about to corrupt our ring !!!!!
> > > */
> > >
> > > >
> > > > I haven't looked at the so-ring yet. Could it be possible that the
> > > > issue is due to something else introduced in that code?
> > >
> > > Well, as I said, so far I wasn't able to re-produce this problem with
> > > conventional
> > > ring (ring_stress_autotest), only soring_stress_autotest is failing and
> > > for now - only on one specific ARM platform.
> > > Regarding soring specific fix (without touching common code) -
> > > sure it is also possible, pls see patch #2.
> > > There I just added 'fence(release);' straight after 'store(&tail);'
> > > That's seems enough to fix that problem within the soring only.
> > > Though, from my understanding rte_ring might also be affected,
> > > that's why I went ahead and prepared that patch.
> > > If you feel, that I a missing something here - pls shout.
> > > Konstantin
> > >
> > >
> > > > Thanks,
> > > >
> > > > --wathsala
> > > >
> > > > > This patch aims several purposes:
> > > > > - provide an alternative (and I think a better) way to fix the
> > > > > issue discussed in previous patch:
> > > > > "ring/soring: fix synchronization issue between head and tail values"
> > > > > - make sure that such problem wouldn’t happen within other usages of
> > > > > __rte_ring_headtail_move_head() – both current rte_ring
> > > > > implementation and possible future use-cases.
> > > > > - step towards unification of move_head() implementations and
> > > > > removing rte_ring_generic_pvt.h
> > > > > It uses Acquire-Release memory ordering for the CAS operation in
> > > > > move_head().
> > > > > That guarantees that corresponding ‘tail’ updates will be visible
> > > > > before the current ‘head’ is updated.
> > > > > As I said before: I think that in theory the problem described in
> > > > > the previous patch might happen with our conventional rte_ring too
> > > > > (when RTE_USE_C11_MEM_MODEL is enabled).
> > > > > But, so far I didn’t manage to reproduce it in reality.
> > > > > For that reason, and also because it touches a critical rte_ring
> > > > > code-path, I put these changes into a separate patch. I expect all
> > > > > interested stakeholders to come up with their comments and
> > > > > observations.
> > > > > Regarding performance impact – on my boxes both ring_perf_autotest
> > > > > and ring_stress_autotest show a mixed set of results: some of them
> > > > > become a few cycles faster, another few cycles slower.
> > > > > But so far, I didn’t notice any real degradations with that patch.
> > > > >
> > > > > Fixes: b5458e2cc483 ("ring: introduce staged ordered ring")
> > > > > Fixes: 1cc363b8ce06 ("ring: introduce HTS ring mode")
> > > > > Fixes: e6ba4731c0f3 ("ring: introduce RTS ring mode")
> > > > > Fixes: 49594a63147a ("ring/c11: relax ordering for load and store of the
> > > head")
> > > > >
> > > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > > > ---
> > > > > lib/ring/rte_ring_c11_pvt.h      | 27 +++++++++++++++++----------
> > > > > lib/ring/rte_ring_hts_elem_pvt.h |  6 ++++--
> > > > > lib/ring/rte_ring_rts_elem_pvt.h |  6 ++++--
> > > > > lib/ring/soring.c                |  5 -----
> > > > > 4 files changed, 25 insertions(+), 19 deletions(-)
> > > > >
> > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> > > > > index 0845cd6dcf..6d1c46df9a 100644
> > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > @@ -77,20 +77,19 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
> > > > > int success;
> > > > > unsigned int max = n;
> > > > >
> > > > > + /* Ensure the head is read before tail */
> > > > > *old_head = rte_atomic_load_explicit(&d->head,
> > > > > - rte_memory_order_relaxed);
> > > > > + rte_memory_order_acquire);
> > > > > do {
> > > > > /* Reset n to the initial burst count */
> > > > > n = max;
> > > > >
> > > > > - /* Ensure the head is read before tail */
> > > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > -
> > > > > - /* load-acquire synchronize with store-release of ht->tail
> > > > > - * in update_tail.
> > > > > + /*
> > > > > + * Read s->tail value. Note that it will be loaded after
> > > > > + * d->head load, but before CAS operation for the d->head.
> > > > > */
> > > > > stail = rte_atomic_load_explicit(&s->tail,
> > > > > - rte_memory_order_acquire);
> > > > > + rte_memory_order_relaxed);
> > > > >
> > > > > /* The subtraction is done between two unsigned 32bits value
> > > > > * (the result is always modulo 32 bits even if we have
> > > > > @@ -112,11 +111,19 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
> > > > > d->head = *new_head;
> > > > > success = 1;
> > > > > } else
> > > > > - /* on failure, *old_head is updated */
> > > > > + /*
> > > > > + * on failure, *old_head is updated.
> > > > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist
> > > > > + * barrier to prevent:
> > > > > + * - OOO reads of cons tail value
> > > > > + * - OOO copy of elems from the ring
> > > > > + * Also RELEASE guarantees that latest tail value
> > > > > + * will become visible before the new head value.
> > > > > + */
> > > > > success =
> > > > > rte_atomic_compare_exchange_strong_explicit(
> > > > > &d->head, old_head, *new_head,
> > > > > - rte_memory_order_relaxed,
> > > > > - rte_memory_order_relaxed);
> > > > > + rte_memory_order_acq_rel,
> > > > > + rte_memory_order_acquire);
> > > > > } while (unlikely(success == 0));
> > > > > return n;
> > > > > }
> > > > > diff --git a/lib/ring/rte_ring_hts_elem_pvt.h b/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > index c59e5f6420..cc593433b9 100644
> > > > > --- a/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > +++ b/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > @@ -116,13 +116,15 @@ __rte_ring_hts_move_head(struct rte_ring_hts_headtail *d,
> > > > > np.pos.head = op.pos.head + n;
> > > > >
> > > > > /*
> > > > > - * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> > > > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> > > > > * - OOO reads of cons tail value
> > > > > * - OOO copy of elems from the ring
> > > > > + * Also RELEASE guarantees that latest tail value
> > > > > + * will become visible before the new head value.
> > > > > */
> > > > > } while (rte_atomic_compare_exchange_strong_explicit(&d->ht.raw,
> > > > > (uint64_t *)(uintptr_t)&op.raw, np.raw,
> > > > > - rte_memory_order_acquire,
> > > > > + rte_memory_order_acq_rel,
> > > > > rte_memory_order_acquire) == 0);
> > > > >
> > > > > *old_head = op.pos.head;
> > > > > diff --git a/lib/ring/rte_ring_rts_elem_pvt.h b/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > index 509fa674fb..860b13cc61 100644
> > > > > --- a/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > +++ b/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > @@ -131,13 +131,15 @@ __rte_ring_rts_move_head(struct rte_ring_rts_headtail *d,
> > > > > nh.val.cnt = oh.val.cnt + 1;
> > > > >
> > > > > /*
> > > > > - * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> > > > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> > > > > * - OOO reads of cons tail value
> > > > > * - OOO copy of elems to the ring
> > > > > + * Also RELEASE guarantees that latest tail value
> > > > > + * will become visible before the new head value.
> > > > > */
> > > > > } while (rte_atomic_compare_exchange_strong_explicit(&d->head.raw,
> > > > > (uint64_t *)(uintptr_t)&oh.raw, nh.raw,
> > > > > - rte_memory_order_acquire,
> > > > > + rte_memory_order_acq_rel,
> > > > > rte_memory_order_acquire) == 0);
> > > > >
> > > > > *old_head = oh.val.pos;
> > > > > diff --git a/lib/ring/soring.c b/lib/ring/soring.c
> > > > > index 7bcbf35516..21a1a27e24 100644
> > > > > --- a/lib/ring/soring.c
> > > > > +++ b/lib/ring/soring.c
> > > > > @@ -123,8 +123,6 @@ __rte_soring_stage_finalize(struct soring_stage_headtail *sht, uint32_t stage,
> > > > > rte_atomic_store_explicit(&sht->tail.raw, ot.raw,
> > > > > rte_memory_order_release);
> > > > >
> > > > > - /* make sure that new tail value is visible */
> > > > > - rte_atomic_thread_fence(rte_memory_order_release);
> > > > > return i;
> > > > > }
> > > > >
> > > > > @@ -219,9 +217,6 @@ __rte_soring_update_tail(struct __rte_ring_headtail *rht,
> > > > > /* unsupported mode, shouldn't be here */
> > > > > RTE_ASSERT(0);
> > > > > }
> > > > > -
> > > > > - /* make sure that new tail value is visible */
> > > > > - rte_atomic_thread_fence(rte_memory_order_release);
> > > > > }
> > > > >
> > > > > static __rte_always_inline uint32_t
> > > > > --
> > > > > 2.43.0
Hi Konstantin,
> > > That cannot happen in Arm v8/v9, because the tail update is a
> > > store-release, and a load-acquire that program order follows it can
> > > only be issued after all the cores have observed the store-release
> > > (there is a synchronizes-with relationship between store-release and
> > > load-acquire pairs).
> > >
> > > In the example you have provided Thread #1's store(&prod.tail, 2,
> > > release) is observed by all the cores in the system by the time same
> > > thread performs load(&prod.tail, acquire) at T2. As per Arm v8/v9
> > > memory model Thread #2 should observe
> > > prod.tail=2 (not prod.tail=1).
> > >
> > > Arm Architecture Reference Manual section B2.10.11 states…
> > >
> > > "Where a Load-Acquire appears in program order after a
> > > Store-Release, the memory access generated by the Store-Release
> > > instruction is observed by each PE to the extent that PE is required
> > > to observe the access coherently, before the memory access generated
> > > by the Load-Acquire instruction is observed by that PE, to the
> > > extent that the PE is required to observe the access coherently."
>
> Interesting, thanks for pointing to that.
> Indeed, according to the docs, acquire/release operations use a sequentially
> consistent model.
> But we are not always getting a pure LDAR instruction here, it depends on
> compiler march/mcpu flags.
> For some HW archs gcc (and clang) generate the LDAR instruction, for others -
> LDAPR.
> And according to the doc:
> https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-
> Store-Release-instructions
> for Load-AcquirePC the requirement that Load-Acquires are observed in order
> with preceding Store-Releases is dropped.
> To be more specific here, on my box:
> 1) meson with -Dmachine=default (gcc options: "-march=armv8-a+crc")
> generates code with LDAR instruction https://godbolt.org/z/5cEscdxrb
> 2) meson with -Dmachine=native (gcc options: " -mcpu=neoverse-n1") generates
> code with LDAPR instruction https://godbolt.org/z/8KjcvYxch
>
> soring_stress_test for 1) (LDAR) passes
> soring_stress_test for 2) (LDAPR) fails with symptoms I described earlier
>
This makes sense: Load-AcquirePC paired with Store-Release gives you
Release Consistency with processor consistency (RCpc), whereas Load-Acquire
paired with a Store-Release gives you Release Consistency with sequential
consistency (RCsc), a stronger consistency model compared to RCpc.
As you have pointed out, this is what causes the issue you have observed:
an LDAPR on the same address on the same core that program order follows
an STLR doesn't cause other cores to observe the value written by the STLR.
However, contrary to this behavior, C11 expects __ATOMIC_RELEASE and
__ATOMIC_ACQUIRE pairs to be sequentially consistent, which the ring relies
on. Thus, __ATOMIC_ACQUIRE cannot emit an LDAPR if paired with an
__ATOMIC_RELEASE.
To correct this behavior, I suggest changing both the tail update
store(&prod.tail, 2, release) to store(&prod.tail, 2, seq_cst) and the
load(&prod.tail, acquire) to load(&prod.tail, seq_cst).
This would result in an STLR and an LDAR (instead of an LDAPR), so
essentially we are going back to the desired old behavior.
I have raised this issue with the compiler folks to get some clarity on
apparent conflict with the C11 spec.
Additionally, when you specify "-mcpu=neoverse-n1", GCC knows that
Neoverse-N1 CPUs have FEAT_LRCPC, which means they implement the LDAPR
instruction. But on the same platform, when you only specify
"-march=armv8-a+crc", GCC doesn't know that LDAPR is supported unless
you also append +rcpc to that string, so you end up with LDAR.
(This is why we made the recent build changes.)
Thanks
--wathsala
> >
> > In soring is this the pair that update the tail and move head?
> >
> > __rte_soring_update_tail:
> > __rte_ring_update_tail(&rht->ht, head, next, st, enq);
>
> That function is used to update prod.tail and cons.tail
>
> > __rte_soring_move_cons_head:
> > __rte_ring_headtail_move_head(&r->cons.ht, &r->stage[stage].ht, 0,
> > ...);
> >
>
> __rte_ring_headtail_move_head is invoked to update prod.head (by
> __rte_soring_move_prod_head) and cons.head (by
> __rte_soring_move_cons_head).
>
> > If so, &rht->ht and &r->stage[stage].ht are the same address?
>
> No.
> In a conventional ring we have just two stages: 'prod' and 'cons'.
> So moving prod.head depends on cons.tail and vice versa: moving cons.head
> depends on prod.tail.
> With soring we have multiple (N) extra stages in between.
> So the dependency chain is a bit longer:
> prod.head depends on cons.tail
> stage[0].head depends on prod.tail
> stage[1].head depends on stage[0].tail
> ...
> stage[N-1].head depends on stage[N-2].tail
> cons.head depends on stage[N-1].tail
>
> > If they are
> > not, then you will run into the issue you have seen (a.k.a
> > "Other-multi-copy atomic" which is legit in Arm v8 and above).
>
> Can you probably elaborate a bit more for me here?
> Thanks
> Konstantin
>
>
> > Thanks.
> >
> > --wathsala
> >
> >
> > > > -----Original Message-----
> > > > From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > > Sent: Monday, May 26, 2025 6:54 AM
> > > > To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> > > > dev@dpdk.org
> > > > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > > > jerinj@marvell.com; hemant.agrawal@nxp.com; drc@linux.ibm.com; nd
> > > > <nd@arm.com>
> > > > Subject: RE: [PATCH v1 3/4] ring: fix potential sync issue between
> > > > head and
> > > tail
> > > > values
> > > >
> > > > Hi Wathsala,
> > > >
> > > > >
> > > > > > Hi Konstantin,
> > > > >
> > > > > In rte_ring the store-release on tail update guarantees that CAS
> > > > > won't get reordered with the store-released of the tail update.
> > > > >
> > > > > So, the sequence of events would look like this (combined view
> > > > > of head and tail update)
> > > > >
> > > > > > Relaxed-load(new_head, N) ----------------> (A)
> > > > > Relaxed-CAS(d->head, new_head, old_head) ----------------> (B)
> > > > > Store-release-store(d->tail, new_head) ----------------> (C)
> > > > >
> > > > > If we look at address dependencies, then...
> > > > >
> > > > > (B) depends on (A) due to new_head address dependency.
> > > > > (C) depends on (A) due to new_head address dependency.
> > > > >
> > > > > So, dependency graph looks like this
> > > > > (A)
> > > > > / \
> > > > > v v
> > > > > (B) (C)
> > > > >
> > > > > There is no implicit dependence between (B) and (C), I think
> > > > > this is the issue you are brining up.
> > > > > Even though there is no dependence between the two, the
> > > > > store-release of (C) ensures that (B) won't drop below it.
> > > > > Therefore, the above graph can be turned into an ordered
> > > > > sequence as shown below..
> > > > >
> > > > > (A) -> (B) -> (C)
> > > >
> > > > I do agree that with current implementation of
> > > > __rte_ring_headtail_move_head()
> > > > in lib/ring/rte_ring_c11_pvt.h the order of these 3 operations
> > > > (A->B->C) should be sequential.
> > > > The problem I am talking about is a different one:
> > > > a thread can see the 'latest' 'cons.head' value with the 'previous'
> > > > value for 'prod.tail', or vice versa.
> > > > In other words: the 'cons.head' value depends on 'prod.tail', so
> > > > before making the latest 'cons.head' value visible to other threads,
> > > > we need to ensure that the latest 'prod.tail' is also visible.
> > > > Let me try to explain it with an example:
> > > >
> > > > Suppose at some moment we have:
> > > > prod={.head=2,.tail=1};
> > > > cons={.head=0,.tail=0};
> > > > I.e. thread #1 is in the process of enqueueing one more element into the ring.
> > > >
> > > > Thread #1                                  Thread #2
> > > > T0:
> > > > store(&prod.tail, 2, release);
> > > > /* AFAIU this is somewhat equivalent to: wmb(); prod.tail=2;
> > > >  * I.e. it guarantees that all stores initiated before that operation
> > > >  * will be visible by other threads at the same moment or before the
> > > >  * new value of prod.tail will become visible, but it doesn't
> > > >  * guarantee that the new prod.tail value will be visible to other
> > > >  * threads immediately.
> > > >  */
> > > > ...
> > > > move_cons_head(...,n=2)                    move_cons_head(...,n=1)
> > > > ...                                        ...
> > > > T1:
> > > > *old_head = load(&cons.head, relaxed);
> > > > fence(acquire);
> > > > /* old_head == 0, no surprises */
> > > > stail = load(&prod.tail, acquire);
> > > > /* stail == 2, no surprises */
> > > > *entries = (capacity + stail - *old_head);
> > > > *new_head = *old_head + n;
> > > > /* *entries == (2 - 0) == 2; *new_head==2; all good */
> > > > ...
> > > > T2:
> > > >                                            *old_head = load(&cons.head, relaxed);
> > > >                                            fence(acquire);
> > > >                                            /* old_head == 0, no surprises */
> > > >                                            stail = load(&prod.tail, acquire);
> > > >                                            /* !!!!! stail == 1 !!!!! for Thread 2
> > > >                                             * Even though we do use acquire here -
> > > >                                             * there was no *release* after
> > > >                                             * store(&prod.tail).
> > > >                                             * So, still no guarantee that Thread 2
> > > >                                             * will see the latest prod.tail value.
> > > >                                             */
> > > >                                            *entries = (capacity + stail - *old_head);
> > > >                                            /* *entries == (1 - 0) == 1, still ok */
> > > >                                            *new_head = *old_head + n;
> > > >                                            /* *new_head == 1 */
> > > > T3:
> > > > success = CAS(&cons.head,
> > > >     old_head /*==0*/, *new_head /*==2*/,
> > > >     relaxed, relaxed);
> > > > /* success==1, cons.head==2 */
> > > > ...
> > > > T4:
> > > >                                            success = CAS(&cons.head,
> > > >                                                old_head /*==0*/, *new_head /*==1*/,
> > > >                                                relaxed, relaxed);
> > > >                                            /* success==0, *old_head==2 */
> > > >                                            /* CAS() failed and provided Thread 2 with
> > > >                                             * the latest value for cons.head (==2).
> > > >                                             * Thread 2 repeats the attempt, starts a
> > > >                                             * second iteration. */
> > > >                                            fence(acquire);
> > > >                                            stail = load(&prod.tail, acquire);
> > > >                                            /* !!!!! stail == 1 !!!!! for Thread 2
> > > >                                             * Still no *release* had happened after
> > > >                                             * store(&prod.tail) at T0.
> > > >                                             * So, still no guarantee that Thread 2
> > > >                                             * will see the latest prod.tail value.
> > > >                                             */
> > > >                                            *entries = (capacity + stail - *old_head);
> > > >                                            *new_head = *old_head + n;
> > > >                                            /* !!!!! *entries == (1 - 2) == -1 (UINT32_MAX);
> > > >                                             * *new_head==(2+1)==3 !!!!!
> > > >                                             * we are about to corrupt our ring !!!!!
> > > >                                             */
> > > >
> > > > > [rest of the quoted exchange and the patch body snipped - it is quoted in full earlier in this thread]
Hi Konstantin,
I have some updates to the previous claims I have made.
> > > > That cannot happen in Arm v8/v9, because the tail update is a
> > > > store-release, and a load-acquire that program order follows it can
> > > > only be issued after all the cores have observed the store-release
> > > > (there is a synchronizes-with relationship between store-release and
> > > > load-acquire pairs).
> > > >
> > > > In the example you have provided Thread #1's store(&prod.tail, 2,
> > > > release) is observed by all the cores in the system by the time same
> > > > thread performs load(&prod.tail, acquire) at T2. As per Arm v8/v9
> > > > memory model Thread #2 should observe
> > > > prod.tail=2 (not prod.tail=1).
> > > >
> > > > Arm Architecture Reference Manual section B2.10.11 states…
> > > >
> > > > "Where a Load-Acquire appears in program order after a
> > > > Store-Release, the memory access generated by the Store-Release
> > > > instruction is observed by each PE to the extent that PE is required
> > > > to observe the access coherently, before the memory access generated
> > > > by the Load-Acquire instruction is observed by that PE, to the
> > > > extent that the PE is required to observe the access coherently."
> >
> > Interesting, thanks for pointing to that.
> > Indeed, according to the docs, acquire/release operations use a sequentially
> > consistent model.
> > But we are not always getting a pure LDAR instruction here, it depends on
> > compiler march/mcpu flags.
> > For some HW archs gcc (and clang) generate the LDAR instruction, for others -
> > LDAPR.
> > And according to the doc:
> > https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-
> > Store-Release-instructions
> > for Load-AcquirePC the requirement that Load-Acquires are observed in
> order
> > with preceding Store-Releases is dropped.
> > To be more specific here, on my box:
> > 1) meson with -Dmachine=default (gcc options: "-march=armv8-a+crc")
> > generates code with LDAR instruction https://godbolt.org/z/5cEscdxrb
> > 2) meson with -Dmachine=native (gcc options: " -mcpu=neoverse-n1")
> generates
> > code with LDAPR instruction https://godbolt.org/z/8KjcvYxch
> >
> > soring_stress_test for 1) (LDAR) passes
> > soring_stress_test for 2) (LDAPR) fails with symptoms I described earlier
> >
>
> This makes sense: Load-AcquirePC paired with Store-Release gives you
> Release Consistency with processor consistency (RCpc), whereas Load-Acquire
> paired with a Store-Release gives you Release Consistency with sequential
> consistency (RCsc), a stronger consistency model compared to RCpc.
> As you have pointed out, this is what causes the issue you have observed.
> LDAPR on the same address on the same core that program order follows
> STLR doesn't cause other cores to observe the value written by STLR.
It turned out that LDAPR still maintains the synchronizes-with relationship
with an STLR on the same address. It's just that it allows interleaving of
unrelated synchronization (critical sections) that operates on independent
memory addresses.
On the other hand, an STLR and LDAR pair would not allow interleaving of
independent critical sections.
> However, contrary to this behavior C11 expects __ATOMIC_RELEASE and
> __ATOMIC_ACQUIRE pairs to be sequentially consistent which the ring relies
> on. Thus, __ATOMIC_ACQUIRE cannot emit an LDAPR if paired with an
> __ATOMIC_RELEASE.
>
Due to the above reason, __ATOMIC_RELEASE and __ATOMIC_ACQUIRE pairs
provide the synchronizes-with relationship with LDAPR, as required by
C11.
__ATOMIC_SEQ_CST uses LDAR, as it requires observing all the releases
that happened before the acquire (including releases on other addresses).
> To correct this behavior, I suggest changing both the tail update
> store(&prod.tail, 2, release) to store(&prod.tail, 2, seq_cst) and the
> load(&prod.tail, acquire) to load(&prod.tail, seq_cst).
> This would result in an STLR and an LDAR (instead of an LDAPR), so
> essentially we are going back to the desired old behavior.
>
__ATOMIC_RELEASE and __ATOMIC_ACQUIRE are sufficient for the DPDK
ring, as the release and acquire happen on the same address. So, no
change is required in rte_ring (STLR and LDAPR are fine there).
But soring seems to rely on sequentially consistent ordering rather than
pure release-acquire ordering as rte_ring does. So, I would rather change
soring from __ATOMIC_RELEASE/__ATOMIC_ACQUIRE to __ATOMIC_SEQ_CST.
> I have raised this issue with the compiler folks to get some clarity on
> apparent conflict with the C11 spec.
>
See above.
Thanks.
--wathsala
> Additionally, when you specify " -mcpu=neoverse-n1" GCC knows that
> Neoverse-n1 CPUs have FEAT_LRCPC which means it implements LDAPR
> Instruction. But on the same platform when you only specify
> "-march=armv8-a+crc" GCC doesn't know that LDAPR is supported unless
> you also append +rcpc to that string, so you end up with LDAR.
> (Why we made the recent build changes)
>
> Thanks
>
> --wathsala
>
> > >
> > > In soring is this the pair that update the tail and move head?
> > >
> > > __rte_soring_update_tail:
> > > __rte_ring_update_tail(&rht->ht, head, next, st, enq);
> >
> > That function is used to update prod.tail and cons.tail
> >
> > > __rte_soring_move_cons_head:
> > > __rte_ring_headtail_move_head(&r->cons.ht, &r->stage[stage].ht, 0,
> > > ...);
> > >
> >
> > __rte_ring_headtail_move_head is invoked to update prod.head (by
> > __rte_soring_move_prod_head) and cons.head (by
> > __rte_soring_move_cons_head).
> >
> > > If so, &rht->ht and &r->stage[stage].ht are the same address?
> >
> > No.
> > In conventional ring we have just two stages: 'prod' and 'cons'.
> > So moving prod.head depends on cons.tail and visa-versa: moving cons.head
> > depends on prod.tail.
> > With soring we have multiple (N) extra stages in between.
> > So the dependency chain is a bit longer:
> > prod.head depends on cons.tail
> > stage[0].head depends on prod.tail,
> > stage[1].head depends on stage[0].tail
> > ...
> > stage[N-1].head depends on stage[N-2].tail
> > cons.head depends on stage[N-1].tail
> >
> > > If they are
> > > not, then you will run into the issue you have seen (a.k.a
> > > "Other-multi-copy atomic" which is legit in Arm v8 and above).
> >
> > Can you probably elaborate a bit more for me here?
> > Thanks
> > Konstantin
> >
> >
> > > Thanks.
> > >
> > > --wathsala
> > >
> > >
> > > > > -----Original Message-----
> > > > > From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > > > Sent: Monday, May 26, 2025 6:54 AM
> > > > > To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> > > > > dev@dpdk.org
> > > > > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > > > > jerinj@marvell.com; hemant.agrawal@nxp.com; drc@linux.ibm.com; nd
> > > > > <nd@arm.com>
> > > > > Subject: RE: [PATCH v1 3/4] ring: fix potential sync issue between
> > > > > head and
> > > > tail
> > > > > values
> > > > >
> > > > > Hi Wathsala,
> > > > >
> > > > > >
> > > > > > > Hi Konstantin,
> > > > > >
> > > > > > In rte_ring the store-release on tail update guarantees that CAS
> > > > > > won't get reordered with the store-released of the tail update.
> > > > > >
> > > > > > So, the sequence of events would look like this (combined view
> > > > > > of head and tail update)
> > > > > >
> > > > > > Relaxed-load(new_head, N) ----------------> (A)
> > > > > > Relaxed-CAS(d->head, new_head, old_head) ----------------> (B)
> > > > > > Store-release-store(d->tail, new_head) ----------------> (C)
> > > > > >
> > > > > > If we look at address dependencies, then...
> > > > > >
> > > > > > (B) depends on (A) due to new_head address dependency.
> > > > > > (C) depends on (A) due to new_head address dependency.
> > > > > >
> > > > > > So, dependency graph looks like this
> > > > > > (A)
> > > > > > / \
> > > > > > v v
> > > > > > (B) (C)
> > > > > >
> > > > > > There is no implicit dependence between (B) and (C), I think
> > > > > > this is the issue you are bringing up.
> > > > > > Even though there is no dependence between the two, the
> > > > > > store-release of (C) ensures that (B) won't drop below it.
> > > > > > Therefore, the above graph can be turned into an ordered
> > > > > > sequence as shown below..
> > > > > >
> > > > > > (A) -> (B) -> (C)
> > > > >
> > > > > I do agree that with current implementation of
> > > > > __rte_ring_headtail_move_head()
> > > > > in lib/ring/rte_ring_c11_pvt.h the order of these 3 operations
> > > > > (A->B->C) should be sequential.
> > > > > The problem I am talking about is a different one:
> > > > > thread can see 'latest' 'cons.head' value, with 'previous' value
> > > > > for 'prod.tail' or vice-versa.
> > > > > In other words: 'cons.head' value depends on 'prod.tail', so
> > > > > before making latest 'cons.head'
> > > > > value visible to other threads, we need to ensure that latest
> > > > > 'prod.tail' is also visible.
> > > > > Let me try to explain it on the example:
> > > > >
> > > > > Suppose at some moment we have:
> > > > > prod={.head=2,.tail=1};
> > > > > cons={.head=0,.tail=0};
> > > > > I.e. thread #1 is in the process of enqueueing one more element into the ring.
> > > > >
> > > > > Thread #1 Thread #2
> > > > > T0:
> > > > > store(&prod.tail, 2, release);
> > > > > /*AFAIU: this is somewhat equivalent to: wmb(); prod.tail=2;
> > > > > * I.E. - it guarantees that all stores initiated before that
> > > > > operation will be visible
> > > > > * by other threads at the same moment or before new value of
> > > > > prod.tail will become
> > > > > * visible, but it doesn't guarantee that new prod.tail value
> > > > > will be visible to other
> > > > > * threads immediately.
> > > > > */
> > > > > ...
> > > > > move_cons_head(...,n=2)
> > > > move_cons_head(...,n=1)
> > > > > ... ...
> > > > > T1:
> > > > > *old_head = load(&cons.head, relaxed);
> > > > > fence(acquire);
> > > > > /*old_head == 0, no surprises */
> > > > > stail = load(&prod.tail, acquire);
> > > > > /*stail == 2, no surprises */
> > > > > *entries = (capacity + stail - *old_head); *new_head = *old_head
> > > > > + n;
> > > > > /* *entries == (2 - 0) == 2; *new_head==2; all good */ ...
> > > > > T2:
> > > > >
> > > > > *old_head = load(&cons.head, relaxed);
> > > > > fence(acquire);
> > > > >
> > > > > /*old_head == 0, no
> > > > surprises
> > > > > */
> > > > >
> > > > > stail = load(&prod.tail, acquire);
> > > > > /* !!!!! stail == 1 !!!!! for Thread 2
> > > > > * Even though we do use acquire here - there was no *release*
> > > > > after store(&prod.tail).
> > > > > * So, still no guarantee that Thread 2 will see latest prod.tail value.
> > > > > */
> > > > >
> > > > > *entries = (capacity + stail - *old_head);
> > > > >
> > > > > /* *entries == (1 - 0) == 1,
> > > > still
> > > > > ok */
> > > > > *new_head =
> > *old_head + n;
> > > > >
> > > > > /* *new_head == 1 */
> > > > > T3:
> > > > > success = CAS(&cons.head,
> > > > > old_head /*==0*/, *new_head /*==2*/,
> > > > > relaxed, relaxed);
> > > > > /*success==1, cons.head==2*/
> > > > > ...
> > > > > T4:
> > > > > success =
> > CAS(&cons.head,
> > > > >
> > > > > old_head /*==0*/, *new_head /*==1*/,
> > > > > relaxed, relaxed);
> > > > >
> > > > > /*success==0, *old_head==2*/
> > > > > /* CAS() failed and provided Thread 2 with latest value for
> > > > > cons.head(==2)
> > > > > * Thread 2 repeats attempt, starts second iteration */
> > > > > fence(acquire);
> > > > >
> > > > > stail = load(&prod.tail, acquire);
> > > > > /* !!!!! stail == 1 !!!!! for Thread 2
> > > > > * Still no *release* had happened after store(&prod.tail) at T0.
> > > > > * So, still no guarantee that Thread 2 will see latest prod.tail value.
> > > > > */
> > > > >
> > > > > *entries = (capacity + stail - *old_head);
> > > > >
> > > > > *new_head = *old_head + n;
> > > > >
> > > > > /* !!!!! *entries == (1 - 2) == -1(UINT32_MAX); *new_head==(2+1)==3;
> !!!!!
> > > > > * we are about to corrupt our ring !!!!!
> > > > > */
> > > > >
> > > > > >
> > > > > > I haven't looked at the so-ring yet. Could it be possible that
> > > > > > the issue is due to something else introduced in that code?
> > > > >
> > > > > Well, as I said, so far I wasn't able to reproduce this problem
> > > > > with conventional ring (ring_stress_autotest), only
> > > > > soring_stress_autotest is failing and for now - only on one
> > > > > specific ARM platform.
> > > > > Regarding soring specific fix (without touching common code) -
> > > > > sure it is also possible, pls see patch #2.
> > > > > There I just added 'fence(release);' straight after 'store(&tail);'
> > > > > That seems enough to fix that problem within the soring only.
> > > > > Though, from my understanding rte_ring might also be affected,
> > > > > that's why I went ahead and prepared that patch.
> > > > > If you feel that I am missing something here - pls shout.
> > > > > Konstantin
> > > > >
> > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > --wathsala
> > > > > >
> > > > > > > This patch aims several purposes:
> > > > > > > - provide an alternative (and I think a better) way to fix the
> > > > > > > issue discussed in previous patch:
> > > > > > > "ring/soring: fix synchronization issue between head and tail
> values"
> > > > > > > - make sure that such problem wouldn’t happen within other usages
> of
> > > > > > > __rte_ring_headtail_move_head() – both current rte_ring
> > > > > > > implementation and possible future use-cases.
> > > > > > > - step towards unification of move_head() implementations and
> > > > > > > removing rte_ring_generic_pvt.h It uses Acquire-Release
> > > > > > > memory ordering for CAS operation in
> > > > > move_head().
> > > > > > > That guarantees that corresponding ‘tail’ updates will be
> > > > > > > visible before
> > > > > current
> > > > > > > ‘head’ is updated.
> > > > > > > As I said before: I think that in theory the problem described
> > > > > > > in previous
> > > > > patch
> > > > > > > might happen with our conventional rte_ring too (when
> > > > > > > RTE_USE_C11_MEM_MODEL enabled).
> > > > > > > But, so far I didn’t manage to reproduce it in reality.
> > > > > > > For that reason and also because it touches a critical
> > > > > > > rte_ring code-path, I
> > > > > put
> > > > > > > these changes into a separate patch. Expect all interested
> > > > > > > stakeholders to
> > > > > come-
> > > > > > > up with their comments and observations.
> > > > > > > Regarding performance impact – on my boxes both
> > > > > > > ring_perf_autotest
> > > > and
> > > > > > > ring_stress_autotest – show a mixed set of results: some of
> > > > > > > them
> > > > become
> > > > > few
> > > > > > > cycles faster, another few cycles slower.
> > > > > > > But so far, I didn’t notice any real degradations with that patch.
> > > > > > >
> > > > > > > Fixes: b5458e2cc483 ("ring: introduce staged ordered ring")
> > > > > > > Fixes: 1cc363b8ce06 ("ring: introduce HTS ring mode")
> > > > > > > Fixes: e6ba4731c0f3 ("ring: introduce RTS ring mode")
> > > > > > > Fixes: 49594a63147a ("ring/c11: relax ordering for load and
> > > > > > > store of the
> > > > > head")
> > > > > > >
> > > > > > > Signed-off-by: Konstantin Ananyev
> > > > > > > <konstantin.ananyev@huawei.com>
> > > > > > > ---
> > > > > > > lib/ring/rte_ring_c11_pvt.h | 27 +++++++++++++++++----------
> > > > > > > lib/ring/rte_ring_hts_elem_pvt.h | 6 ++++--
> > > > > lib/ring/rte_ring_rts_elem_pvt.h
> > > > > > > | 6 ++++--
> > > > > > > lib/ring/soring.c | 5 -----
> > > > > > > 4 files changed, 25 insertions(+), 19 deletions(-)
> > > > > > >
> > > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > b/lib/ring/rte_ring_c11_pvt.h
> > > > index
> > > > > > > 0845cd6dcf..6d1c46df9a 100644
> > > > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > @@ -77,20 +77,19 @@ __rte_ring_headtail_move_head(struct
> > > > > > > rte_ring_headtail *d,
> > > > > > > int success;
> > > > > > > unsigned int max = n;
> > > > > > >
> > > > > > > + /* Ensure the head is read before tail */
> > > > > > > *old_head = rte_atomic_load_explicit(&d->head,
> > > > > > > - rte_memory_order_relaxed);
> > > > > > > + rte_memory_order_acquire);
> > > > > > > do {
> > > > > > > /* Reset n to the initial burst count */
> > > > > > > n = max;
> > > > > > >
> > > > > > > - /* Ensure the head is read before tail */
> > > > > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > > > -
> > > > > > > - /* load-acquire synchronize with store-release of ht-
> > >tail
> > > > > > > - * in update_tail.
> > > > > > > + /*
> > > > > > > + * Read s->tail value. Note that it will be loaded after
> > > > > > > + * d->head load, but before CAS operation for the d-
> > >head.
> > > > > > > */
> > > > > > > stail = rte_atomic_load_explicit(&s->tail,
> > > > > > > - rte_memory_order_acquire);
> > > > > > > + rte_memory_order_relaxed);
> > > > > > >
> > > > > > > /* The subtraction is done between two unsigned
> 32bits
> > value
> > > > > > > * (the result is always modulo 32 bits even if we have
> > @@
> > > > > > > -
> > > > > > > 112,11 +111,19 @@ __rte_ring_headtail_move_head(struct
> > > > > rte_ring_headtail
> > > > > > > *d,
> > > > > > > d->head = *new_head;
> > > > > > > success = 1;
> > > > > > > } else
> > > > > > > - /* on failure, *old_head is updated */
> > > > > > > + /*
> > > > > > > + * on failure, *old_head is updated.
> > > > > > > + * this CAS(ACQ_REL, ACQUIRE) serves as a
> > hoist
> > > > > > > + * barrier to prevent:
> > > > > > > + * - OOO reads of cons tail value
> > > > > > > + * - OOO copy of elems from the ring
> > > > > > > + * Also RELEASE guarantees that latest tail
> > value
> > > > > > > + * will become visible before the new head
> > value.
> > > > > > > + */
> > > > > > > success =
> > > > > > > rte_atomic_compare_exchange_strong_explicit(
> > > > > > > &d->head, old_head,
> > *new_head,
> > > > > > > - rte_memory_order_relaxed,
> > > > > > > - rte_memory_order_relaxed);
> > > > > > > + rte_memory_order_acq_rel,
> > > > > > > + rte_memory_order_acquire);
> > > > > > > } while (unlikely(success == 0));
> > > > > > > return n;
> > > > > > > }
> > > > > > > diff --git a/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > b/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > > > index c59e5f6420..cc593433b9 100644
> > > > > > > --- a/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > > > +++ b/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > > > @@ -116,13 +116,15 @@ __rte_ring_hts_move_head(struct
> > > > > > > rte_ring_hts_headtail *d,
> > > > > > > np.pos.head = op.pos.head + n;
> > > > > > >
> > > > > > > /*
> > > > > > > - * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to
> > prevent:
> > > > > > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to
> > prevent:
> > > > > > > * - OOO reads of cons tail value
> > > > > > > * - OOO copy of elems from the ring
> > > > > > > + * Also RELEASE guarantees that latest tail value
> > > > > > > + * will become visible before the new head value.
> > > > > > > */
> > > > > > > } while (rte_atomic_compare_exchange_strong_explicit(&d-
> > >ht.raw,
> > > > > > > (uint64_t *)(uintptr_t)&op.raw, np.raw,
> > > > > > > - rte_memory_order_acquire,
> > > > > > > + rte_memory_order_acq_rel,
> > > > > > > rte_memory_order_acquire) == 0);
> > > > > > >
> > > > > > > *old_head = op.pos.head;
> > > > > > > diff --git a/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > b/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > > > index 509fa674fb..860b13cc61 100644
> > > > > > > --- a/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > > > +++ b/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > > > @@ -131,13 +131,15 @@ __rte_ring_rts_move_head(struct
> > > > > > > rte_ring_rts_headtail *d,
> > > > > > > nh.val.cnt = oh.val.cnt + 1;
> > > > > > >
> > > > > > > /*
> > > > > > > - * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to
> > prevent:
> > > > > > > + * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to
> > prevent:
> > > > > > > * - OOO reads of cons tail value
> > > > > > > * - OOO copy of elems to the ring
> > > > > > > + * Also RELEASE guarantees that latest tail value
> > > > > > > + * will become visible before the new head value.
> > > > > > > */
> > > > > > > } while (rte_atomic_compare_exchange_strong_explicit(&d-
> > > > > >head.raw,
> > > > > > > (uint64_t *)(uintptr_t)&oh.raw, nh.raw,
> > > > > > > - rte_memory_order_acquire,
> > > > > > > + rte_memory_order_acq_rel,
> > > > > > > rte_memory_order_acquire) == 0);
> > > > > > >
> > > > > > > *old_head = oh.val.pos;
> > > > > > > diff --git a/lib/ring/soring.c b/lib/ring/soring.c index
> > > > > 7bcbf35516..21a1a27e24
> > > > > > > 100644
> > > > > > > --- a/lib/ring/soring.c
> > > > > > > +++ b/lib/ring/soring.c
> > > > > > > @@ -123,8 +123,6 @@ __rte_soring_stage_finalize(struct
> > > > > > > soring_stage_headtail *sht, uint32_t stage,
> > > > > > > rte_atomic_store_explicit(&sht->tail.raw, ot.raw,
> > > > > > > rte_memory_order_release);
> > > > > > >
> > > > > > > - /* make sure that new tail value is visible */
> > > > > > > - rte_atomic_thread_fence(rte_memory_order_release);
> > > > > > > return i;
> > > > > > > }
> > > > > > >
> > > > > > > @@ -219,9 +217,6 @@ __rte_soring_update_tail(struct
> > > > > __rte_ring_headtail
> > > > > > > *rht,
> > > > > > > /* unsupported mode, shouldn't be here */
> > > > > > > RTE_ASSERT(0);
> > > > > > > }
> > > > > > > -
> > > > > > > - /* make sure that new tail value is visible */
> > > > > > > - rte_atomic_thread_fence(rte_memory_order_release);
> > > > > > > }
> > > > > > >
> > > > > > > static __rte_always_inline uint32_t
> > > > > > > --
> > > > > > > 2.43.0
Hi Wathsala,
>
> I have some updates to the previous claims I have made.
Thanks for the update; unfortunately I don't have time to look at it properly right now.
Will try to get back to it in 3-4 weeks.
Just one question for now, see below.
>
> > > > > That cannot happen in Arm v8/v9 because tail update is a
> > > > > store-release and a load-acquire that program order follows it can
> > > > > only be issued after all the cores have observed the store-release
> > > > > (there is a synchronization with relationship to store-release and
> > > > > load-acquire pairs).
> > > > >
> > > > > In the example you have provided Thread #1's store(&prod.tail, 2,
> > > > > release) is observed by all the cores in the system by the time the same
> > > > > thread performs load(&prod.tail, acquire) at T2. As per Arm v8/v9
> > > > > memory model Thread #2 should observe
> > > > > prod.tail=2 (not prod.tail=1).
> > > > >
> > > > > Arm Architecture Reference Manual section B2.10.11 states…
> > > > >
> > > > > "Where a Load-Acquire appears in program order after a
> > > > > Store-Release, the memory access generated by the Store-Release
> > > > > instruction is observed by each PE to the extent that PE is required
> > > > > to observe the access coherently, before the memory access generated
> > > > > by the Load-Acquire instruction is observed by that PE, to the
> > > > > extent that the PE is required to observe the access coherently."
> > >
> > > Interesting, thanks for pointing to that.
> > > Indeed, according to the docs, acquire/release operations use a sequentially
> > > consistent model.
> > > But we are not always getting a pure LDAR instruction here; it depends on compiler
> > > march/mcpu flags.
> > > For some HW arch gcc (and clang) generate LDAR instruction, for others -
> > > LDAPR.
> > > And according to the doc:
> > > https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-
> > > Store-Release-instructions
> > > for Load-AcquirePC the requirement that Load-Acquires are observed in
> > order
> > > with preceding Store-Releases is dropped.
> > > To be more specific here, on my box:
> > > 1) meson with -Dmachine=default (gcc options: "-march=armv8-a+crc")
> > > generates code with LDAR instruction https://godbolt.org/z/5cEscdxrb
> > > 2) meson with -Dmachine=native (gcc options: " -mcpu=neoverse-n1")
> > generates
> > > code with LDAPR instruction https://godbolt.org/z/8KjcvYxch
> > >
> > > soring_stress_test for 1) (LDAR) passes
> > > soring_stress_test for 2) (LDAPR) fails with symptoms I described earlier
> > >
> >
> > This makes sense, Load-AcquirePC paired with Store-Release gives you
> > Release-Consistency-processor-consistency (RCpc) whereas Load-Acquire
> > paired with a Store-Release gives you
> > Release-Consistency-sequential-consistency (RCsc) a stronger consistency
> > model compared to RCpc.
> > As you have pointed out this is what causes the issue you have observed.
> > LDAPR on the same address on the same core that program order follows
> > STLR doesn't cause other cores to observe the value written by STLR.
>
> It turned out that LDAPR still maintains the synchronizes-with relationship with STLR
> on the same address. It's just that it interleaves unrelated synchronization with
> (critical sections) that operate on independent memory addresses.
Is it documented somewhere in the ARM docs?
Another question: what is considered 'independent memory addresses'?
Addresses that are not equal, that belong to different cache-lines, or something else?
> On the other hand, STLR and LDAR pair would not allow interleaving of
> Independent critical sections.
>
> > However, contrary to this behavior C11 expects __ATOMIC_RELEASE and
> > __ATOMIC_ACQUIRE pairs to be sequentially consistent which the ring relies
> > on. Thus, __ATOMIC_ACQUIRE cannot emit an LDAPR if paired with an
> > __ATOMIC_RELEASE.
> >
>
> Due to the above reason, __ATOMIC_RELEASE and __ATOMIC_ACQUIRE pairs
> provide a synchronizes-with relationship with LDAPR as required by
> C11.
> __ATOMIC_SEQ_CST uses LDAR as it requires observing of all the releases
> happened before acquire (including releases on other addresses).
>
>
> > To correct this behavior, I suggest changing both the tail update to
> > store(&prod.tail, 2, release) to store(&prod.tail, 2, seq_cst) and
> > load(&prod.tail, acquire) to load(&prod.tail, seq_cst);
> > This would result in an STLR and LDAR (instead of an LDAPR), so
> > essentially, we are going back to the desired old behavior.
> >
>
> __ATOMIC_RELEASE and __ATOMIC_ACQUIRE are sufficient for the DPDK
> ring, as the release and acquire happen on the same address. So, no
> change is required in rte_ring (STLR and LDAPR)
>
> But soring seems to rely on sequentially-consistent ordering rather than
> pure release-acquire ordering as with rte_ring. So, I would rather change
> soring from __ATOMIC_RELEASE/__ATOMIC_ACQUIRE to __ATOMIC_SEQ_CST.
>
>
> > I have raised this issue with the compiler folks to get some clarity on
> > apparent conflict with the C11 spec.
> >
>
> See above.
>
> Thanks.
>
> --wathsala
>
>
> > Additionally, when you specify " -mcpu=neoverse-n1" GCC knows that
> > Neoverse-n1 CPUs have FEAT_LRCPC which means it implements LDAPR
> > Instruction. But on the same platform when you only specify
> > "-march=armv8-a+crc" GCC doesn't know that LDAPR is supported unless
> > you also append +rcpc to that string, so you end up with LDAR.
> > (Why we made the recent build changes)
> >
> > Thanks
> >
> > --wathsala
> >
> > > >
> > > > In soring is this the pair that update the tail and move head?
> > > >
> > > > __rte_soring_update_tail:
> > > > __rte_ring_update_tail(&rht->ht, head, next, st, enq);
> > >
> > > That function is used to update prod.tail and cons.tail
> > >
> > > > __rte_soring_move_cons_head:
> > > > __rte_ring_headtail_move_head(&r->cons.ht, &r->stage[stage].ht, 0,
> > > > ...);
> > > >
> > >
> > > __rte_ring_headtail_move_head is invoked to update prod.head (by
> > > __rte_soring_move_prod_head) and cons.head (by
> > > __rte_soring_move_cons_head).
> > >
> > > > If so, &rht->ht and &r->stage[stage].ht are the same address?
> > >
> > > No.
> > > In conventional ring we have just two stages: 'prod' and 'cons'.
> > > So moving prod.head depends on cons.tail and vice-versa: moving cons.head
> > > depends on prod.tail.
> > > With soring we have multiple (N) extra stages in between.
> > > So the dependency chain is a bit longer:
> > > prod.head depends on cons.tail
> > > stage[0].head depends on prod.tail,
> > > stage[1].head depends on stage[0].tail
> > > ...
> > > stage[N-1].head depends on stage[N-2].tail
> > > cons.head depends on stage[N-1].tail
> > >
> > > > If they are
> > > > not, then you will run into the issue you have seen (a.k.a
> > > > "Other-multi-copy atomic" which is legit in Arm v8 and above).
> > >
> > > Can you probably elaborate a bit more for me here?
> > > Thanks
> > > Konstantin
> > >
> > >
> > > > Thanks.
> > > >
> > > > --wathsala
> > > >
> > > >
> > > > > > -----Original Message-----
> > > > > > From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > > > > Sent: Monday, May 26, 2025 6:54 AM
> > > > > > To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> > > > > > dev@dpdk.org
> > > > > > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > > > > > jerinj@marvell.com; hemant.agrawal@nxp.com; drc@linux.ibm.com; nd
> > > > > > <nd@arm.com>
> > > > > > Subject: RE: [PATCH v1 3/4] ring: fix potential sync issue between
> > > > > > head and
> > > > > tail
> > > > > > values
> > > > > >
> > > > > > Hi Wathsala,
> > > > > >
> > > > > > >
> > > > > > > > Hi Konstantin,
> > > > > > >
> > > > > > > In rte_ring the store-release on tail update guarantees that CAS
> > > > > > > won't get reordered with the store-released of the tail update.
> > > > > > >
> > > > > > > So, the sequence of events would look like this (combined view
> > > > > > > of head and tail update)
> > > > > > >
> > > > > > > Relaxed-load(new_head, N) ----------------> (A)
> > > > > > > Relaxed-CAS(d->head, new_head, old_head) ----------------> (B)
> > > > > > > Store-release-store(d->tail, new_head) ----------------> (C)
> > > > > > >
> > > > > > > If we look at address dependencies, then...
> > > > > > >
> > > > > > > (B) depends on (A) due to new_head address dependency.
> > > > > > > (C) depends on (A) due to new_head address dependency.
> > > > > > >
> > > > > > > So, dependency graph looks like this
> > > > > > > (A)
> > > > > > > / \
> > > > > > > v v
> > > > > > > (B) (C)
> > > > > > >
> > > > > > > There is no implicit dependence between (B) and (C), I think
> > > > > > > this is the issue you are bringing up.
> > > > > > > Even though there is no dependence between the two, the
> > > > > > > store-release of (C) ensures that (B) won't drop below it.
> > > > > > > Therefore, the above graph can be turned into an ordered
> > > > > > > sequence as shown below..
> > > > > > >
> > > > > > > (A) -> (B) -> (C)
> > > > > >
> > > > > > I do agree that with current implementation of
> > > > > > __rte_ring_headtail_move_head()
> > > > > > in lib/ring/rte_ring_c11_pvt.h the order of these 3 operations
> > > > > > (A->B->C) should be sequential.
> > > > > > The problem I am talking about is a different one:
> > > > > > thread can see 'latest' 'cons.head' value, with 'previous' value
> > > > > > for 'prod.tail' or vice-versa.
> > > > > > In other words: 'cons.head' value depends on 'prod.tail', so
> > > > > > before making latest 'cons.head'
> > > > > > value visible to other threads, we need to ensure that latest
> > > > > > 'prod.tail' is also visible.
> > > > > > Let me try to explain it on the example:
> > > > > >
> > > > > > Suppose at some moment we have:
> > > > > > prod={.head=2,.tail=1};
> > > > > > cons={.head=0,.tail=0};
> > > > > > I.e. thread #1 is in the process of enqueueing one more element into the ring.
> > > > > >
> > > > > > Thread #1 Thread #2
> > > > > > T0:
> > > > > > store(&prod.tail, 2, release);
> > > > > > /*AFAIU: this is somewhat equivalent to: wmb(); prod.tail=2;
> > > > > > * I.E. - it guarantees that all stores initiated before that
> > > > > > operation will be visible
> > > > > > * by other threads at the same moment or before new value of
> > > > > > prod.tail will become
> > > > > > * visible, but it doesn't guarantee that new prod.tail value
> > > > > > will be visible to other
> > > > > > * threads immediately.
> > > > > > */
> > > > > > ...
> > > > > > move_cons_head(...,n=2)
> > > > > move_cons_head(...,n=1)
> > > > > > ... ...
> > > > > > T1:
> > > > > > *old_head = load(&cons.head, relaxed);
> > > > > > fence(acquire);
> > > > > > /*old_head == 0, no surprises */
> > > > > > stail = load(&prod.tail, acquire);
> > > > > > /*stail == 2, no surprises */
> > > > > > *entries = (capacity + stail - *old_head); *new_head = *old_head
> > > > > > + n;
> > > > > > /* *entries == (2 - 0) == 2; *new_head==2; all good */ ...
> > > > > > T2:
> > > > > >
> > > > > > *old_head = load(&cons.head, relaxed);
> > > > > > fence(acquire);
> > > > > >
> > > > > > /*old_head == 0, no
> > > > > surprises
> > > > > > */
> > > > > >
> > > > > > stail = load(&prod.tail, acquire);
> > > > > > /* !!!!! stail == 1 !!!!! for Thread 2
> > > > > > * Even though we do use acquire here - there was no *release*
> > > > > > after store(&prod.tail).
> > > > > > * So, still no guarantee that Thread 2 will see latest prod.tail value.
> > > > > > */
> > > > > >
> > > > > > *entries = (capacity + stail - *old_head);
> > > > > >
> > > > > > /* *entries == (1 - 0) == 1,
> > > > > still
> > > > > > ok */
> > > > > > *new_head =
> > > *old_head + n;
> > > > > >
> > > > > > /* *new_head == 1 */
> > > > > > T3:
> > > > > > success = CAS(&cons.head,
> > > > > > old_head /*==0*/, *new_head /*==2*/,
> > > > > > relaxed, relaxed);
> > > > > > /*success==1, cons.head==2*/
> > > > > > ...
> > > > > > T4:
> > > > > > success =
> > > CAS(&cons.head,
> > > > > >
> > > > > > old_head /*==0*/, *new_head /*==1*/,
> > > > > > relaxed, relaxed);
> > > > > >
> > > > > > /*success==0, *old_head==2*/
> > > > > > /* CAS() failed and provided Thread 2 with latest value for
> > > > > > cons.head(==2)
> > > > > > * Thread 2 repeats attempt, starts second iteration */
> > > > > > fence(acquire);
> > > > > >
> > > > > > stail = load(&prod.tail, acquire);
> > > > > > /* !!!!! stail == 1 !!!!! for Thread 2
> > > > > > * Still no *release* had happened after store(&prod.tail) at T0.
> > > > > > * So, still no guarantee that Thread 2 will see latest prod.tail value.
> > > > > > */
> > > > > >
> > > > > > *entries = (capacity + stail - *old_head);
> > > > > >
> > > > > > *new_head = *old_head + n;
> > > > > >
> > > > > > /* !!!!! *entries == (1 - 2) == -1(UINT32_MAX); *new_head==(2+1)==3;
> > !!!!!
> > > > > > * we are about to corrupt our ring !!!!!
> > > > > > */
> > > > > >
> > > > > > >
> > > > > > > I haven't looked at the so-ring yet. Could it be possible that
> > > > > > > the issue is due to something else introduced in that code?
> > > > > >
> > > > > > Well, as I said, so far I wasn't able to reproduce this problem
> > > > > > with conventional ring (ring_stress_autotest), only
> > > > > > soring_stress_autotest is failing and for now - only on one
> > > > > > specific ARM platform.
> > > > > > Regarding soring specific fix (without touching common code) -
> > > > > > sure it is also possible, pls see patch #2.
> > > > > > There I just added 'fence(release);' straight after 'store(&tail);'
> > > > > > That seems enough to fix that problem within the soring only.
> > > > > > Though, from my understanding rte_ring might also be affected,
> > > > > > that's why I went ahead and prepared that patch.
> > > > > > If you feel that I am missing something here - pls shout.
> > > > > > Konstantin
> > > > > >
> > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > --wathsala
> > > > > > >
> > > > > > > > This patch aims several purposes:
> > > > > > > > - provide an alternative (and I think a better) way to fix the
> > > > > > > > issue discussed in previous patch:
> > > > > > > > "ring/soring: fix synchronization issue between head and tail
> > values"
> > > > > > > > - make sure that such problem wouldn’t happen within other usages of
> > > > > > > > __rte_ring_headtail_move_head() – both current rte_ring
> > > > > > > > implementation and possible future use-cases.
> > > > > > > > - step towards unification of move_head() implementations and
> > > > > > > > removing rte_ring_generic_pvt.h
> > > > > > > > It uses Acquire-Release memory ordering for CAS operation in move_head().
> > > > > > > > That guarantees that corresponding ‘tail’ updates will be visible
> > > > > > > > before current ‘head’ is updated.
> > > > > > > > As I said before: I think that in theory the problem described in
> > > > > > > > previous patch might happen with our conventional rte_ring too
> > > > > > > > (when RTE_USE_C11_MEM_MODEL enabled).
> > > > > > > > But, so far I didn’t manage to reproduce it in reality.
> > > > > > > > For that reason and also because it touches a critical rte_ring
> > > > > > > > code-path, I put these changes into a separate patch. Expect all
> > > > > > > > interested stakeholders to come up with their comments and observations.
> > > > > > > > Regarding performance impact – on my boxes both ring_perf_autotest and
> > > > > > > > ring_stress_autotest show a mixed set of results: some of them become a
> > > > > > > > few cycles faster, another few cycles slower.
> > > > > > > > But so far, I didn’t notice any real degradations with that patch.
> > > > > > > >
> > > > > > > > Fixes: b5458e2cc483 ("ring: introduce staged ordered ring")
> > > > > > > > Fixes: 1cc363b8ce06 ("ring: introduce HTS ring mode")
> > > > > > > > Fixes: e6ba4731c0f3 ("ring: introduce RTS ring mode")
> > > > > > > > Fixes: 49594a63147a ("ring/c11: relax ordering for load and store of the head")
> > > > > > > >
> > > > > > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > > > > > > ---
> > > > > > > >  lib/ring/rte_ring_c11_pvt.h      | 27 +++++++++++++++++----------
> > > > > > > >  lib/ring/rte_ring_hts_elem_pvt.h |  6 ++++--
> > > > > > > >  lib/ring/rte_ring_rts_elem_pvt.h |  6 ++++--
> > > > > > > >  lib/ring/soring.c                |  5 -----
> > > > > > > >  4 files changed, 25 insertions(+), 19 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > > index 0845cd6dcf..6d1c46df9a 100644
> > > > > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > > > > @@ -77,20 +77,19 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
> > > > > > > >  	int success;
> > > > > > > >  	unsigned int max = n;
> > > > > > > >
> > > > > > > > +	/* Ensure the head is read before tail */
> > > > > > > >  	*old_head = rte_atomic_load_explicit(&d->head,
> > > > > > > > -			rte_memory_order_relaxed);
> > > > > > > > +			rte_memory_order_acquire);
> > > > > > > >  	do {
> > > > > > > >  		/* Reset n to the initial burst count */
> > > > > > > >  		n = max;
> > > > > > > >
> > > > > > > > -		/* Ensure the head is read before tail */
> > > > > > > > -		rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > > > > -
> > > > > > > > -		/* load-acquire synchronize with store-release of ht->tail
> > > > > > > > -		 * in update_tail.
> > > > > > > > +		/*
> > > > > > > > +		 * Read s->tail value. Note that it will be loaded after
> > > > > > > > +		 * d->head load, but before CAS operation for the d->head.
> > > > > > > >  		 */
> > > > > > > >  		stail = rte_atomic_load_explicit(&s->tail,
> > > > > > > > -				rte_memory_order_acquire);
> > > > > > > > +				rte_memory_order_relaxed);
> > > > > > > >
> > > > > > > >  		/* The subtraction is done between two unsigned 32bits value
> > > > > > > >  		 * (the result is always modulo 32 bits even if we have
> > > > > > > > @@ -112,11 +111,19 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
> > > > > > > >  			d->head = *new_head;
> > > > > > > >  			success = 1;
> > > > > > > >  		} else
> > > > > > > > -			/* on failure, *old_head is updated */
> > > > > > > > +			/*
> > > > > > > > +			 * on failure, *old_head is updated.
> > > > > > > > +			 * this CAS(ACQ_REL, ACQUIRE) serves as a hoist
> > > > > > > > +			 * barrier to prevent:
> > > > > > > > +			 * - OOO reads of cons tail value
> > > > > > > > +			 * - OOO copy of elems from the ring
> > > > > > > > +			 * Also RELEASE guarantees that latest tail value
> > > > > > > > +			 * will become visible before the new head value.
> > > > > > > > +			 */
> > > > > > > >  			success = rte_atomic_compare_exchange_strong_explicit(
> > > > > > > >  					&d->head, old_head, *new_head,
> > > > > > > > -					rte_memory_order_relaxed,
> > > > > > > > -					rte_memory_order_relaxed);
> > > > > > > > +					rte_memory_order_acq_rel,
> > > > > > > > +					rte_memory_order_acquire);
> > > > > > > >  	} while (unlikely(success == 0));
> > > > > > > >  	return n;
> > > > > > > >  }
> > > > > > > > diff --git a/lib/ring/rte_ring_hts_elem_pvt.h b/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > > > > index c59e5f6420..cc593433b9 100644
> > > > > > > > --- a/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > > > > +++ b/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > > > > @@ -116,13 +116,15 @@ __rte_ring_hts_move_head(struct rte_ring_hts_headtail *d,
> > > > > > > >  		np.pos.head = op.pos.head + n;
> > > > > > > >
> > > > > > > >  		/*
> > > > > > > > -		 * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> > > > > > > > +		 * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> > > > > > > >  		 * - OOO reads of cons tail value
> > > > > > > >  		 * - OOO copy of elems from the ring
> > > > > > > > +		 * Also RELEASE guarantees that latest tail value
> > > > > > > > +		 * will become visible before the new head value.
> > > > > > > >  		 */
> > > > > > > >  	} while (rte_atomic_compare_exchange_strong_explicit(&d->ht.raw,
> > > > > > > >  			(uint64_t *)(uintptr_t)&op.raw, np.raw,
> > > > > > > > -			rte_memory_order_acquire,
> > > > > > > > +			rte_memory_order_acq_rel,
> > > > > > > >  			rte_memory_order_acquire) == 0);
> > > > > > > >
> > > > > > > >  	*old_head = op.pos.head;
> > > > > > > > diff --git a/lib/ring/rte_ring_rts_elem_pvt.h b/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > > > > index 509fa674fb..860b13cc61 100644
> > > > > > > > --- a/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > > > > +++ b/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > > > > @@ -131,13 +131,15 @@ __rte_ring_rts_move_head(struct rte_ring_rts_headtail *d,
> > > > > > > >  		nh.val.cnt = oh.val.cnt + 1;
> > > > > > > >
> > > > > > > >  		/*
> > > > > > > > -		 * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> > > > > > > > +		 * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> > > > > > > >  		 * - OOO reads of cons tail value
> > > > > > > >  		 * - OOO copy of elems to the ring
> > > > > > > > +		 * Also RELEASE guarantees that latest tail value
> > > > > > > > +		 * will become visible before the new head value.
> > > > > > > >  		 */
> > > > > > > >  	} while (rte_atomic_compare_exchange_strong_explicit(&d->head.raw,
> > > > > > > >  			(uint64_t *)(uintptr_t)&oh.raw, nh.raw,
> > > > > > > > -			rte_memory_order_acquire,
> > > > > > > > +			rte_memory_order_acq_rel,
> > > > > > > >  			rte_memory_order_acquire) == 0);
> > > > > > > >
> > > > > > > >  	*old_head = oh.val.pos;
> > > > > > > > diff --git a/lib/ring/soring.c b/lib/ring/soring.c
> > > > > > > > index 7bcbf35516..21a1a27e24 100644
> > > > > > > > --- a/lib/ring/soring.c
> > > > > > > > +++ b/lib/ring/soring.c
> > > > > > > > @@ -123,8 +123,6 @@ __rte_soring_stage_finalize(struct soring_stage_headtail *sht, uint32_t stage,
> > > > > > > >  	rte_atomic_store_explicit(&sht->tail.raw, ot.raw,
> > > > > > > >  		rte_memory_order_release);
> > > > > > > >
> > > > > > > > -	/* make sure that new tail value is visible */
> > > > > > > > -	rte_atomic_thread_fence(rte_memory_order_release);
> > > > > > > >  	return i;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > @@ -219,9 +217,6 @@ __rte_soring_update_tail(struct __rte_ring_headtail *rht,
> > > > > > > >  		/* unsupported mode, shouldn't be here */
> > > > > > > >  		RTE_ASSERT(0);
> > > > > > > >  	}
> > > > > > > > -
> > > > > > > > -	/* make sure that new tail value is visible */
> > > > > > > > -	rte_atomic_thread_fence(rte_memory_order_release);
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  static __rte_always_inline uint32_t
> > > > > > > > --
> > > > > > > > 2.43.0