mbox series

[v3,0/5] Add non-blocking ring

Message ID 20190118152326.22686-1-gage.eads@intel.com (mailing list archive)
Headers
Series Add non-blocking ring |

Message

Eads, Gage Jan. 18, 2019, 3:23 p.m. UTC
  For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a non-blocking ring, on top of which a mempool can run.
Crucially, the non-blocking algorithm relies on a 128-bit compare-and-swap, so
it is currently limited to x86_64 machines. This is also an experimental API,
so RING_F_NB users must build with the ALLOW_EXPERIMENTAL_API flag.

The ring uses more compare-and-swap atomic operations than the regular rte ring:
With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations and a
dequeue of n pointers uses 2. This algorithm has worse average-case performance
than the regular rte ring (particularly a highly-contended ring with large bulk
accesses), however:
- For applications with preemptible pthreads, the regular rte ring's worst-case
  performance (i.e. one thread being preempted in the update_tail() critical
  section) is much worse than the non-blocking ring's.
- Software caching can mitigate the average case performance for ring-based
  algorithms. For example, a non-blocking ring based mempool (a likely use case
  for this ring) with per-thread caching.

The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-of-use,
existing ring enqueue/dequeue functions work with both "regular" and
non-blocking rings.

This patchset also adds non-blocking versions of ring_autotest and
ring_perf_autotest, and a non-blocking ring based mempool.

This patchset makes one API change; a deprecation notice will be posted in a
separate commit.

This patchset depends on the non-blocking stack patchset[1].

[1] http://mails.dpdk.org/archives/dev/2019-January/123653.html

v3:
 - Avoid the ABI break by putting 64-bit head and tail values in the same
   cacheline as struct rte_ring's prod and cons members.
 - Don't attempt to compile rte_atomic128_cmpset without
   ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
 - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone tries
   to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
 - Update the ring mempool to use experimental APIs
 - Clarify that RINB_F_NB is only limited to x86_64 currently; ARMv8.1-A builds
   can eventually support it with the CASP instruction.

v2:
 - Merge separate docs commit into patch #5
 - Convert uintptr_t to size_t
 - Add a compile-time check for the size of size_t
 - Fix a space-after-typecast issue
 - Fix an unnecessary-parentheses checkpatch warning
 - Bump librte_ring's library version

Gage Eads (5):
  ring: add 64-bit headtail structure
  ring: add a non-blocking implementation
  test_ring: add non-blocking ring autotest
  test_ring_perf: add non-blocking ring perf test
  mempool/ring: add non-blocking ring handlers

 doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
 drivers/mempool/ring/Makefile                   |   1 +
 drivers/mempool/ring/meson.build                |   2 +
 drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
 lib/librte_eventdev/rte_event_ring.h            |   2 +-
 lib/librte_ring/Makefile                        |   3 +-
 lib/librte_ring/rte_ring.c                      |  72 ++-
 lib/librte_ring/rte_ring.h                      | 574 ++++++++++++++++++++++--
 lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
 lib/librte_ring/rte_ring_version.map            |   7 +
 test/test/test_ring.c                           |  57 ++-
 test/test/test_ring_perf.c                      |  19 +-
 12 files changed, 874 insertions(+), 75 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic_64.h
  

Comments

Ola Liljedahl Jan. 22, 2019, 9:27 a.m. UTC | #1
On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> For some users, the rte ring's "non-preemptive" constraint is not
> acceptable;
> for example, if the application uses a mixture of pinned high-
> priority threads
> and multiplexed low-priority threads that share a mempool.
>
> This patchset introduces a non-blocking ring, on top of which a
> mempool can run.
> Crucially, the non-blocking algorithm relies on a 128-bit compare-
> and-swap, so
> it is currently limited to x86_64 machines. This is also an
> experimental API,
> so RING_F_NB users must build with the ALLOW_EXPERIMENTAL_API flag.
>
> The ring uses more compare-and-swap atomic operations than the
> regular rte ring:
> With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> operations and a
> dequeue of n pointers uses 2. This algorithm has worse average-case
> performance
> than the regular rte ring (particularly a highly-contended ring with
> large bulk
> accesses), however:
> - For applications with preemptible pthreads, the regular rte ring's
> worst-case
>   performance (i.e. one thread being preempted in the update_tail()
> critical
>   section) is much worse than the non-blocking ring's.
> - Software caching can mitigate the average case performance for
> ring-based
>   algorithms. For example, a non-blocking ring based mempool (a
> likely use case
>   for this ring) with per-thread caching.
>
> The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-
> of-use,
> existing ring enqueue/dequeue functions work with both "regular" and
> non-blocking rings.
>
> This patchset also adds non-blocking versions of ring_autotest and
> ring_perf_autotest, and a non-blocking ring based mempool.
>
> This patchset makes one API change; a deprecation notice will be
> posted in a
> separate commit.
>
> This patchset depends on the non-blocking stack patchset[1].
>
> [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
>
> v3:
>  - Avoid the ABI break by putting 64-bit head and tail values in the
> same
>    cacheline as struct rte_ring's prod and cons members.
>  - Don't attempt to compile rte_atomic128_cmpset without
>    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> libraries.
>  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> someone tries
>    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
>  - Update the ring mempool to use experimental APIs
>  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> ARMv8.1-A builds
>    can eventually support it with the CASP instruction.
ARMv8.0 should be able to implement a 128-bit atomic compare exchange
operation using LDXP/STXP.

From an ARM perspective, I want all atomic operations to take memory
ordering arguments (e.g. acquire, release). Not all usages of e.g.
atomic compare exchange require sequential consistency (which I think
what x86 cmpxchg instruction provides). DPDK functions should not be
modelled after x86 behaviour.

Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
are available here:
https://github.com/ARM-software/progress64/blob/master/src/lockfree.h

>
> v2:
>  - Merge separate docs commit into patch #5
>  - Convert uintptr_t to size_t
>  - Add a compile-time check for the size of size_t
>  - Fix a space-after-typecast issue
>  - Fix an unnecessary-parentheses checkpatch warning
>  - Bump librte_ring's library version
>
> Gage Eads (5):
>   ring: add 64-bit headtail structure
>   ring: add a non-blocking implementation
>   test_ring: add non-blocking ring autotest
>   test_ring_perf: add non-blocking ring perf test
>   mempool/ring: add non-blocking ring handlers
>
>  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
>  drivers/mempool/ring/Makefile                   |   1 +
>  drivers/mempool/ring/meson.build                |   2 +
>  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
>  lib/librte_eventdev/rte_event_ring.h            |   2 +-
>  lib/librte_ring/Makefile                        |   3 +-
>  lib/librte_ring/rte_ring.c                      |  72 ++-
>  lib/librte_ring/rte_ring.h                      | 574
> ++++++++++++++++++++++--
>  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
>  lib/librte_ring/rte_ring_version.map            |   7 +
>  test/test/test_ring.c                           |  57 ++-
>  test/test/test_ring_perf.c                      |  19 +-
>  12 files changed, 874 insertions(+), 75 deletions(-)
>  create mode 100644 lib/librte_ring/rte_ring_generic_64.h
>
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  
Ola Liljedahl Jan. 22, 2019, 10:15 a.m. UTC | #2
Sorry about the confidental footer. I tried to remove it using some Exhange
magic but it seems not to work with Evolution. I'll try some other way.

-- Ola

On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> >
> > For some users, the rte ring's "non-preemptive" constraint is not
> > acceptable;
> > for example, if the application uses a mixture of pinned high-
> > priority threads
> > and multiplexed low-priority threads that share a mempool.
> >
   I. This patchset introduces a non-blocking ring, on top of which a
> > mempool can run.
> > Crucially, the non-blocking algorithm relies on a 128-bit compare-
> > and-swap, so
> > it is currently limited to x86_64 machines. This is also an
> > experimental API,
> > so RING_F_NB users must build with the ALLOW_EXPERIMENTAL_API flag.
> >
> > The ring uses more compare-and-swap atomic operations than the
> > regular rte ring:
> > With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> > operations and a
> > dequeue of n pointers uses 2. This algorithm has worse average-case
> > performance
> > than the regular rte ring (particularly a highly-contended ring with
> > large bulk
> > accesses), however:
> > - For applications with preemptible pthreads, the regular rte ring's
> > worst-case
> >   performance (i.e. one thread being preempted in the update_tail()
> > critical
> >   section) is much worse than the non-blocking ring's.
> > - Software caching can mitigate the average case performance for
> > ring-based
> >   algorithms. For example, a non-blocking ring based mempool (a
> > likely use case
> >   for this ring) with per-thread caching.
> >
> > The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-
> > of-use,
> > existing ring enqueue/dequeue functions work with both "regular" and
> > non-blocking rings.
> >
> > This patchset also adds non-blocking versions of ring_autotest and
> > ring_perf_autotest, and a non-blocking ring based mempool.
> >
> > This patchset makes one API change; a deprecation notice will be
> > posted in a
> > separate commit.
> >
> > This patchset depends on the non-blocking stack patchset[1].
> >
> > [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> >
> > v3:
> >  - Avoid the ABI break by putting 64-bit head and tail values in the
> > same
> >    cacheline as struct rte_ring's prod and cons members.
> >  - Don't attempt to compile rte_atomic128_cmpset without
> >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > libraries.
> >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > someone tries
> >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> >  - Update the ring mempool to use experimental APIs
> >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > ARMv8.1-A builds
> >    can eventually support it with the CASP instruction.
> ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> operation using LDXP/STXP.
>
> From an ARM perspective, I want all atomic operations to take memory
> ordering arguments (e.g. acquire, release). Not all usages of e.g.
> atomic compare exchange require sequential consistency (which I think
> what x86 cmpxchg instruction provides). DPDK functions should not be
> modelled after x86 behaviour.
>
> Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
> are available here:
> https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
>
> >
> >
> > v2:
> >  - Merge separate docs commit into patch #5
> >  - Convert uintptr_t to size_t
> >  - Add a compile-time check for the size of size_t
> >  - Fix a space-after-typecast issue
> >  - Fix an unnecessary-parentheses checkpatch warning
> >  - Bump librte_ring's library version
> >
> > Gage Eads (5):
> >   ring: add 64-bit headtail structure
> >   ring: add a non-blocking implementation
> >   test_ring: add non-blocking ring autotest
> >   test_ring_perf: add non-blocking ring perf test
> >   mempool/ring: add non-blocking ring handlers
> >
> >  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
> >  drivers/mempool/ring/Makefile                   |   1 +
> >  drivers/mempool/ring/meson.build                |   2 +
> >  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
> >  lib/librte_eventdev/rte_event_ring.h            |   2 +-
> >  lib/librte_ring/Makefile                        |   3 +-
> >  lib/librte_ring/rte_ring.c                      |  72 ++-
> >  lib/librte_ring/rte_ring.h                      | 574
> > ++++++++++++++++++++++--
> >  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
> >  lib/librte_ring/rte_ring_version.map            |   7 +
> >  test/test/test_ring.c                           |  57 ++-
> >  test/test/test_ring_perf.c                      |  19 +-
> >  12 files changed, 874 insertions(+), 75 deletions(-)
> >  create mode 100644 lib/librte_ring/rte_ring_generic_64.h
> >
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose the
> contents to any other person, use it for any purpose, or store or copy the
> information in any medium. Thank you.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  
Eads, Gage Jan. 22, 2019, 7:15 p.m. UTC | #3
> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> Sent: Tuesday, January 22, 2019 3:28 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; stephen@networkplumber.org; Richardson, Bruce
> <bruce.richardson@intel.com>; arybchenko@solarflare.com; Ananyev,
> Konstantin <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> 
> On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > For some users, the rte ring's "non-preemptive" constraint is not
> > acceptable; for example, if the application uses a mixture of pinned
> > high- priority threads and multiplexed low-priority threads that share
> > a mempool.
> >
> > This patchset introduces a non-blocking ring, on top of which a
> > mempool can run.
> > Crucially, the non-blocking algorithm relies on a 128-bit compare-
> > and-swap, so it is currently limited to x86_64 machines. This is also
> > an experimental API, so RING_F_NB users must build with the
> > ALLOW_EXPERIMENTAL_API flag.
> >
> > The ring uses more compare-and-swap atomic operations than the regular
> > rte ring:
> > With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> > operations and a dequeue of n pointers uses 2. This algorithm has
> > worse average-case performance than the regular rte ring (particularly
> > a highly-contended ring with large bulk accesses), however:
> > - For applications with preemptible pthreads, the regular rte ring's
> > worst-case
> >   performance (i.e. one thread being preempted in the update_tail()
> > critical
> >   section) is much worse than the non-blocking ring's.
> > - Software caching can mitigate the average case performance for
> > ring-based
> >   algorithms. For example, a non-blocking ring based mempool (a likely
> > use case
> >   for this ring) with per-thread caching.
> >
> > The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-
> > of-use, existing ring enqueue/dequeue functions work with both
> > "regular" and non-blocking rings.
> >
> > This patchset also adds non-blocking versions of ring_autotest and
> > ring_perf_autotest, and a non-blocking ring based mempool.
> >
> > This patchset makes one API change; a deprecation notice will be
> > posted in a separate commit.
> >
> > This patchset depends on the non-blocking stack patchset[1].
> >
> > [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> >
> > v3:
> >  - Avoid the ABI break by putting 64-bit head and tail values in the
> > same
> >    cacheline as struct rte_ring's prod and cons members.
> >  - Don't attempt to compile rte_atomic128_cmpset without
> >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > libraries.
> >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > someone tries
> >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> >  - Update the ring mempool to use experimental APIs
> >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > ARMv8.1-A builds
> >    can eventually support it with the CASP instruction.
> ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> operation using LDXP/STXP.

I see, I wasn't aware these instructions were available.

> 
> From an ARM perspective, I want all atomic operations to take memory ordering
> arguments (e.g. acquire, release). Not all usages of e.g.
> atomic compare exchange require sequential consistency (which I think what
> x86 cmpxchg instruction provides). DPDK functions should not be modelled after
> x86 behaviour.
> 
> Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64 are
> available here:
> https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> 

Sure, I'll address this in the next patchset.

Thanks,
Gage
  
Jerin Jacob Kollanukkaran Jan. 23, 2019, 4:02 p.m. UTC | #4
On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > v3:
> >  - Avoid the ABI break by putting 64-bit head and tail values in
> > the
> > same
> >    cacheline as struct rte_ring's prod and cons members.
> >  - Don't attempt to compile rte_atomic128_cmpset without
> >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > libraries.
> >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > someone tries
> >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> >  - Update the ring mempool to use experimental APIs
> >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > ARMv8.1-A builds
> >    can eventually support it with the CASP instruction.
> ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> operation using LDXP/STXP.

Just wondering what would the performance difference between CASP vs
LDXP/STXP on LSE supported machine?

I think, We can not detect the presese of LSE support in compile time.
Right?

The dynamic one will be costly like,

if (hwcaps & HWCAP_ATOMICS) {
	casp
} else {
	ldxp
	stxp
}

> From an ARM perspective, I want all atomic operations to take memory
> ordering arguments (e.g. acquire, release). Not all usages of e.g.

+1

> atomic compare exchange require sequential consistency (which I think
> what x86 cmpxchg instruction provides). DPDK functions should not be
> modelled after x86 behaviour.
> 
> Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
> are available here:
> https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
>
  
Ola Liljedahl Jan. 23, 2019, 4:29 p.m. UTC | #5
On Wed, 2019-01-23 at 16:02 +0000, Jerin Jacob Kollanukkaran wrote:
> On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> > 
> > On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > > 
> > > v3:
> > >  - Avoid the ABI break by putting 64-bit head and tail values in
> > > the
> > > same
> > >    cacheline as struct rte_ring's prod and cons members.
> > >  - Don't attempt to compile rte_atomic128_cmpset without
> > >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > > libraries.
> > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > > someone tries
> > >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> > >  - Update the ring mempool to use experimental APIs
> > >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > > ARMv8.1-A builds
> > >    can eventually support it with the CASP instruction.
> > ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> > operation using LDXP/STXP.
> Just wondering what would the performance difference between CASP vs
> LDXP/STXP on LSE supported machine?
I think that is up to the microarchitecture. But one the ideas behind
introducing the LSE atomics was that they should be "better" than the equivalent
code using exclusives. I think non-conditional LDxxx and STxxx atomics could be
better than using exclusives while conditional atomics (CAS, CASP) might not be
so different (the reason has to do with cache coherency, a core can
speculatively snoop-unique the cache line which is targetted by an atomic
instruction but to what extent that provides a benefit could be depend on
whether the atomic actually performs a store or not).

> 
> I think, We can not detect the presese of LSE support in compile time.
> Right?
Unfortunately, AFAIK GCC doesn't notify the source code that it is targetting
v8.1+ with LSE support. If there were intrinsics for (certain) LSE instructions
(e.g. those not generated by the compiler, e.g. STxxx and CASP), we could use
some corresponding preprocessor define to detect the presence of such intrinsics
(they exist for other intrinsics, e.g. __ARM_FEATURE_QRDMX for SQRDMLAH/SQRDMLSH
instructions and corresponding intrinsics).

I have tried to interest the Arm GCC developers in this but have not yet
succeeded. Perhaps if we have more use cases were atomics intrinsics would be
useful, we could convince them to add such intrinsics to the ACLE (ARM C
Language Extensions). But we will never get intrinsics for exclusives, they are
deemed unsafe for explicit use from C. Instead need to provide inline assembler
that contains the complete exclusives sequence. But in practice it seems to work
with using inline assembler for LDXR and STXR as I do in the lockfree code
linked below.

> 
> The dynamic one will be costly like,
Do you think so? Shouldn't this branch be perfectly predictable? Once in a while
it will fall out of the branch history table but doesn't that mean the
application hasn't been executing this code for some time so not really
performance critical?

> 
> if (hwcaps & HWCAP_ATOMICS) {
> 	casp
> } else {
> 	ldxp
> 	stxp
> }
> 
> > 
> > From an ARM perspective, I want all atomic operations to take memory
> > ordering arguments (e.g. acquire, release). Not all usages of e.g.
> +1
> 
> > 
> > atomic compare exchange require sequential consistency (which I think
> > what x86 cmpxchg instruction provides). DPDK functions should not be
> > modelled after x86 behaviour.
> > 
> > Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
> > are available here:
> > https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> >
  
Honnappa Nagarahalli Jan. 25, 2019, 5:20 a.m. UTC | #6
Hi Gage,
	Thank you for this patch. Arm (Ola Liljedahl) had worked on a non-blocking ring algorithm. We were planning to add it to DPDK at some point this year. I am wondering if you would be open to take a look at the algorithm and collaborate?

I am yet to fully understand both the algorithms. But, Ola has reviewed your patch and can provide a quick overview of the differences here.

If you agree, we can send a RFC patch. You can review that and do performance benchmarking on your platforms. I can also benchmark your patch (may be once you fix the issue identified in __rte_ring_do_nb_enqueue_mp  function?) on Arm platforms. May be we can end up with a better combined algorithm.

Hi Thomas/Bruce,
	Please let me know if this is ok and if there is a better way to do this.

Thank you,
Honnappa

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> Sent: Friday, January 18, 2019 9:23 AM
> To: dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> stephen@networkplumber.org
> Subject: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> 
> For some users, the rte ring's "non-preemptive" constraint is not acceptable;
> for example, if the application uses a mixture of pinned high-priority threads
> and multiplexed low-priority threads that share a mempool.
> 
> This patchset introduces a non-blocking ring, on top of which a mempool can
> run.
> Crucially, the non-blocking algorithm relies on a 128-bit compare-and-swap,
> so it is currently limited to x86_64 machines. This is also an experimental API,
> so RING_F_NB users must build with the ALLOW_EXPERIMENTAL_API flag.
> 
> The ring uses more compare-and-swap atomic operations than the regular rte
> ring:
> With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations
> and a dequeue of n pointers uses 2. This algorithm has worse average-case
> performance than the regular rte ring (particularly a highly-contended ring
> with large bulk accesses), however:
> - For applications with preemptible pthreads, the regular rte ring's worst-case
>   performance (i.e. one thread being preempted in the update_tail() critical
>   section) is much worse than the non-blocking ring's.
> - Software caching can mitigate the average case performance for ring-based
>   algorithms. For example, a non-blocking ring based mempool (a likely use
> case
>   for this ring) with per-thread caching.
> 
> The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-of-use,
> existing ring enqueue/dequeue functions work with both "regular" and non-
> blocking rings.
> 
> This patchset also adds non-blocking versions of ring_autotest and
> ring_perf_autotest, and a non-blocking ring based mempool.
> 
> This patchset makes one API change; a deprecation notice will be posted in a
> separate commit.
> 
> This patchset depends on the non-blocking stack patchset[1].
> 
> [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> 
> v3:
>  - Avoid the ABI break by putting 64-bit head and tail values in the same
>    cacheline as struct rte_ring's prod and cons members.
>  - Don't attempt to compile rte_atomic128_cmpset without
>    ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
>  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone
> tries
>    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
>  - Update the ring mempool to use experimental APIs
>  - Clarify that RINB_F_NB is only limited to x86_64 currently; ARMv8.1-A
> builds
>    can eventually support it with the CASP instruction.
> 
> v2:
>  - Merge separate docs commit into patch #5
>  - Convert uintptr_t to size_t
>  - Add a compile-time check for the size of size_t
>  - Fix a space-after-typecast issue
>  - Fix an unnecessary-parentheses checkpatch warning
>  - Bump librte_ring's library version
> 
> Gage Eads (5):
>   ring: add 64-bit headtail structure
>   ring: add a non-blocking implementation
>   test_ring: add non-blocking ring autotest
>   test_ring_perf: add non-blocking ring perf test
>   mempool/ring: add non-blocking ring handlers
> 
>  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
>  drivers/mempool/ring/Makefile                   |   1 +
>  drivers/mempool/ring/meson.build                |   2 +
>  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
>  lib/librte_eventdev/rte_event_ring.h            |   2 +-
>  lib/librte_ring/Makefile                        |   3 +-
>  lib/librte_ring/rte_ring.c                      |  72 ++-
>  lib/librte_ring/rte_ring.h                      | 574 ++++++++++++++++++++++--
>  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
>  lib/librte_ring/rte_ring_version.map            |   7 +
>  test/test/test_ring.c                           |  57 ++-
>  test/test/test_ring_perf.c                      |  19 +-
>  12 files changed, 874 insertions(+), 75 deletions(-)  create mode 100644
> lib/librte_ring/rte_ring_generic_64.h
> 
> --
> 2.13.6
  
Eads, Gage Jan. 25, 2019, 5:42 p.m. UTC | #7
Hi Honnappa,

Works for me -- I'm in favor of the best performing implementation, whoever provides it.

To allow an apples-to-apples comparison, I suggest Ola's/ARM's implementation be made to fit into the rte_ring API with an associated mempool handler. That'll allow us to use the existing ring and mempool performance tests as well. Feel free to use code from this patchset for the rte_ring integration, if that helps, of course.

I expect to have v4 available within the next week.

Thanks,
Gage

> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, January 24, 2019 11:21 PM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> <Song.Zhu@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> 
> Hi Gage,
> 	Thank you for this patch. Arm (Ola Liljedahl) had worked on a non-
> blocking ring algorithm. We were planning to add it to DPDK at some point this
> year. I am wondering if you would be open to take a look at the algorithm and
> collaborate?
> 
> I am yet to fully understand both the algorithms. But, Ola has reviewed your
> patch and can provide a quick overview of the differences here.
> 
> If you agree, we can send a RFC patch. You can review that and do performance
> benchmarking on your platforms. I can also benchmark your patch (may be once
> you fix the issue identified in __rte_ring_do_nb_enqueue_mp  function?) on Arm
> platforms. May be we can end up with a better combined algorithm.
> 
> Hi Thomas/Bruce,
> 	Please let me know if this is ok and if there is a better way to do this.
> 
> Thank you,
> Honnappa
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > Sent: Friday, January 18, 2019 9:23 AM
> > To: dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> > stephen@networkplumber.org
> > Subject: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> >
> > For some users, the rte ring's "non-preemptive" constraint is not
> > acceptable; for example, if the application uses a mixture of pinned
> > high-priority threads and multiplexed low-priority threads that share a
> mempool.
> >
> > This patchset introduces a non-blocking ring, on top of which a
> > mempool can run.
> > Crucially, the non-blocking algorithm relies on a 128-bit
> > compare-and-swap, so it is currently limited to x86_64 machines. This
> > is also an experimental API, so RING_F_NB users must build with the
> ALLOW_EXPERIMENTAL_API flag.
> >
> > The ring uses more compare-and-swap atomic operations than the regular
> > rte
> > ring:
> > With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> > operations and a dequeue of n pointers uses 2. This algorithm has
> > worse average-case performance than the regular rte ring (particularly
> > a highly-contended ring with large bulk accesses), however:
> > - For applications with preemptible pthreads, the regular rte ring's worst-case
> >   performance (i.e. one thread being preempted in the update_tail() critical
> >   section) is much worse than the non-blocking ring's.
> > - Software caching can mitigate the average case performance for ring-based
> >   algorithms. For example, a non-blocking ring based mempool (a likely
> > use case
> >   for this ring) with per-thread caching.
> >
> > The non-blocking ring is enabled via a new flag, RING_F_NB. For
> > ease-of-use, existing ring enqueue/dequeue functions work with both
> > "regular" and non- blocking rings.
> >
> > This patchset also adds non-blocking versions of ring_autotest and
> > ring_perf_autotest, and a non-blocking ring based mempool.
> >
> > This patchset makes one API change; a deprecation notice will be
> > posted in a separate commit.
> >
> > This patchset depends on the non-blocking stack patchset[1].
> >
> > [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> >
> > v3:
> >  - Avoid the ABI break by putting 64-bit head and tail values in the same
> >    cacheline as struct rte_ring's prod and cons members.
> >  - Don't attempt to compile rte_atomic128_cmpset without
> >    ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
> >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > someone tries
> >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> >  - Update the ring mempool to use experimental APIs
> >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > ARMv8.1-A builds
> >    can eventually support it with the CASP instruction.
> >
> > v2:
> >  - Merge separate docs commit into patch #5
> >  - Convert uintptr_t to size_t
> >  - Add a compile-time check for the size of size_t
> >  - Fix a space-after-typecast issue
> >  - Fix an unnecessary-parentheses checkpatch warning
> >  - Bump librte_ring's library version
> >
> > Gage Eads (5):
> >   ring: add 64-bit headtail structure
> >   ring: add a non-blocking implementation
> >   test_ring: add non-blocking ring autotest
> >   test_ring_perf: add non-blocking ring perf test
> >   mempool/ring: add non-blocking ring handlers
> >
> >  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
> >  drivers/mempool/ring/Makefile                   |   1 +
> >  drivers/mempool/ring/meson.build                |   2 +
> >  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
> >  lib/librte_eventdev/rte_event_ring.h            |   2 +-
> >  lib/librte_ring/Makefile                        |   3 +-
> >  lib/librte_ring/rte_ring.c                      |  72 ++-
> >  lib/librte_ring/rte_ring.h                      | 574 ++++++++++++++++++++++--
> >  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
> >  lib/librte_ring/rte_ring_version.map            |   7 +
> >  test/test/test_ring.c                           |  57 ++-
> >  test/test/test_ring_perf.c                      |  19 +-
> >  12 files changed, 874 insertions(+), 75 deletions(-)  create mode
> > 100644 lib/librte_ring/rte_ring_generic_64.h
> >
> > --
> > 2.13.6
  
Eads, Gage Jan. 25, 2019, 5:56 p.m. UTC | #8
> -----Original Message-----
> From: Eads, Gage
> Sent: Friday, January 25, 2019 11:43 AM
> To: 'Honnappa Nagarahalli' <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> <Song.Zhu@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> 
> Hi Honnappa,
> 
> Works for me -- I'm in favor of the best performing implementation, whoever
> provides it.
> 
> To allow an apples-to-apples comparison, I suggest Ola's/ARM's implementation
> be made to fit into the rte_ring API with an associated mempool handler. That'll
> allow us to use the existing ring and mempool performance tests as well. Feel
> free to use code from this patchset for the rte_ring integration, if that helps, of
> course.
> 

But also, if Ola/ARM's algorithm is sufficiently similar to this one, it's probably better to tweak this patchset's enqueue and dequeue functions with any improvements you can identify rather than creating an entirely separate implementation.

> I expect to have v4 available within the next week.
> 
> Thanks,
> Gage
> 
> > -----Original Message-----
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Thursday, January 24, 2019 11:21 PM
> > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> > Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> > <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> > <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> > <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> > <Song.Zhu@arm.com>; nd <nd@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> >
> > Hi Gage,
> > 	Thank you for this patch. Arm (Ola Liljedahl) had worked on a non-
> > blocking ring algorithm. We were planning to add it to DPDK at some
> > point this year. I am wondering if you would be open to take a look at
> > the algorithm and collaborate?
> >
> > I am yet to fully understand both the algorithms. But, Ola has
> > reviewed your patch and can provide a quick overview of the differences here.
> >
> > If you agree, we can send a RFC patch. You can review that and do
> > performance benchmarking on your platforms. I can also benchmark your
> > patch (may be once you fix the issue identified in
> > __rte_ring_do_nb_enqueue_mp  function?) on Arm platforms. May be we can
> end up with a better combined algorithm.
> >
> > Hi Thomas/Bruce,
> > 	Please let me know if this is ok and if there is a better way to do this.
> >
> > Thank you,
> > Honnappa
> >
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > > Sent: Friday, January 18, 2019 9:23 AM
> > > To: dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > > bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> > > stephen@networkplumber.org
> > > Subject: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> > >
> > > For some users, the rte ring's "non-preemptive" constraint is not
> > > acceptable; for example, if the application uses a mixture of pinned
> > > high-priority threads and multiplexed low-priority threads that
> > > share a
> > mempool.
> > >
> > > This patchset introduces a non-blocking ring, on top of which a
> > > mempool can run.
> > > Crucially, the non-blocking algorithm relies on a 128-bit
> > > compare-and-swap, so it is currently limited to x86_64 machines.
> > > This is also an experimental API, so RING_F_NB users must build with
> > > the
> > ALLOW_EXPERIMENTAL_API flag.
> > >
> > > The ring uses more compare-and-swap atomic operations than the
> > > regular rte
> > > ring:
> > > With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> > > operations and a dequeue of n pointers uses 2. This algorithm has
> > > worse average-case performance than the regular rte ring
> > > (particularly a highly-contended ring with large bulk accesses), however:
> > > - For applications with preemptible pthreads, the regular rte ring's worst-
> case
> > >   performance (i.e. one thread being preempted in the update_tail() critical
> > >   section) is much worse than the non-blocking ring's.
> > > - Software caching can mitigate the average case performance for ring-
> based
> > >   algorithms. For example, a non-blocking ring based mempool (a
> > > likely use case
> > >   for this ring) with per-thread caching.
> > >
> > > The non-blocking ring is enabled via a new flag, RING_F_NB. For
> > > ease-of-use, existing ring enqueue/dequeue functions work with both
> > > "regular" and non- blocking rings.
> > >
> > > This patchset also adds non-blocking versions of ring_autotest and
> > > ring_perf_autotest, and a non-blocking ring based mempool.
> > >
> > > This patchset makes one API change; a deprecation notice will be
> > > posted in a separate commit.
> > >
> > > This patchset depends on the non-blocking stack patchset[1].
> > >
> > > [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> > >
> > > v3:
> > >  - Avoid the ABI break by putting 64-bit head and tail values in the same
> > >    cacheline as struct rte_ring's prod and cons members.
> > >  - Don't attempt to compile rte_atomic128_cmpset without
> > >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> libraries.
> > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > > someone tries
> > >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> > >  - Update the ring mempool to use experimental APIs
> > >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > > ARMv8.1-A builds
> > >    can eventually support it with the CASP instruction.
> > >
> > > v2:
> > >  - Merge separate docs commit into patch #5
> > >  - Convert uintptr_t to size_t
> > >  - Add a compile-time check for the size of size_t
> > >  - Fix a space-after-typecast issue
> > >  - Fix an unnecessary-parentheses checkpatch warning
> > >  - Bump librte_ring's library version
> > >
> > > Gage Eads (5):
> > >   ring: add 64-bit headtail structure
> > >   ring: add a non-blocking implementation
> > >   test_ring: add non-blocking ring autotest
> > >   test_ring_perf: add non-blocking ring perf test
> > >   mempool/ring: add non-blocking ring handlers
> > >
> > >  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
> > >  drivers/mempool/ring/Makefile                   |   1 +
> > >  drivers/mempool/ring/meson.build                |   2 +
> > >  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
> > >  lib/librte_eventdev/rte_event_ring.h            |   2 +-
> > >  lib/librte_ring/Makefile                        |   3 +-
> > >  lib/librte_ring/rte_ring.c                      |  72 ++-
> > >  lib/librte_ring/rte_ring.h                      | 574 ++++++++++++++++++++++--
> > >  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
> > >  lib/librte_ring/rte_ring_version.map            |   7 +
> > >  test/test/test_ring.c                           |  57 ++-
> > >  test/test/test_ring_perf.c                      |  19 +-
> > >  12 files changed, 874 insertions(+), 75 deletions(-)  create mode
> > > 100644 lib/librte_ring/rte_ring_generic_64.h
> > >
> > > --
> > > 2.13.6
  
Ola Liljedahl Jan. 28, 2019, 10:41 a.m. UTC | #9
On Fri, 2019-01-25 at 17:56 +0000, Eads, Gage wrote:
> 
> > 
> > -----Original Message-----
> > From: Eads, Gage
> > Sent: Friday, January 25, 2019 11:43 AM
> > To: 'Honnappa Nagarahalli' <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> > <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> > <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> > <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> > <Song.Zhu@arm.com>; nd <nd@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> > 
> > Hi Honnappa,
> > 
> > Works for me -- I'm in favor of the best performing implementation, whoever
> > provides it.
> > 
> > To allow an apples-to-apples comparison, I suggest Ola's/ARM's
> > implementation
> > be made to fit into the rte_ring API with an associated mempool handler.
> > That'll
> > allow us to use the existing ring and mempool performance tests as well.
> > Feel
> > free to use code from this patchset for the rte_ring integration, if that
> > helps, of
> > course.
> > 
> But also, if Ola/ARM's algorithm is sufficiently similar to this one, it's
> probably better to tweak this patchset's enqueue and dequeue functions with
> any improvements you can identify rather than creating an entirely separate
> implementation.
There are strong similarities. But my implementation is separate from rte_ring
(whose code is a mess) which also freed me from any interoperatibility with the
rte_ring code and data structure (with two pairs of head+tail which is
unnecessary for the lock-free ring buffer).

My design and implementation is here:
https://github.com/ARM-software/progress64/blob/master/src/p64_lfring.c
I have a DPDK version in flight. Merging the relevant changes into your patch
makes sense. There are some differences we will have to agree on.

> 
> > 
> > I expect to have v4 available within the next week.
> > 
> > Thanks,
> > Gage
> > 
> > > 
> > > -----Original Message-----
> > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > Sent: Thursday, January 24, 2019 11:21 PM
> > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> > > Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> > > <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> > > <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> > > <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> > > <Song.Zhu@arm.com>; nd <nd@arm.com>
> > > Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> > > 
> > > Hi Gage,
> > > 	Thank you for this patch. Arm (Ola Liljedahl) had worked on a non-
> > > blocking ring algorithm. We were planning to add it to DPDK at some
> > > point this year. I am wondering if you would be open to take a look at
> > > the algorithm and collaborate?
> > > 
> > > I am yet to fully understand both the algorithms. But, Ola has
> > > reviewed your patch and can provide a quick overview of the differences
> > > here.
> > > 
> > > If you agree, we can send a RFC patch. You can review that and do
> > > performance benchmarking on your platforms. I can also benchmark your
> > > patch (may be once you fix the issue identified in
> > > __rte_ring_do_nb_enqueue_mp  function?) on Arm platforms. May be we can
> > end up with a better combined algorithm.
> > > 
> > > 
> > > Hi Thomas/Bruce,
> > > 	Please let me know if this is ok and if there is a better way to do
> > > this.
> > > 
> > > Thank you,
> > > Honnappa
> > > 
> > > > 
> > > > -----Original Message-----
> > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > > > Sent: Friday, January 18, 2019 9:23 AM
> > > > To: dev@dpdk.org
> > > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > > > bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> > > > stephen@networkplumber.org
> > > > Subject: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> > > > 
> > > > For some users, the rte ring's "non-preemptive" constraint is not
> > > > acceptable; for example, if the application uses a mixture of pinned
> > > > high-priority threads and multiplexed low-priority threads that
> > > > share a
> > > mempool.
> > > > 
> > > > 
> > > > This patchset introduces a non-blocking ring, on top of which a
> > > > mempool can run.
> > > > Crucially, the non-blocking algorithm relies on a 128-bit
> > > > compare-and-swap, so it is currently limited to x86_64 machines.
> > > > This is also an experimental API, so RING_F_NB users must build with
> > > > the
> > > ALLOW_EXPERIMENTAL_API flag.
> > > > 
> > > > 
> > > > The ring uses more compare-and-swap atomic operations than the
> > > > regular rte
> > > > ring:
> > > > With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> > > > operations and a dequeue of n pointers uses 2. This algorithm has
> > > > worse average-case performance than the regular rte ring
> > > > (particularly a highly-contended ring with large bulk accesses),
> > > > however:
> > > > - For applications with preemptible pthreads, the regular rte ring's
> > > > worst-
> > case
> > > 
> > > > 
> > > >   performance (i.e. one thread being preempted in the update_tail()
> > > > critical
> > > >   section) is much worse than the non-blocking ring's.
> > > > - Software caching can mitigate the average case performance for ring-
> > based
> > > 
> > > > 
> > > >   algorithms. For example, a non-blocking ring based mempool (a
> > > > likely use case
> > > >   for this ring) with per-thread caching.
> > > > 
> > > > The non-blocking ring is enabled via a new flag, RING_F_NB. For
> > > > ease-of-use, existing ring enqueue/dequeue functions work with both
> > > > "regular" and non- blocking rings.
> > > > 
> > > > This patchset also adds non-blocking versions of ring_autotest and
> > > > ring_perf_autotest, and a non-blocking ring based mempool.
> > > > 
> > > > This patchset makes one API change; a deprecation notice will be
> > > > posted in a separate commit.
> > > > 
> > > > This patchset depends on the non-blocking stack patchset[1].
> > > > 
> > > > [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> > > > 
> > > > v3:
> > > >  - Avoid the ABI break by putting 64-bit head and tail values in the
> > > > same
> > > >    cacheline as struct rte_ring's prod and cons members.
> > > >  - Don't attempt to compile rte_atomic128_cmpset without
> > > >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > libraries.
> > > 
> > > > 
> > > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > > > someone tries
> > > >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> > > >  - Update the ring mempool to use experimental APIs
> > > >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > > > ARMv8.1-A builds
> > > >    can eventually support it with the CASP instruction.
> > > > 
> > > > v2:
> > > >  - Merge separate docs commit into patch #5
> > > >  - Convert uintptr_t to size_t
> > > >  - Add a compile-time check for the size of size_t
> > > >  - Fix a space-after-typecast issue
> > > >  - Fix an unnecessary-parentheses checkpatch warning
> > > >  - Bump librte_ring's library version
> > > > 
> > > > Gage Eads (5):
> > > >   ring: add 64-bit headtail structure
> > > >   ring: add a non-blocking implementation
> > > >   test_ring: add non-blocking ring autotest
> > > >   test_ring_perf: add non-blocking ring perf test
> > > >   mempool/ring: add non-blocking ring handlers
> > > > 
> > > >  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
> > > >  drivers/mempool/ring/Makefile                   |   1 +
> > > >  drivers/mempool/ring/meson.build                |   2 +
> > > >  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
> > > >  lib/librte_eventdev/rte_event_ring.h            |   2 +-
> > > >  lib/librte_ring/Makefile                        |   3 +-
> > > >  lib/librte_ring/rte_ring.c                      |  72 ++-
> > > >  lib/librte_ring/rte_ring.h                      | 574
> > > > ++++++++++++++++++++++--
> > > >  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
> > > >  lib/librte_ring/rte_ring_version.map            |   7 +
> > > >  test/test/test_ring.c                           |  57 ++-
> > > >  test/test/test_ring_perf.c                      |  19 +-
> > > >  12 files changed, 874 insertions(+), 75 deletions(-)  create mode
> > > > 100644 lib/librte_ring/rte_ring_generic_64.h
> > > > 
> > > > --
> > > > 2.13.6
  
Jerin Jacob Kollanukkaran Jan. 28, 2019, 1:10 p.m. UTC | #10
On Wed, 2019-01-23 at 16:29 +0000, Ola Liljedahl wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> On Wed, 2019-01-23 at 16:02 +0000, Jerin Jacob Kollanukkaran wrote:
> > On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> > > On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > > > v3:
> > > >  - Avoid the ABI break by putting 64-bit head and tail values
> > > > in
> > > > the
> > > > same
> > > >    cacheline as struct rte_ring's prod and cons members.
> > > >  - Don't attempt to compile rte_atomic128_cmpset without
> > > >    ALLOW_EXPERIMENTAL_API, as this would break a large number
> > > > of
> > > > libraries.
> > > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in
> > > > case
> > > > someone tries
> > > >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> > > >  - Update the ring mempool to use experimental APIs
> > > >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > > > ARMv8.1-A builds
> > > >    can eventually support it with the CASP instruction.
> > > ARMv8.0 should be able to implement a 128-bit atomic compare
> > > exchange
> > > operation using LDXP/STXP.
> > Just wondering what would the performance difference between CASP
> > vs
> > LDXP/STXP on LSE supported machine?
> I think that is up to the microarchitecture. But one the ideas behind

Yes. This is where things are getting little messy to have generic code
where a lot of stuff is defined based on micro
architecture/IMPLEMENTATION DEFINED as arm spec. Al least, I am dealing
with three different micro archirectures now with a lot of difference.
Including the arm cores and qualcomm cores there could around >6ish
different micro archtectures.


> introducing the LSE atomics was that they should be "better" than the
> equivalent
> code using exclusives. I think non-conditional LDxxx and STxxx
> atomics could be
> better than using exclusives while conditional atomics (CAS, CASP)
> might not be
> so different (the reason has to do with cache coherency, a core can
> speculatively snoop-unique the cache line which is targetted by an
> atomic
> instruction but to what extent that provides a benefit could be
> depend on
> whether the atomic actually performs a store or not).
> 
> > I think, We can not detect the presese of LSE support in compile
> > time.
> > Right?
> Unfortunately, AFAIK GCC doesn't notify the source code that it is
> targetting
> v8.1+ with LSE support. If there were intrinsics for (certain) LSE
> instructions
> (e.g. those not generated by the compiler, e.g. STxxx and CASP), we
> could use
> some corresponding preprocessor define to detect the presence of such
> intrinsics
> (they exist for other intrinsics, e.g. __ARM_FEATURE_QRDMX for
> SQRDMLAH/SQRDMLSH
> instructions and corresponding intrinsics).
> 
> I have tried to interest the Arm GCC developers in this but have not
> yet
> succeeded. Perhaps if we have more use cases were atomics intrinsics
> would be
> useful, we could convince them to add such intrinsics to the ACLE
> (ARM C
> Language Extensions). But we will never get intrinsics for
> exclusives, they are
> deemed unsafe for explicit use from C. Instead need to provide inline
> assembler
> that contains the complete exclusives sequence. But in practice it
> seems to work
> with using inline assembler for LDXR and STXR as I do in the lockfree
> code
> linked below.
> 
> > The dynamic one will be costly like,
> Do you think so? Shouldn't this branch be perfectly predictable? Once

Not just branch predication. Right? Corresponding Load and need for
more I cache etc.

I think, for the generic build we can have either run time detection
or stick with LDXR/STXR.

We can give a compile time option for CASP based code so that for given
micro architecture if it optimized it can make use of it.(Something we
can easily expressed on meson build with MIDR value)


> in a while
> it will fall out of the branch history table but doesn't that mean
> the
> application hasn't been executing this code for some time so not
> really
> performance critical?
> 
> > if (hwcaps & HWCAP_ATOMICS) {
> > 	casp
> > } else {
> > 	ldxp
> > 	stxp
> > }
> > 
> > > From an ARM perspective, I want all atomic operations to take
> > > memory
> > > ordering arguments (e.g. acquire, release). Not all usages of
> > > e.g.
> > +1
> > 
> > > atomic compare exchange require sequential consistency (which I
> > > think
> > > what x86 cmpxchg instruction provides). DPDK functions should not
> > > be
> > > modelled after x86 behaviour.
> > > 
> > > Lock-free 128-bit atomics implementations for ARM/AArch64 and
> > > x86-64
> > > are available here:
> > > https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> > > 
> -- 
> Ola Liljedahl, Networking System Architect, Arm
> Phone +46706866373, Skype ola.liljedahl
>