Message ID | 20190118152326.22686-1-gage.eads@intel.com (mailing list archive) |
---|---|
Headers |
Return-Path: <dev-bounces@dpdk.org> X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 08BB32B95; Fri, 18 Jan 2019 16:24:35 +0100 (CET) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by dpdk.org (Postfix) with ESMTP id A5923288C for <dev@dpdk.org>; Fri, 18 Jan 2019 16:24:33 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga102.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 18 Jan 2019 07:24:32 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,491,1539673200"; d="scan'208";a="292610260" Received: from txasoft-yocto.an.intel.com (HELO txasoft-yocto.an.intel.com.) ([10.123.72.192]) by orsmga005.jf.intel.com with ESMTP; 18 Jan 2019 07:24:31 -0800 From: Gage Eads <gage.eads@intel.com> To: dev@dpdk.org Cc: olivier.matz@6wind.com, arybchenko@solarflare.com, bruce.richardson@intel.com, konstantin.ananyev@intel.com, stephen@networkplumber.org Date: Fri, 18 Jan 2019 09:23:21 -0600 Message-Id: <20190118152326.22686-1-gage.eads@intel.com> X-Mailer: git-send-email 2.13.6 In-Reply-To: <20190115235227.14013-1-gage.eads@intel.com> References: <20190115235227.14013-1-gage.eads@intel.com> Subject: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions <dev.dpdk.org> List-Unsubscribe: <https://mails.dpdk.org/options/dev>, <mailto:dev-request@dpdk.org?subject=unsubscribe> List-Archive: <http://mails.dpdk.org/archives/dev/> List-Post: <mailto:dev@dpdk.org> List-Help: <mailto:dev-request@dpdk.org?subject=help> List-Subscribe: <https://mails.dpdk.org/listinfo/dev>, <mailto:dev-request@dpdk.org?subject=subscribe> Errors-To: dev-bounces@dpdk.org Sender: "dev" <dev-bounces@dpdk.org> |
Series | Add non-blocking ring | |
Message
Eads, Gage
Jan. 18, 2019, 3:23 p.m. UTC
For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a non-blocking ring, on top of which a mempool can run.
Crucially, the non-blocking algorithm relies on a 128-bit compare-and-swap, so
it is currently limited to x86_64 machines. This is also an experimental API,
so RING_F_NB users must build with the ALLOW_EXPERIMENTAL_API flag.

The ring uses more compare-and-swap atomic operations than the regular rte ring:
With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations and a
dequeue of n pointers uses 2. This algorithm has worse average-case performance
than the regular rte ring (particularly a highly-contended ring with large bulk
accesses), however:
- For applications with preemptible pthreads, the regular rte ring's worst-case
  performance (i.e. one thread being preempted in the update_tail() critical
  section) is much worse than the non-blocking ring's.
- Software caching can mitigate the average-case performance penalty for
  ring-based algorithms. For example, a non-blocking ring based mempool (a
  likely use case for this ring) with per-thread caching.

The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-of-use,
existing ring enqueue/dequeue functions work with both "regular" and
non-blocking rings.

This patchset also adds non-blocking versions of ring_autotest and
ring_perf_autotest, and a non-blocking ring based mempool.

This patchset makes one API change; a deprecation notice will be posted in a
separate commit.

This patchset depends on the non-blocking stack patchset[1].

[1] http://mails.dpdk.org/archives/dev/2019-January/123653.html

v3:
 - Avoid the ABI break by putting 64-bit head and tail values in the same
   cacheline as struct rte_ring's prod and cons members.
 - Don't attempt to compile rte_atomic128_cmpset without
   ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
 - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone tries
   to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
 - Update the ring mempool to use experimental APIs
 - Clarify that RING_F_NB is only limited to x86_64 currently; ARMv8.1-A builds
   can eventually support it with the CASP instruction.

v2:
 - Merge separate docs commit into patch #5
 - Convert uintptr_t to size_t
 - Add a compile-time check for the size of size_t
 - Fix a space-after-typecast issue
 - Fix an unnecessary-parentheses checkpatch warning
 - Bump librte_ring's library version

Gage Eads (5):
  ring: add 64-bit headtail structure
  ring: add a non-blocking implementation
  test_ring: add non-blocking ring autotest
  test_ring_perf: add non-blocking ring perf test
  mempool/ring: add non-blocking ring handlers

 doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
 drivers/mempool/ring/Makefile                   |   1 +
 drivers/mempool/ring/meson.build                |   2 +
 drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
 lib/librte_eventdev/rte_event_ring.h            |   2 +-
 lib/librte_ring/Makefile                        |   3 +-
 lib/librte_ring/rte_ring.c                      |  72 ++-
 lib/librte_ring/rte_ring.h                      | 574 ++++++++++++++++++++++--
 lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
 lib/librte_ring/rte_ring_version.map            |   7 +
 test/test/test_ring.c                           |  57 ++-
 test/test/test_ring_perf.c                      |  19 +-
 12 files changed, 874 insertions(+), 75 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic_64.h
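To make the cost comparison in the cover letter concrete, the uncontended CAS
counts it quotes (1 + 2n per enqueue of n pointers, 2 per dequeue) can be
written down as a small sketch. The function names below are hypothetical and
are not part of the patchset; they only restate the arithmetic from the text.

```c
#include <stddef.h>

/*
 * Hypothetical helpers (not from the patchset): CAS operations used by the
 * non-blocking ring when there is no contention, as stated in the cover
 * letter.
 */

/* Enqueue of n pointers: 1 + 2n CAS operations. */
static inline size_t nb_enqueue_cas_ops(size_t n)
{
	return 1 + 2 * n;
}

/* Dequeue of n pointers: 2 CAS operations, independent of n. */
static inline size_t nb_dequeue_cas_ops(size_t n)
{
	(void)n; /* the count does not depend on the burst size */
	return 2;
}
```

For a bulk enqueue of 32 pointers this gives 65 CAS operations, which is one
way to see why large bulk accesses are called out as the worst average case.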
Comments
On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> [...]
> v3:
> [...]
>  - Clarify that RING_F_NB is only limited to x86_64 currently;
>    ARMv8.1-A builds can eventually support it with the CASP instruction.

ARMv8.0 should be able to implement a 128-bit atomic compare exchange
operation using LDXP/STXP.

From an ARM perspective, I want all atomic operations to take memory
ordering arguments (e.g. acquire, release). Not all usages of e.g.
atomic compare exchange require sequential consistency (which I think is
what the x86 cmpxchg instruction provides). DPDK functions should not be
modelled after x86 behaviour.

Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
are available here:
https://github.com/ARM-software/progress64/blob/master/src/lockfree.h

> [...]

IMPORTANT NOTICE: The contents of this email and any attachments are
confidential and may also be privileged. If you are not the intended
recipient, please notify the sender immediately and do not disclose the
contents to any other person, use it for any purpose, or store or copy the
information in any medium. Thank you.
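The request above for memory ordering arguments can be illustrated with a
minimal sketch using standard C11 atomics. This is an assumption about what
such an interface could look like, shown on a 64-bit value for portability; it
is neither the DPDK API nor the progress64 code, and the name
cas_u64_explicit is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical wrapper (not a DPDK function): a compare-exchange whose
 * caller chooses the memory ordering for the success and failure paths,
 * instead of always paying for sequential consistency.
 *
 * On failure, *expected is updated with the value currently stored in *dst.
 */
static inline bool
cas_u64_explicit(_Atomic uint64_t *dst, uint64_t *expected, uint64_t desired,
		 memory_order success, memory_order failure)
{
	return atomic_compare_exchange_strong_explicit(dst, expected, desired,
						       success, failure);
}
```

A caller that only publishes data could pass memory_order_release for the
success path and memory_order_relaxed for the failure path, which on a weakly
ordered architecture may avoid barriers that a sequentially consistent
operation would require.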
Sorry about the confidential footer. I tried to remove it using some Exchange
magic but it seems not to work with Evolution. I'll try some other way.

-- Ola

On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> [...]
> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> Sent: Tuesday, January 22, 2019 3:28 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; stephen@networkplumber.org; Richardson, Bruce
> <bruce.richardson@intel.com>; arybchenko@solarflare.com; Ananyev,
> Konstantin <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
>
> On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > [...]
> > v3:
> > [...]
> >  - Clarify that RING_F_NB is only limited to x86_64 currently;
> >    ARMv8.1-A builds can eventually support it with the CASP instruction.
> ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> operation using LDXP/STXP.

I see, I wasn't aware these instructions were available.

>
> From an ARM perspective, I want all atomic operations to take memory ordering
> arguments (e.g. acquire, release). Not all usages of e.g.
> atomic compare exchange require sequential consistency (which I think is what
> the x86 cmpxchg instruction provides). DPDK functions should not be modelled
> after x86 behaviour.
>
> Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64 are
> available here:
> https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
>

Sure, I'll address this in the next patchset.

Thanks,
Gage
On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > v3:
> >  - Avoid the ABI break by putting 64-bit head and tail values in the
> >    same cacheline as struct rte_ring's prod and cons members.
> >  - Don't attempt to compile rte_atomic128_cmpset without
> >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> >    libraries.
> >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> >    someone tries to use RING_F_NB without the ALLOW_EXPERIMENTAL_API
> >    flag.
> >  - Update the ring mempool to use experimental APIs
> >  - Clarify that RING_F_NB is only limited to x86_64 currently;
> >    ARMv8.1-A builds can eventually support it with the CASP
> >    instruction.
> ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> operation using LDXP/STXP.

Just wondering what the performance difference would be between CASP and
LDXP/STXP on an LSE-supported machine?

I think we cannot detect the presence of LSE support at compile time.
Right?

The dynamic one will be costly, like:

    if (hwcaps & HWCAP_ATOMICS) {
        casp
    } else {
        ldxp
        stxp
    }

> From an ARM perspective, I want all atomic operations to take memory
> ordering arguments (e.g. acquire, release). Not all usages of e.g.

+1

> atomic compare exchange require sequential consistency (which I think is
> what the x86 cmpxchg instruction provides). DPDK functions should not be
> modelled after x86 behaviour.
>
> Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
> are available here:
> https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
>
On Wed, 2019-01-23 at 16:02 +0000, Jerin Jacob Kollanukkaran wrote:
> On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> > On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > > v3:
> > >  - Avoid the ABI break by putting 64-bit head and tail values in
> > >    the same cacheline as struct rte_ring's prod and cons members.
> > >  - Don't attempt to compile rte_atomic128_cmpset without
> > >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > >    libraries.
> > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > >    someone tries to use RING_F_NB without the
> > >    ALLOW_EXPERIMENTAL_API flag.
> > >  - Update the ring mempool to use experimental APIs
> > >  - Clarify that RING_F_NB is only limited to x86_64 currently;
> > >    ARMv8.1-A builds can eventually support it with the CASP
> > >    instruction.
> > ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> > operation using LDXP/STXP.
> Just wondering what the performance difference would be between CASP and
> LDXP/STXP on an LSE-supported machine?

I think that is up to the microarchitecture. But one of the ideas behind
introducing the LSE atomics was that they should be "better" than the
equivalent code using exclusives. I think non-conditional LDxxx and STxxx
atomics could be better than using exclusives while conditional atomics
(CAS, CASP) might not be so different (the reason has to do with cache
coherency: a core can speculatively snoop-unique the cache line which is
targeted by an atomic instruction, but to what extent that provides a
benefit could depend on whether the atomic actually performs a store or
not).

> I think we cannot detect the presence of LSE support at compile time.
> Right?

Unfortunately, AFAIK GCC doesn't notify the source code that it is
targeting v8.1+ with LSE support. If there were intrinsics for (certain)
LSE instructions (e.g. those not generated by the compiler, e.g. STxxx and
CASP), we could use some corresponding preprocessor define to detect the
presence of such intrinsics (they exist for other intrinsics, e.g.
__ARM_FEATURE_QRDMX for SQRDMLAH/SQRDMLSH instructions and corresponding
intrinsics).

I have tried to interest the Arm GCC developers in this but have not yet
succeeded. Perhaps if we have more use cases where atomics intrinsics would
be useful, we could convince them to add such intrinsics to the ACLE (ARM C
Language Extensions). But we will never get intrinsics for exclusives; they
are deemed unsafe for explicit use from C. Instead, we need to provide
inline assembler that contains the complete exclusives sequence. But in
practice it seems to work with using inline assembler for LDXR and STXR as I
do in the lockfree code linked below.

> The dynamic one will be costly, like:

Do you think so? Shouldn't this branch be perfectly predictable? Once in a
while it will fall out of the branch history table but doesn't that mean the
application hasn't been executing this code for some time so not really
performance critical?

>     if (hwcaps & HWCAP_ATOMICS) {
>         casp
>     } else {
>         ldxp
>         stxp
>     }
>
> > From an ARM perspective, I want all atomic operations to take memory
> > ordering arguments (e.g. acquire, release). Not all usages of e.g.
> +1
>
> > atomic compare exchange require sequential consistency (which I think is
> > what the x86 cmpxchg instruction provides). DPDK functions should not be
> > modelled after x86 behaviour.
> >
> > Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
> > are available here:
> > https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> >
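The dynamic check discussed above can be sketched as follows. This is only an
illustration under stated assumptions: it relies on Linux's getauxval() and
the HWCAP_ATOMICS bit being exposed through <sys/auxv.h> on AArch64, reports
"no LSE" on every other platform, and the enum and function names are
hypothetical. The actual CASP and LDXP/STXP sequences are left out because
they need inline assembler.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>	/* getauxval(), AT_HWCAP; HWCAP_* on AArch64 glibc */
#endif

/* Hypothetical selector for the two 128-bit CAS implementations. */
typedef enum {
	CAS128_IMPL_LDXP_STXP,	/* ARMv8.0 exclusives */
	CAS128_IMPL_CASP	/* ARMv8.1 LSE */
} cas128_impl_t;

/* True when the kernel reports LSE atomics; false where it cannot be
 * determined (including all non-AArch64 builds). */
static bool cpu_has_lse_atomics(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ATOMICS)
	return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
	return false;
#endif
}

/* Pick the implementation once; callers can cache the result so the hot
 * path only tests an already-computed value. */
static cas128_impl_t select_cas128_impl(bool has_lse)
{
	return has_lse ? CAS128_IMPL_CASP : CAS128_IMPL_LDXP_STXP;
}
```

Calling select_cas128_impl(cpu_has_lse_atomics()) once at initialization and
storing the result keeps the per-operation cost to a single branch on a value
that never changes, which is the case argued above to be well predicted.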
Hi Gage,
Thank you for this patch. Arm (Ola Liljedahl) had worked on a non-blocking
ring algorithm. We were planning to add it to DPDK at some point this year. I
am wondering if you would be open to taking a look at the algorithm and
collaborating?

I am yet to fully understand both the algorithms. But Ola has reviewed your
patch and can provide a quick overview of the differences here.

If you agree, we can send an RFC patch. You can review that and do performance
benchmarking on your platforms. I can also benchmark your patch (maybe once
you fix the issue identified in the __rte_ring_do_nb_enqueue_mp function?) on
Arm platforms. Maybe we can end up with a better combined algorithm.

Hi Thomas/Bruce,
Please let me know if this is ok and if there is a better way to do this.

Thank you,
Honnappa

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> Sent: Friday, January 18, 2019 9:23 AM
> To: dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> stephen@networkplumber.org
> Subject: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
>
> [...]
Hi Honnappa,

Works for me -- I'm in favor of the best performing implementation,
whoever provides it.

To allow an apples-to-apples comparison, I suggest Ola's/ARM's
implementation be made to fit into the rte_ring API with an associated
mempool handler. That'll allow us to use the existing ring and mempool
performance tests as well. Feel free to use code from this patchset for
the rte_ring integration, if that helps, of course.

I expect to have v4 available within the next week.

Thanks,
Gage

> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, January 24, 2019 11:21 PM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> <Song.Zhu@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
>
> Hi Gage,
> Thank you for this patch. Arm (Ola Liljedahl) had worked on a
> non-blocking ring algorithm. We were planning to add it to DPDK at some
> point this year. I am wondering if you would be open to taking a look at
> the algorithm and collaborating?
>
> I am yet to fully understand both the algorithms. But Ola has reviewed
> your patch and can provide a quick overview of the differences here.
>
> If you agree, we can send an RFC patch. You can review that and do
> performance benchmarking on your platforms. I can also benchmark your
> patch (maybe once you fix the issue identified in the
> __rte_ring_do_nb_enqueue_mp function?) on Arm platforms. Maybe we can
> end up with a better combined algorithm.
>
> Hi Thomas/Bruce,
> Please let me know if this is ok and if there is a better way to do
> this.
>
> Thank you,
> Honnappa
>
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > Sent: Friday, January 18, 2019 9:23 AM
> > To: dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> > stephen@networkplumber.org
> > Subject: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> >
> > For some users, the rte ring's "non-preemptive" constraint is not
> > acceptable; for example, if the application uses a mixture of pinned
> > high-priority threads and multiplexed low-priority threads that share
> > a mempool.
> >
> > This patchset introduces a non-blocking ring, on top of which a
> > mempool can run. Crucially, the non-blocking algorithm relies on a
> > 128-bit compare-and-swap, so it is currently limited to x86_64
> > machines. This is also an experimental API, so RING_F_NB users must
> > build with the ALLOW_EXPERIMENTAL_API flag.
> >
> > The ring uses more compare-and-swap atomic operations than the regular
> > rte ring: With no contention, an enqueue of n pointers uses (1 + 2n)
> > CAS operations and a dequeue of n pointers uses 2. This algorithm has
> > worse average-case performance than the regular rte ring (particularly
> > a highly-contended ring with large bulk accesses), however:
> > - For applications with preemptible pthreads, the regular rte ring's
> >   worst-case performance (i.e. one thread being preempted in the
> >   update_tail() critical section) is much worse than the non-blocking
> >   ring's.
> > - Software caching can mitigate the average case performance for
> >   ring-based algorithms. For example, a non-blocking ring based
> >   mempool (a likely use case for this ring) with per-thread caching.
> >
> > The non-blocking ring is enabled via a new flag, RING_F_NB. For
> > ease-of-use, existing ring enqueue/dequeue functions work with both
> > "regular" and non-blocking rings.
> >
> > This patchset also adds non-blocking versions of ring_autotest and
> > ring_perf_autotest, and a non-blocking ring based mempool.
> >
> > This patchset makes one API change; a deprecation notice will be
> > posted in a separate commit.
> >
> > This patchset depends on the non-blocking stack patchset[1].
> >
> > [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> >
> > v3:
> >  - Avoid the ABI break by putting 64-bit head and tail values in the
> >    same cacheline as struct rte_ring's prod and cons members.
> >  - Don't attempt to compile rte_atomic128_cmpset without
> >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> >    libraries.
> >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> >    someone tries to use RING_F_NB without the ALLOW_EXPERIMENTAL_API
> >    flag.
> >  - Update the ring mempool to use experimental APIs
> >  - Clarify that RING_F_NB is only limited to x86_64 currently;
> >    ARMv8.1-A builds can eventually support it with the CASP
> >    instruction.
> >
> > v2:
> >  - Merge separate docs commit into patch #5
> >  - Convert uintptr_t to size_t
> >  - Add a compile-time check for the size of size_t
> >  - Fix a space-after-typecast issue
> >  - Fix an unnecessary-parentheses checkpatch warning
> >  - Bump librte_ring's library version
> >
> > Gage Eads (5):
> >   ring: add 64-bit headtail structure
> >   ring: add a non-blocking implementation
> >   test_ring: add non-blocking ring autotest
> >   test_ring_perf: add non-blocking ring perf test
> >   mempool/ring: add non-blocking ring handlers
> >
> >  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
> >  drivers/mempool/ring/Makefile                   |   1 +
> >  drivers/mempool/ring/meson.build                |   2 +
> >  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
> >  lib/librte_eventdev/rte_event_ring.h            |   2 +-
> >  lib/librte_ring/Makefile                        |   3 +-
> >  lib/librte_ring/rte_ring.c                      |  72 ++-
> >  lib/librte_ring/rte_ring.h                      | 574 ++++++++++++++++++++++--
> >  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
> >  lib/librte_ring/rte_ring_version.map            |   7 +
> >  test/test/test_ring.c                           |  57 ++-
> >  test/test/test_ring_perf.c                      |  19 +-
> >  12 files changed, 874 insertions(+), 75 deletions(-)
> >  create mode 100644 lib/librte_ring/rte_ring_generic_64.h
> >
> > --
> > 2.13.6
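
As a side note on why the cover letter calls for a wide compare-and-swap:
the sketch below illustrates the general technique of pairing a value with
a modification counter, so that a CAS issued from a stale snapshot fails
even if the value itself has been restored in the meantime. This is not
code from the patchset. The slot layout and the names tagged_slot_t,
slot_pack, slot_update and slot_demo are hypothetical, and the example is
scaled down to a 32-bit payload plus a 32-bit counter in one 64-bit word so
that it builds with plain C11 atomics on any platform. A slot holding a
64-bit pointer next to such a counter would need a 128-bit CAS instead,
which is assumed here to be the motivation in the patchset.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slot layout (not from the patchset): the low 32 bits hold
 * the payload, the high 32 bits hold a modification counter. */
typedef _Atomic uint64_t tagged_slot_t;

static inline uint64_t slot_pack(uint32_t value, uint32_t count)
{
	return ((uint64_t)count << 32) | value;
}

static inline uint32_t slot_value(uint64_t word) { return (uint32_t)word; }
static inline uint32_t slot_count(uint64_t word) { return (uint32_t)(word >> 32); }

/* Try to store new_value, given a snapshot 'seen' read earlier. The CAS
 * succeeds only if neither the value nor the counter has changed since. */
static bool slot_update(tagged_slot_t *slot, uint64_t seen, uint32_t new_value)
{
	uint64_t desired = slot_pack(new_value, slot_count(seen) + 1);

	return atomic_compare_exchange_strong_explicit(slot, &seen, desired,
			memory_order_acq_rel, memory_order_acquire);
}

/* Usage: a stale snapshot is rejected even though the value is back to 7. */
static void slot_demo(void)
{
	tagged_slot_t slot;
	uint64_t stale;

	atomic_init(&slot, slot_pack(7, 0));
	stale = atomic_load(&slot);

	assert(slot_update(&slot, stale, 9));              /* 7 -> 9 */
	assert(slot_update(&slot, atomic_load(&slot), 7)); /* 9 -> 7 */
	assert(!slot_update(&slot, stale, 5));             /* counter differs */
	assert(slot_value(atomic_load(&slot)) == 7);
}
```

The explicit memory-order arguments are a choice made for this sketch
only; the ordering that a real ring needs depends on its algorithm.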
> -----Original Message-----
> From: Eads, Gage
> Sent: Friday, January 25, 2019 11:43 AM
> To: 'Honnappa Nagarahalli' <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> <Song.Zhu@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
>
> Hi Honnappa,
>
> Works for me -- I'm in favor of the best performing implementation,
> whoever provides it.
>
> To allow an apples-to-apples comparison, I suggest Ola's/ARM's
> implementation be made to fit into the rte_ring API with an associated
> mempool handler. That'll allow us to use the existing ring and mempool
> performance tests as well. Feel free to use code from this patchset for
> the rte_ring integration, if that helps, of course.
>

But also, if Ola/ARM's algorithm is sufficiently similar to this one,
it's probably better to tweak this patchset's enqueue and dequeue
functions with any improvements you can identify rather than creating an
entirely separate implementation.

> I expect to have v4 available within the next week.
On Fri, 2019-01-25 at 17:56 +0000, Eads, Gage wrote:
> > -----Original Message-----
> > From: Eads, Gage
> > Sent: Friday, January 25, 2019 11:43 AM
> > To: 'Honnappa Nagarahalli' <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> > <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> > <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> > <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> > <Song.Zhu@arm.com>; nd <nd@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> >
> > Hi Honnappa,
> >
> > Works for me -- I'm in favor of the best performing implementation,
> > whoever provides it.
> >
> > To allow an apples-to-apples comparison, I suggest Ola's/ARM's
> > implementation be made to fit into the rte_ring API with an associated
> > mempool handler. That'll allow us to use the existing ring and mempool
> > performance tests as well. Feel free to use code from this patchset
> > for the rte_ring integration, if that helps, of course.
> >
> But also, if Ola/ARM's algorithm is sufficiently similar to this one,
> it's probably better to tweak this patchset's enqueue and dequeue
> functions with any improvements you can identify rather than creating an
> entirely separate implementation.

There are strong similarities. But my implementation is separate from
rte_ring (whose code is a mess), which also freed me from any
interoperability with the rte_ring code and data structure (with two
pairs of head+tail, which is unnecessary for the lock-free ring buffer).
My design and implementation are here:
https://github.com/ARM-software/progress64/blob/master/src/p64_lfring.c

I have a DPDK version in flight. Merging the relevant changes into your
patch makes sense. There are some differences we will have to agree on.
On Wed, 2019-01-23 at 16:29 +0000, Ola Liljedahl wrote:
> On Wed, 2019-01-23 at 16:02 +0000, Jerin Jacob Kollanukkaran wrote:
> > On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> > > On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > > > v3:
> > > >  - Avoid the ABI break by putting 64-bit head and tail values in
> > > >    the same cacheline as struct rte_ring's prod and cons members.
> > > >  - Don't attempt to compile rte_atomic128_cmpset without
> > > >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > > >    libraries.
> > > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > > >    someone tries to use RING_F_NB without the
> > > >    ALLOW_EXPERIMENTAL_API flag.
> > > >  - Update the ring mempool to use experimental APIs
> > > >  - Clarify that RING_F_NB is only limited to x86_64 currently;
> > > >    ARMv8.1-A builds can eventually support it with the CASP
> > > >    instruction.
> > > ARMv8.0 should be able to implement a 128-bit atomic compare
> > > exchange operation using LDXP/STXP.
> > Just wondering what the performance difference would be between CASP
> > and LDXP/STXP on an LSE-supported machine?
> I think that is up to the microarchitecture. But one of the ideas behind

Yes. This is where things are getting a little messy: it is hard to have
generic code when a lot of behaviour depends on the microarchitecture or
is IMPLEMENTATION DEFINED in the Arm spec. At least, I am dealing with
three different microarchitectures now, with a lot of differences between
them. Including the Arm cores and Qualcomm cores, there could be more
than six or so different microarchitectures.

> introducing the LSE atomics was that they should be "better" than the
> equivalent code using exclusives. I think non-conditional LDxxx and
> STxxx atomics could be better than using exclusives while conditional
> atomics (CAS, CASP) might not be so different (the reason has to do with
> cache coherency: a core can speculatively snoop-unique the cache line
> which is targeted by an atomic instruction, but to what extent that
> provides a benefit could depend on whether the atomic actually performs
> a store or not).
>
> > I think we cannot detect the presence of LSE support at compile time.
> > Right?
> Unfortunately, AFAIK GCC doesn't notify the source code that it is
> targeting v8.1+ with LSE support. If there were intrinsics for (certain)
> LSE instructions (e.g. those not generated by the compiler, e.g. STxxx
> and CASP), we could use some corresponding preprocessor define to detect
> the presence of such intrinsics (they exist for other intrinsics, e.g.
> __ARM_FEATURE_QRDMX for SQRDMLAH/SQRDMLSH instructions and corresponding
> intrinsics).
>
> I have tried to interest the Arm GCC developers in this but have not yet
> succeeded. Perhaps if we have more use cases where atomics intrinsics
> would be useful, we could convince them to add such intrinsics to the
> ACLE (ARM C Language Extensions). But we will never get intrinsics for
> exclusives; they are deemed unsafe for explicit use from C. Instead one
> needs to provide inline assembler that contains the complete exclusives
> sequence. But in practice it seems to work with using inline assembler
> for LDXR and STXR as I do in the lockfree code linked below.
>
> > The dynamic one will be costly like,
> Do you think so? Shouldn't this branch be perfectly predictable? Once

Not just branch prediction, right? There is also the corresponding load
and the need for more I-cache, etc. I think for the generic build we can
either use run-time detection or stick with LDXR/STXR. We can provide a
compile-time option for CASP-based code, so that a given
microarchitecture that is optimized for it can make use of it (something
we can easily express in the meson build using the MIDR value).

> in a while it will fall out of the branch history table but doesn't that
> mean the application hasn't been executing this code for some time so
> not really performance critical?
>
> > if (hwcaps & HWCAP_ATOMICS) {
> > 	casp
> > } else {
> > 	ldxp
> > 	stxp
> > }
> >
> > > From an ARM perspective, I want all atomic operations to take
> > > memory ordering arguments (e.g. acquire, release). Not all usages of
> > > e.g.
> > +1
> >
> > > atomic compare exchange require sequential consistency (which I
> > > think is what x86 cmpxchg instruction provides). DPDK functions
> > > should not be modelled after x86 behaviour.
> > >
> > > Lock-free 128-bit atomics implementations for ARM/AArch64 and
> > > x86-64 are available here:
> > > https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> > >
> --
> Ola Liljedahl, Networking System Architect, Arm
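
For reference, the run-time check sketched in the quoted pseudocode could
look roughly like the following on Linux. This is only an illustration
under stated assumptions, not DPDK code: getauxval(), AT_HWCAP and
HWCAP_ATOMICS are the real Linux/AArch64 interfaces, while the names
cpu_has_lse, cas128_impl and select_cas128_impl are hypothetical. On any
other platform the sketch simply reports that LSE is absent. Making the
choice once (for example at init time) instead of on every operation is
one possible way to limit the per-call cost discussed in the thread.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>  /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h> /* HWCAP_ATOMICS */
#endif

/* Hypothetical names for the two candidate 128-bit CAS implementations. */
enum cas128_impl {
	CAS128_LLSC, /* LDXP/STXP exclusive pair, available on ARMv8.0 */
	CAS128_LSE   /* CASP, requires the ARMv8.1 LSE atomics */
};

/* Ask the kernel whether the CPU advertises LSE atomics. Returns false
 * on platforms where the hwcap interface is not available. */
static bool cpu_has_lse(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ATOMICS)
	return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
	return false;
#endif
}

/* Pick an implementation from a capability flag. */
static enum cas128_impl select_cas128_impl(bool has_lse)
{
	return has_lse ? CAS128_LSE : CAS128_LLSC;
}
```

A caller would evaluate select_cas128_impl(cpu_has_lse()) once and keep
the result, for instance in a function pointer or a static variable.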