mbox series

[v7,0/6] Add lock-free ring and mempool handler

Message ID 20190318213555.17345-1-gage.eads@intel.com (mailing list archive)
Headers
Series Add lock-free ring and mempool handler |

Message

Eads, Gage March 18, 2019, 9:35 p.m. UTC
  For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a lock-free ring and a mempool based on it. The
lock-free algorithm relies on a double-pointer compare-and-swap, so for 64-bit
architectures it is currently limited to x86_64.

The ring uses more compare-and-swap atomic operations than the regular rte ring:
With no contention, an enqueue of n pointers uses (1 + n) CAS operations and a
dequeue of n pointers uses 1. This algorithm has worse average-case performance
than the regular rte ring (particularly a highly-contended ring with large bulk
accesses), however:
- For applications with preemptible pthreads, the regular rte ring's worst-case
  performance (i.e. one thread being preempted in the update_tail() critical
  section) is much worse than the lock-free ring's.
- Software caching can mitigate the average case performance for ring-based
  algorithms. For example, a lock-free ring based mempool (a likely use case
  for this ring) with per-thread caching.

The lock-free ring is enabled via a new flag, RING_F_LF. For ease-of-use,
existing ring enqueue/dequeue functions work with both standard and lock-free
rings. This is also an experimental API, so RING_F_LF users must build with the
ALLOW_EXPERIMENTAL_API flag.

This patchset also adds lock-free versions of ring_autotest and
ring_perf_autotest, and a lock-free ring based mempool.

This patchset makes one API change; a deprecation notice was posted in a
separate commit[1].

This patchset depends on the 128-bit compare-and-set patch[2].

[1] http://mails.dpdk.org/archives/dev/2019-February/124321.html
[2] http://mails.dpdk.org/archives/dev/2019-March/125751.html

v7:
- Added ARM copyright to rte_ring_generic.h and rte_ring_c11_mem.h, since the
  lock-free algorithm is based on ARM's lfring (see v5 notes)
- Rename __rte_ring_reload_tail() -> __rte_ring_lf_load_tail()
- Remove the unused return value from __rte_ring_lf_load_tail()
- Rename 'prev_tail' to 'next_tail' in the multi-producer lock-free enqueue

v6:
- Rebase patchset onto master (test/test/ -> app/test/)

v5:
 - Incorporated lfring's enqueue and dequeue logic from
   http://mails.dpdk.org/archives/dev/2019-January/124242.html
 - Renamed non-blocking -> lock-free and NB -> LF to align with a similar
   change in the lock-free stack patchset:
   http://mails.dpdk.org/archives/dev/2019-March/125797.html
 - Added support for 32-bit architectures by using the full 32b of the
   modification counter and requiring LF rings on these architectures to be at
   least 1024 entries.
 - Updated to the latest rte_atomic128_cmp_exchange() interface.
 - Added ring start marker to struct rte_ring

v4:
 - Split out nb_enqueue and nb_dequeue functions in generic and C11 versions,
   with the necessary memory ordering behavior for weakly consistent machines.
 - Convert size_t variables (from v2) to uint64_t and no-longer-applicable
   comment about variably-sized ring indexes.
 - Fix bug in nb_enqueue_mp that the breaks the non-blocking guarantee.
 - Split the ring_ptr cast into two lines.
 - Change the dependent patchset from the non-blocking stack patch series
   to one only containing the 128b CAS commit

v3:
 - Avoid the ABI break by putting 64-bit head and tail values in the same
   cacheline as struct rte_ring's prod and cons members.
 - Don't attempt to compile rte_atomic128_cmpset without
   ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
 - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone tries
   to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
 - Update the ring mempool to use experimental APIs
 - Clarify that RINB_F_NB is only limited to x86_64 currently; e.g. ARMv8 has the
   ISA support for 128-bit CAS to eventually support it.

v2:
 - Merge separate docs commit into patch #5
 - Convert uintptr_t to size_t
 - Add a compile-time check for the size of size_t
 - Fix a space-after-typecast issue
 - Fix an unnecessary-parentheses checkpatch warning
 - Bump librte_ring's library version

Gage Eads (6):
  ring: add a pointer-width headtail structure
  ring: add a ring start marker
  ring: add a lock-free implementation
  test_ring: add lock-free ring autotest
  test_ring_perf: add lock-free ring perf test
  mempool/ring: add lock-free ring handlers

 app/test/test_ring.c                            |  61 +--
 app/test/test_ring_perf.c                       |  19 +-
 doc/guides/prog_guide/env_abstraction_layer.rst |  10 +
 drivers/mempool/ring/Makefile                   |   1 +
 drivers/mempool/ring/meson.build                |   2 +
 drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
 lib/librte_ring/rte_ring.c                      |  92 ++++-
 lib/librte_ring/rte_ring.h                      | 334 ++++++++++++++--
 lib/librte_ring/rte_ring_c11_mem.h              | 501 ++++++++++++++++++++++++
 lib/librte_ring/rte_ring_generic.h              | 485 ++++++++++++++++++++++-
 lib/librte_ring/rte_ring_version.map            |   7 +
 11 files changed, 1492 insertions(+), 78 deletions(-)
  

Comments

Eads, Gage March 18, 2019, 9:49 p.m. UTC | #1
[Snip]

> 
> This patchset makes one API change; a deprecation notice was posted in a
> separate commit[1].
> 
> This patchset depends on the 128-bit compare-and-set patch[2].
> 
> [1] http://mails.dpdk.org/archives/dev/2019-February/124321.html
> [2] http://mails.dpdk.org/archives/dev/2019-March/125751.html
> 

Hi all,

Friendly reminder that in order to get this feature into 19.08 (assuming folks also want that :)), the API deprecation notice needs to be merged into 19.05.

Thanks,
Gage
  
Stephen Hemminger March 19, 2019, 3:51 p.m. UTC | #2
On Mon, 18 Mar 2019 21:49:44 +0000
"Eads, Gage" <gage.eads@intel.com> wrote:

> Hi all,
> 
> Friendly reminder that in order to get this feature into 19.08 (assuming folks also want that :)), the API deprecation notice needs to be merged into 19.05.
> 
> Thanks,
> Gage

Given the recent API/ABI stability discussion, this is the kind of patch that really
needs to be examined and justified.
  
Eads, Gage April 1, 2019, 7:23 p.m. UTC | #3
> 
> On Mon, 18 Mar 2019 21:49:44 +0000
> "Eads, Gage" <gage.eads@intel.com> wrote:
> 
> > Hi all,
> >
> > Friendly reminder that in order to get this feature into 19.08 (assuming
> folks also want that :)), the API deprecation notice needs to be merged into
> 19.05.
> >
> > Thanks,
> > Gage
> 
> Given the recent API/ABI stability discussion, this is the kind of patch that
> really needs to be examined and justified.

Can you point me to the discussion (assuming it was on the ML)? I'm aware of Ferruh's changes to the docs, but not the discussion you referenced.

The lock-free ring functionality itself is a valuable addition to DPDK, primarily because it lacks the standard ring's non-preemptive constraint. The non-preemptive constraint is important in an application with both high priority, performance-sensitive data-plane components and low-priority control-plane components. This was important enough to warrant further clarification recently[1], and has been a discussion topic for some time[2][3].

The modified API, rte_ring_get_memsize(), was added to allow users to initialize rings in separately allocated memory. This function isn't called in DPDK's examples/apps/drivers, and a quick google search didn't turn up any open source projects that call the function, so I suspect that a majority of ring code uses rte_ring_create() instead of rte_ring_get_memsize() + rte_ring_init(). So I suspect this interface change will affect a small percentage of DPDK users.

As a straw-man counter-proposal, we could instead introduce a lock-free specific function rte_lf_ring_get_memsize() that lock-free ring users would call instead of rte_ring_get_memsize(). This would avoid the API modification, but:
- It's awkward to have one rte_lf_ring_* function and otherwise access the lock-free ring through rte_ring_* functions.
- It's also easy to envision a user incorrectly calling rte_ring_get_memsize() rather than rte_lf_ring_get_memsize() for a lock-free ring, since otherwise rte_ring_* functions are used. DPDK would have no way to detect that the allocated memory is too small, and if such a bug occurs it would manifest itself as memory corruption.
- Changing rte_ring_get_memsize() to take a flags argument may be the better long-term design, if another flag is introduced that likewise uses a different ring size.

Another approach is to break out the lock-free ring into a fully separate API. One of the goals of my patchset was to allow applications to switch to lock-free rings with minimal code change; I think the value of the lock-free ring warrants such an approach.

(Unfortunately without hard numbers on the cost or benefit of such a change, these arguments are fairly subjective.)

Thanks,
Gage

[1] https://patches.dpdk.org/patch/43122/
[2] http://mails.dpdk.org/archives/dev/2013-November/000714.html
[3] http://mails.dpdk.org/archives/dev/2014-January/001163.html
  
Ola Liljedahl April 2, 2019, 10:16 a.m. UTC | #4
On Mon, 2019-04-01 at 19:23 +0000, Eads, Gage wrote:
> > 
> > 
> > On Mon, 18 Mar 2019 21:49:44 +0000
> > "Eads, Gage" <gage.eads@intel.com> wrote:
> > 
> > > 
> > > Hi all,
> > > 
> > > Friendly reminder that in order to get this feature into 19.08 (assuming
> > folks also want that :)), the API deprecation notice needs to be merged into
> > 19.05.
> > > 
> > > 
> > > Thanks,
> > > Gage
> > Given the recent API/ABI stability discussion, this is the kind of patch
> > that
> > really needs to be examined and justified.
> Can you point me to the discussion (assuming it was on the ML)? I'm aware of
> Ferruh's changes to the docs, but not the discussion you referenced.
> 
> The lock-free ring functionality itself is a valuable addition to DPDK,
> primarily because it lacks the standard ring's non-preemptive constraint. The
> non-preemptive constraint is important in an application with both high
> priority, performance-sensitive data-plane components and low-priority
> control-plane components. This was important enough to warrant further
> clarification recently[1], and has been a discussion topic for some
> time[2][3].
> 
> The modified API, rte_ring_get_memsize(), was added to allow users to
> initialize rings in separately allocated memory. This function isn't called in
> DPDK's examples/apps/drivers, and a quick google search didn't turn up any
> open source projects that call the function, so I suspect that a majority of
> ring code uses rte_ring_create() instead of rte_ring_get_memsize() +
> rte_ring_init(). So I suspect this interface change will affect a small
> percentage of DPDK users.
> 
> As a straw-man counter-proposal, we could instead introduce a lock-free
> specific function rte_lf_ring_get_memsize() that lock-free ring users would
> call instead of rte_ring_get_memsize(). This would avoid the API modification,
> but:
> - It's awkward to have one rte_lf_ring_* function and otherwise access the
> lock-free ring through rte_ring_* functions.
> - It's also easy to envision a user incorrectly calling rte_ring_get_memsize()
> rather than rte_lf_ring_get_memsize() for a lock-free ring, since otherwise
> rte_ring_* functions are used. DPDK would have no way to detect that the
> allocated memory is too small, and if such a bug occurs it would manifest
> itself as memory corruption.
> - Changing rte_ring_get_memsize() to take a flags argument may be the better
> long-term design, if another flag is introduced that likewise uses a different
> ring size.
> 
> Another approach is to break out the lock-free ring into a fully separate API.
As in the RFC I posted: http://patches.dpdk.org/patch/50095/
Cleaner API, simpler implementation.

 One of the goals of my patchset was to allow applications to switch to lock-free rings with minimal code change; I think the value of the lock-free ring warrants such an approach.

A noble goal but personally I think DPDK API's and implementations are getting
more and more messy and thus difficult to use and difficult to maintain.
Is it so much worse to have separate but structurally equivalent API's?
Yes, blocking vs non-blocking can no longer be selected at runtime (startup
time), I think this is the biggest limitation.

-- Ola


(Unfortunately without hard numbers on the cost or benefit of such a change, these arguments are fairly subjective.)

Thanks,
Gage

[1] https://patches.dpdk.org/patch/43122/
[2] http://mails.dpdk.org/archives/dev/2013-November/000714.html
[3] http://mails.dpdk.org/archives/dev/2014-January/001163.html
  
Eads, Gage April 4, 2019, 10:28 p.m. UTC | #5
> On Mon, 2019-04-01 at 19:23 +0000, Eads, Gage wrote:
> > >
> > >
> > > On Mon, 18 Mar 2019 21:49:44 +0000
> > > "Eads, Gage" <gage.eads@intel.com> wrote:
> > >
> > > >
> > > > Hi all,
> > > >
> > > > Friendly reminder that in order to get this feature into 19.08
> > > > (assuming
> > > folks also want that :)), the API deprecation notice needs to be
> > > merged into 19.05.
> > > >
> > > >
> > > > Thanks,
> > > > Gage
> > > Given the recent API/ABI stability discussion, this is the kind of
> > > patch that really needs to be examined and justified.
> > Can you point me to the discussion (assuming it was on the ML)? I'm
> > aware of Ferruh's changes to the docs, but not the discussion you
> referenced.
> >
> > The lock-free ring functionality itself is a valuable addition to
> > DPDK, primarily because it lacks the standard ring's non-preemptive
> > constraint. The non-preemptive constraint is important in an
> > application with both high priority, performance-sensitive data-plane
> > components and low-priority control-plane components. This was
> > important enough to warrant further clarification recently[1], and has
> > been a discussion topic for some time[2][3].
> >
> > The modified API, rte_ring_get_memsize(), was added to allow users to
> > initialize rings in separately allocated memory. This function isn't
> > called in DPDK's examples/apps/drivers, and a quick google search
> > didn't turn up any open source projects that call the function, so I
> > suspect that a majority of ring code uses rte_ring_create() instead of
> > rte_ring_get_memsize() + rte_ring_init(). So I suspect this interface
> > change will affect a small percentage of DPDK users.
> >
> > As a straw-man counter-proposal, we could instead introduce a
> > lock-free specific function rte_lf_ring_get_memsize() that lock-free
> > ring users would call instead of rte_ring_get_memsize(). This would
> > avoid the API modification,
> > but:
> > - It's awkward to have one rte_lf_ring_* function and otherwise access
> > the lock-free ring through rte_ring_* functions.
> > - It's also easy to envision a user incorrectly calling
> > rte_ring_get_memsize() rather than rte_lf_ring_get_memsize() for a
> > lock-free ring, since otherwise
> > rte_ring_* functions are used. DPDK would have no way to detect that
> > the allocated memory is too small, and if such a bug occurs it would
> > manifest itself as memory corruption.
> > - Changing rte_ring_get_memsize() to take a flags argument may be the
> > better long-term design, if another flag is introduced that likewise
> > uses a different ring size.
> >
> > Another approach is to break out the lock-free ring into a fully separate
> API.
> As in the RFC I posted: http://patches.dpdk.org/patch/50095/
> Cleaner API, simpler implementation.
> 
> >  One of the goals of my patchset was to allow applications to switch to lock-
> > free rings with minimal code change; I think the value of the lock-free ring
> > warrants such an approach.
> 
> A noble goal but personally I think DPDK API's and implementations are
> getting more and more messy and thus difficult to use and difficult to
> maintain.
> Is it so much worse to have separate but structurally equivalent API's?

No -- I think both are valid options, with their own tradeoffs.

> Yes, blocking vs non-blocking can no longer be selected at runtime (startup
> time), I think this is the biggest limitation.

Run-time selection with the LF flag means we can easily re-use a large amount of pre-existing ring code -- e.g. the way the ring tests are re-purposed for lock-free rings in this patchset.

The implementation may not be as easily maintained, but we don't have to maintain LF-specific tests/benchmarks/etc.

Maintainers/tech leads, do y'all have a preference?

Thanks,
Gage