[RFC,0/6] New sync modes for ring

Message ID: 20200224113515.1744-1-konstantin.ananyev@intel.com (mailing list archive)

Message

Ananyev, Konstantin Feb. 24, 2020, 11:35 a.m. UTC
  Upfront note: this RFC is not a complete patch.
It introduces an ABI breakage and doesn't update the ring_elem
code properly, etc.
I plan to deal with all these things in later versions.
Right now I am seeking initial feedback on the proposed ideas.
I would also ask people to repeat the performance tests (see below)
on their platforms to confirm the impact.

More and more customers use (or try to use) DPDK-based apps within
overcommitted systems (multiple active threads over the same physical cores):
VM and container deployments, etc.
One quite common problem they hit is Lock-Holder Preemption (LHP) with rte_ring.
LHP is quite a common problem for spin-based sync primitives
(spin-locks, etc.) on overcommitted systems.
The situation gets much worse when some sort of
fair-locking technique is used (ticket-lock, etc.),
as then not only the lock-owner's but also the lock-waiters'
scheduling order matters a lot.
This is a well-known problem for kernels within VMs:
http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
The problem with rte_ring is that while head acquisition is a sort of
unfair locking, waiting on the tail is very similar to a ticket-lock scheme -
the tail has to be updated in a particular order.
That makes the current rte_ring implementation perform
really poorly in some overcommitted scenarios.
While it is probably not possible to completely resolve this problem in
userspace alone (without some kernel communication/intervention),
removing fairness in the tail update can mitigate it significantly.
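
For reference, this is where that fairness shows up today: the tail update in
the generic MP/MC path waits for all earlier threads. Roughly (paraphrased and
simplified from lib/librte_ring/rte_ring_generic.h, not verbatim):

  #include <rte_branch_prediction.h>
  #include <rte_pause.h>
  #include <rte_ring.h>

  static inline void
  update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
              uint32_t new_val, uint32_t single)
  {
          if (!single)
                  /*
                   * Wait until every thread that reserved an earlier chunk
                   * has published its tail: completions retire strictly in
                   * head-reservation order, just like a ticket lock.
                   * If one of those threads is preempted, everyone behind
                   * it keeps spinning here.
                   */
                  while (unlikely(ht->tail != old_val))
                          rte_pause();

          ht->tail = new_val;
  }
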
So this RFC proposes two new optional ring synchronization modes:
1) Head/Tail Sync (HTS) mode
    In this mode an enqueue/dequeue operation is fully serialized:
    only one thread at a time is allowed to perform a given op.
    As another enhancement, it provides the ability to split an
    enqueue/dequeue operation into two phases:
      - enqueue/dequeue start
      - enqueue/dequeue finish
    That allows the user to inspect objects in the ring without removing
    them from it (aka MT-safe peek; a usage sketch follows this list).
2) Relaxed Tail Sync (RTS) mode
    The main difference from the original MP/MC algorithm is that
    the tail value is increased not by every thread that finishes an
    enqueue/dequeue, but only by the last one.
    That allows threads to avoid spinning on the ring tail value,
    leaving the actual tail update to the last thread in the update queue
    (a simplified sketch also follows this list).
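
To give a feel for the two-phase (peek) usage enabled by HTS mode, here is a
hypothetical sketch; the RFC only describes the start/finish split, so the
function names below are invented for illustration and are not confirmed API:

  /* dequeue-side MT-safe peek on an HTS ring */
  struct rte_ring *r = get_hts_ring();     /* hypothetical helper */
  void *objs[8];
  unsigned int n;

  n = rte_ring_dequeue_bulk_start(r, objs, RTE_DIM(objs), NULL); /* assumed name */
  if (n != 0) {
          if (objects_look_ok(objs, n))            /* inspect in place */
                  rte_ring_dequeue_finish(r, n);   /* actually remove them */
          else
                  rte_ring_dequeue_finish(r, 0);   /* leave everything in the ring */
  }

And here is a simplified illustration of the RTS "last finisher updates the
tail" idea; this is not the code from the patch, just the principle, with
invented names:

  #include <stdatomic.h>
  #include <stdint.h>

  /* packed {ring position, count of started/completed ops} pair */
  union poscnt {
          uint64_t raw;
          struct {
                  uint32_t pos;
                  uint32_t cnt;
          } v;
  };

  struct rts_headtail {
          _Atomic uint64_t head;  /* poscnt of last reserved position */
          _Atomic uint64_t tail;  /* poscnt of last published position */
  };

  static void
  rts_update_tail(struct rts_headtail *ht)
  {
          union poscnt h, ot, nt;

          do {
                  ot.raw = atomic_load(&ht->tail);
                  h.raw = atomic_load(&ht->head);

                  nt = ot;
                  nt.v.cnt++;             /* one more op has completed */
                  /*
                   * The visible position only moves forward once all
                   * reserved ops have completed; it is published in a
                   * single CAS by whichever thread happens to finish
                   * last, so nobody spins waiting for earlier threads.
                   */
                  if (nt.v.cnt == h.v.cnt)
                          nt.v.pos = h.v.pos;
          } while (!atomic_compare_exchange_weak(&ht->tail, &ot.raw, nt.raw));
  }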

Test results on IA (see below) show significant improvements
in average enqueue/dequeue op times on overcommitted systems.
For 'classic' DPDK deployments (one thread per core) the original MP/MC
algorithm still shows the best numbers, though for the 64-bit target
the RTS numbers are not that far away.
Numbers were produced by ring_stress_*autotest
(first patch in this series).

X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
DEQ+ENQ average cycles/obj

                                                MP/MC      HTS     RTS
1thread@1core(--lcores=6-7)                     8.00       8.15    8.99
2thread@2core(--lcores=6-8)                     19.14      19.61   20.35
4thread@4core(--lcores=6-10)                    29.43      29.79   31.82
8thread@8core(--lcores=6-14)                    110.59     192.81  119.50
16thread@16core(--lcores=6-22)                  461.03     813.12  495.59
32thread@32core(--lcores='6-22,55-70')          982.90     1972.38 1160.51

2thread@1core(--lcores='6,(10-11)@7'            20140.50   23.58   25.14
4thread@2core(--lcores='6,(10-11)@7,(20-21)@8'  153680.60  76.88   80.05
8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  280314.32  294.72  318.79
16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 643176.59  1144.02 1175.14
32thread@2core(--lcores='6,(10-25)@7,(30-45)@8' 4264238.80 4627.48 4892.68

8thread@2core(--lcores='6,(10-17)@(7,8))'       321085.98  298.59  307.47
16thread@4core(--lcores='6,(20-35)@(7-10))'     1900705.61 575.35  678.29
32thread@4core(--lcores='6,(20-51)@(7-10))'     5510445.85 2164.36 2714.12

i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
DEQ+ENQ average cycles/obj

                                                MP/MC      HTS     RTS
1thread@1core(--lcores=6-7)                     7.85       12.13   11.31
2thread@2core(--lcores=6-8)                     17.89      24.52   21.86
8thread@8core(--lcores=6-14)                    32.58      354.20  54.58
32thread@32core(--lcores='6-22,55-70')          813.77     6072.41 2169.91

2thread@1core(--lcores='6,(10-11)@7'            16095.00   36.06   34.74
8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  1140354.54 346.61  361.57
16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 1920417.86 1314.90 1416.65

8thread@2core(--lcores='6,(10-17)@(7,8))'       594358.61  332.70  357.74
32thread@4core(--lcores='6,(20-51)@(7-10))'     5319896.86 2836.44 3028.87

Konstantin Ananyev (6):
  test/ring: add contention stress test
  ring: rework ring layout to allow new sync schemes
  ring: introduce RTS ring mode
  test/ring: add contention stress test for RTS ring
  ring: introduce HTS ring mode
  test/ring: add contention stress test for HTS ring

 app/test/Makefile                      |   3 +
 app/test/meson.build                   |   3 +
 app/test/test_pdump.c                  |   6 +-
 app/test/test_ring_hts_stress.c        |  28 ++
 app/test/test_ring_rts_stress.c        |  28 ++
 app/test/test_ring_stress.c            |  27 ++
 app/test/test_ring_stress.h            | 477 +++++++++++++++++++
 lib/librte_pdump/rte_pdump.c           |   2 +-
 lib/librte_port/rte_port_ring.c        |  12 +-
 lib/librte_ring/Makefile               |   4 +-
 lib/librte_ring/meson.build            |   4 +-
 lib/librte_ring/rte_ring.c             |  84 +++-
 lib/librte_ring/rte_ring.h             | 619 +++++++++++++++++++++++--
 lib/librte_ring/rte_ring_elem.h        |   8 +-
 lib/librte_ring/rte_ring_hts_generic.h | 228 +++++++++
 lib/librte_ring/rte_ring_rts_generic.h | 240 ++++++++++
 16 files changed, 1721 insertions(+), 52 deletions(-)
 create mode 100644 app/test/test_ring_hts_stress.c
 create mode 100644 app/test/test_ring_rts_stress.c
 create mode 100644 app/test/test_ring_stress.c
 create mode 100644 app/test/test_ring_stress.h
 create mode 100644 lib/librte_ring/rte_ring_hts_generic.h
 create mode 100644 lib/librte_ring/rte_ring_rts_generic.h
  

Comments

Stephen Hemminger Feb. 24, 2020, 4:59 p.m. UTC | #1
On Mon, 24 Feb 2020 11:35:09 +0000
Konstantin Ananyev <konstantin.ananyev@intel.com> wrote:

> Upfront note - that RFC is not a complete patch.
> It introduces an ABI breakage, plus it doesn't update ring_elem
> code properly, etc.
> I plan to deal with all these things in later versions.
> Right now I seek an initial feedback about proposed ideas.
> Would also ask people to repeat performance tests (see below)
> on their platforms to confirm the impact.
> 
> More and more customers use(/try to use) DPDK based apps within
> overcommitted systems (multiple acttive threads over same pysical cores):
> VM, container deployments, etc.
> One quite common problem they hit: Lock-Holder-Preemption with rte_ring.
> LHP is quite a common problem for spin-based sync primitives
> (spin-locks, etc.) on overcommitted systems.
> The situation gets much worse when some sort of
> fair-locking technique is used (ticket-lock, etc.).
> As now not only lock-owner but also lock-waiters scheduling
> order matters a lot.
> This is a well-known problem for kernel within VMs:
> http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> The problem with rte_ring is that while head accusion is sort of
> un-fair locking, waiting on tail is very similar to ticket lock schema -
> tail has to be updated in particular order.
> That makes current rte_ring implementation to perform
> really pure on some overcommited scenarios.

Rather than reform rte_ring to fit this scenario, it would make
more sense to me to introduce another primitive. The current lockless
ring performs very well for the isolated thread model that DPDK
was built around. This looks like a case of customers violating
the usage model of the DPDK and then being surprised at the fallout.
  
Jerin Jacob Feb. 24, 2020, 5:59 p.m. UTC | #2
On Mon, Feb 24, 2020 at 10:29 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Mon, 24 Feb 2020 11:35:09 +0000
> Konstantin Ananyev <konstantin.ananyev@intel.com> wrote:
>
> > Upfront note - that RFC is not a complete patch.
> > It introduces an ABI breakage, plus it doesn't update ring_elem
> > code properly, etc.
> > I plan to deal with all these things in later versions.
> > Right now I seek an initial feedback about proposed ideas.
> > Would also ask people to repeat performance tests (see below)
> > on their platforms to confirm the impact.
> >
> > More and more customers use(/try to use) DPDK based apps within
> > overcommitted systems (multiple acttive threads over same pysical cores):
> > VM, container deployments, etc.
> > One quite common problem they hit: Lock-Holder-Preemption with rte_ring.
> > LHP is quite a common problem for spin-based sync primitives
> > (spin-locks, etc.) on overcommitted systems.
> > The situation gets much worse when some sort of
> > fair-locking technique is used (ticket-lock, etc.).
> > As now not only lock-owner but also lock-waiters scheduling
> > order matters a lot.
> > This is a well-known problem for kernel within VMs:
> > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > The problem with rte_ring is that while head accusion is sort of
> > un-fair locking, waiting on tail is very similar to ticket lock schema -
> > tail has to be updated in particular order.
> > That makes current rte_ring implementation to perform
> > really pure on some overcommited scenarios.
>
> Rather than reform rte_ring to fit this scenario, it would make
> more sense to me to introduce another primitive. The current lockless
> ring performs very well for the isolated thread model that DPDK
> was built around. This looks like a case of customers violating
> the usage model of the DPDK and then being surprised at the fallout.

I agree with Stephen here.

I think adding more runtime checks in enqueue() and dequeue() will
have a bad effect on low-end cores too.
But I agree with the problem statement that in the virtualization use
case it may be possible to have N virtual cores running on a physical
core.

IMO, the best solution would be keeping the ring API the same and having
different flavors selected at compile time, something like what
liburcu did for accommodating different flavors.

i.e. urcu-qsbr.h and urcu-bp.h have identical API definitions; the
application can simply include ONE header file in a C file based on
the flavor.
If both are needed at runtime, the application needs a function pointer
or similar, and defines the function in a different C file by including
the appropriate flavor in that C file (see the sketch after the
includes below).

#include <urcu-qsbr.h> /* QSBR RCU flavor */
#include <urcu-bp.h> /* Bulletproof RCU flavor */
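
A minimal, self-contained sketch of the function-pointer arrangement described
above; all names are invented for illustration (in the real proposal each
implementation would live in its own C file that includes the appropriate
flavor header, as with urcu-qsbr.h vs urcu-bp.h):

  #include <stdio.h>

  typedef int (*enqueue_fn)(void *ring, void *obj);

  /* would be built in a C file that includes the classic MP/MC flavor */
  static int enqueue_mpmc(void *ring, void *obj)
  { (void)ring; (void)obj; puts("MP/MC enqueue"); return 0; }

  /* would be built in a C file that includes the overcommit-friendly flavor */
  static int enqueue_rts(void *ring, void *obj)
  { (void)ring; (void)obj; puts("RTS enqueue"); return 0; }

  int main(void)
  {
          int overcommitted = 1; /* detected once at startup, hypothetically */
          enqueue_fn enqueue = overcommitted ? enqueue_rts : enqueue_mpmc;

          return enqueue(NULL, NULL);
  }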

  
Stephen Hemminger Feb. 24, 2020, 7:35 p.m. UTC | #3
On Mon, 24 Feb 2020 23:29:57 +0530
Jerin Jacob <jerinjacobk@gmail.com> wrote:

> On Mon, Feb 24, 2020 at 10:29 PM Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> >
> > On Mon, 24 Feb 2020 11:35:09 +0000
> > Konstantin Ananyev <konstantin.ananyev@intel.com> wrote:
> >  
> > > Upfront note - that RFC is not a complete patch.
> > > It introduces an ABI breakage, plus it doesn't update ring_elem
> > > code properly, etc.
> > > I plan to deal with all these things in later versions.
> > > Right now I seek an initial feedback about proposed ideas.
> > > Would also ask people to repeat performance tests (see below)
> > > on their platforms to confirm the impact.
> > >
> > > More and more customers use(/try to use) DPDK based apps within
> > > overcommitted systems (multiple acttive threads over same pysical cores):
> > > VM, container deployments, etc.
> > > One quite common problem they hit: Lock-Holder-Preemption with rte_ring.
> > > LHP is quite a common problem for spin-based sync primitives
> > > (spin-locks, etc.) on overcommitted systems.
> > > The situation gets much worse when some sort of
> > > fair-locking technique is used (ticket-lock, etc.).
> > > As now not only lock-owner but also lock-waiters scheduling
> > > order matters a lot.
> > > This is a well-known problem for kernel within VMs:
> > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > The problem with rte_ring is that while head accusion is sort of
> > > un-fair locking, waiting on tail is very similar to ticket lock schema -
> > > tail has to be updated in particular order.
> > > That makes current rte_ring implementation to perform
> > > really pure on some overcommited scenarios.  
> >
> > Rather than reform rte_ring to fit this scenario, it would make
> > more sense to me to introduce another primitive. The current lockless
> > ring performs very well for the isolated thread model that DPDK
> > was built around. This looks like a case of customers violating
> > the usage model of the DPDK and then being surprised at the fallout.  
> 
> I agree with Stephen here.
> 
> I think, adding more runtime check in the enqueue() and dequeue() will
> have a bad effect on the low-end cores too.
> But I agree with the problem statement that in the virtualization use
> case, It may be possible to have N virtual cores runs on a physical
> core.
> 
> IMO, The best solution would be keeping the ring API same and have a
> different flavor in "compile-time". Something like
> liburcu did for accommodating different flavors.
> 
> i.e urcu-qsbr.h and urcu-bp.h will identical definition of API. The
> application can simply include ONE header file in a C file based on
> the flavor.
> If need both at runtime. Need to have function pointer or so in the
> application and define the function in different c file by including
> the approaite flavor in C file.

This would also be a good time to consider the tradeoffs of the
heavy use of inlining that is done in rte_ring vs the impact that
has on API/ABI stability.
  
Honnappa Nagarahalli Feb. 24, 2020, 8:52 p.m. UTC | #4
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Stephen Hemminger
> Sent: Monday, February 24, 2020 1:35 PM
> To: Jerin Jacob <jerinjacobk@gmail.com>
> Cc: Konstantin Ananyev <konstantin.ananyev@intel.com>; dpdk-dev
> <dev@dpdk.org>; Olivier Matz <olivier.matz@6wind.com>
> Subject: Re: [dpdk-dev] [RFC 0/6] New sync modes for ring
> 
> On Mon, 24 Feb 2020 23:29:57 +0530
> Jerin Jacob <jerinjacobk@gmail.com> wrote:
> 
> > On Mon, Feb 24, 2020 at 10:29 PM Stephen Hemminger
> > <stephen@networkplumber.org> wrote:
> > >
> > > On Mon, 24 Feb 2020 11:35:09 +0000
> > > Konstantin Ananyev <konstantin.ananyev@intel.com> wrote:
> > >
> > > > Upfront note - that RFC is not a complete patch.
> > > > It introduces an ABI breakage, plus it doesn't update ring_elem
> > > > code properly, etc.
> > > > I plan to deal with all these things in later versions.
> > > > Right now I seek an initial feedback about proposed ideas.
> > > > Would also ask people to repeat performance tests (see below) on
> > > > their platforms to confirm the impact.
> > > >
> > > > More and more customers use(/try to use) DPDK based apps within
> > > > overcommitted systems (multiple acttive threads over same pysical
> cores):
> > > > VM, container deployments, etc.
> > > > One quite common problem they hit: Lock-Holder-Preemption with
> rte_ring.
> > > > LHP is quite a common problem for spin-based sync primitives
> > > > (spin-locks, etc.) on overcommitted systems.
> > > > The situation gets much worse when some sort of fair-locking
> > > > technique is used (ticket-lock, etc.).
> > > > As now not only lock-owner but also lock-waiters scheduling order
> > > > matters a lot.
> > > > This is a well-known problem for kernel within VMs:
> > > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > > The problem with rte_ring is that while head accusion is sort of
> > > > un-fair locking, waiting on tail is very similar to ticket lock
> > > > schema - tail has to be updated in particular order.
> > > > That makes current rte_ring implementation to perform really pure
> > > > on some overcommited scenarios.
> > >
> > > Rather than reform rte_ring to fit this scenario, it would make more
> > > sense to me to introduce another primitive. The current lockless
> > > ring performs very well for the isolated thread model that DPDK was
> > > built around. This looks like a case of customers violating the
> > > usage model of the DPDK and then being surprised at the fallout.
> >
> > I agree with Stephen here.
> >
> > I think, adding more runtime check in the enqueue() and dequeue() will
> > have a bad effect on the low-end cores too.
> > But I agree with the problem statement that in the virtualization use
> > case, It may be possible to have N virtual cores runs on a physical
> > core.
> >
> > IMO, The best solution would be keeping the ring API same and have a
> > different flavor in "compile-time". Something like liburcu did for
> > accommodating different flavors.
> >
> > i.e urcu-qsbr.h and urcu-bp.h will identical definition of API. The
> > application can simply include ONE header file in a C file based on
> > the flavor.
> > If need both at runtime. Need to have function pointer or so in the
> > application and define the function in different c file by including
> > the approaite flavor in C file.
> 
> This would also be a good time to consider the tradeoffs of the heavy use of
> inlining that is done in rte_ring vs the impact that has on API/ABI stability.
> 
I was working on a few requirements in the rte_ring library for RCU defer APIs. The RFC is at https://patchwork.dpdk.org/cover/66020/.
  
Honnappa Nagarahalli Feb. 25, 2020, 12:58 a.m. UTC | #5
<snip>

> 
> On Mon, Feb 24, 2020 at 10:29 PM Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> >
> > On Mon, 24 Feb 2020 11:35:09 +0000
> > Konstantin Ananyev <konstantin.ananyev@intel.com> wrote:
> >
> > > Upfront note - that RFC is not a complete patch.
> > > It introduces an ABI breakage, plus it doesn't update ring_elem code
> > > properly, etc.
> > > I plan to deal with all these things in later versions.
> > > Right now I seek an initial feedback about proposed ideas.
> > > Would also ask people to repeat performance tests (see below) on
> > > their platforms to confirm the impact.
> > >
> > > More and more customers use(/try to use) DPDK based apps within
> > > overcommitted systems (multiple acttive threads over same pysical cores):
> > > VM, container deployments, etc.
> > > One quite common problem they hit: Lock-Holder-Preemption with
> rte_ring.
> > > LHP is quite a common problem for spin-based sync primitives
> > > (spin-locks, etc.) on overcommitted systems.
> > > The situation gets much worse when some sort of fair-locking
> > > technique is used (ticket-lock, etc.).
> > > As now not only lock-owner but also lock-waiters scheduling order
> > > matters a lot.
> > > This is a well-known problem for kernel within VMs:
> > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
These slides seem to indicate that the problems are mitigated through the Hypervisor configuration. Do we still need to address the issues?

> > > The problem with rte_ring is that while head accusion is sort of
> > > un-fair locking, waiting on tail is very similar to ticket lock
> > > schema - tail has to be updated in particular order.
> > > That makes current rte_ring implementation to perform really pure on
> > > some overcommited scenarios.
> >
> > Rather than reform rte_ring to fit this scenario, it would make more
> > sense to me to introduce another primitive. The current lockless ring
> > performs very well for the isolated thread model that DPDK was built
> > around. This looks like a case of customers violating the usage model
> > of the DPDK and then being surprised at the fallout.
> 
> I agree with Stephen here.
> 
> I think, adding more runtime check in the enqueue() and dequeue() will have a
> bad effect on the low-end cores too.
> But I agree with the problem statement that in the virtualization use case, It
> may be possible to have N virtual cores runs on a physical core.
It is hard to imagine that there are data plane applications deployed in such environments. Wouldn't this affect the performance terribly?

> 
> IMO, The best solution would be keeping the ring API same and have a
> different flavor in "compile-time". Something like liburcu did for
> accommodating different flavors.
> 
> i.e urcu-qsbr.h and urcu-bp.h will identical definition of API. The application
> can simply include ONE header file in a C file based on the flavor.
> If need both at runtime. Need to have function pointer or so in the application
> and define the function in different c file by including the approaite flavor in C
> file.
> 
> #include <urcu-qsbr.h> /* QSBR RCU flavor */ #include <urcu-bp.h> /*
> Bulletproof RCU flavor */
> 
> >
  
Ananyev, Konstantin Feb. 25, 2020, 11:45 a.m. UTC | #6
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Stephen Hemminger
> > Sent: Monday, February 24, 2020 1:35 PM
> > To: Jerin Jacob <jerinjacobk@gmail.com>
> > Cc: Konstantin Ananyev <konstantin.ananyev@intel.com>; dpdk-dev
> > <dev@dpdk.org>; Olivier Matz <olivier.matz@6wind.com>
> > Subject: Re: [dpdk-dev] [RFC 0/6] New sync modes for ring
> >
> > On Mon, 24 Feb 2020 23:29:57 +0530
> > Jerin Jacob <jerinjacobk@gmail.com> wrote:
> >
> > > On Mon, Feb 24, 2020 at 10:29 PM Stephen Hemminger
> > > <stephen@networkplumber.org> wrote:
> > > >
> > > > On Mon, 24 Feb 2020 11:35:09 +0000
> > > > Konstantin Ananyev <konstantin.ananyev@intel.com> wrote:
> > > >
> > > > > Upfront note - that RFC is not a complete patch.
> > > > > It introduces an ABI breakage, plus it doesn't update ring_elem
> > > > > code properly, etc.
> > > > > I plan to deal with all these things in later versions.
> > > > > Right now I seek an initial feedback about proposed ideas.
> > > > > Would also ask people to repeat performance tests (see below) on
> > > > > their platforms to confirm the impact.
> > > > >
> > > > > More and more customers use(/try to use) DPDK based apps within
> > > > > overcommitted systems (multiple acttive threads over same pysical
> > cores):
> > > > > VM, container deployments, etc.
> > > > > One quite common problem they hit: Lock-Holder-Preemption with
> > rte_ring.
> > > > > LHP is quite a common problem for spin-based sync primitives
> > > > > (spin-locks, etc.) on overcommitted systems.
> > > > > The situation gets much worse when some sort of fair-locking
> > > > > technique is used (ticket-lock, etc.).
> > > > > As now not only lock-owner but also lock-waiters scheduling order
> > > > > matters a lot.
> > > > > This is a well-known problem for kernel within VMs:
> > > > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > > > The problem with rte_ring is that while head accusion is sort of
> > > > > un-fair locking, waiting on tail is very similar to ticket lock
> > > > > schema - tail has to be updated in particular order.
> > > > > That makes current rte_ring implementation to perform really pure
> > > > > on some overcommited scenarios.
> > > >
> > > > Rather than reform rte_ring to fit this scenario, it would make more
> > > > sense to me to introduce another primitive. The current lockless
> > > > ring performs very well for the isolated thread model that DPDK was
> > > > built around. This looks like a case of customers violating the
> > > > usage model of the DPDK and then being surprised at the fallout.
> > >
> > > I agree with Stephen here.
> > >
> > > I think, adding more runtime check in the enqueue() and dequeue() will
> > > have a bad effect on the low-end cores too.
> > > But I agree with the problem statement that in the virtualization use
> > > case, It may be possible to have N virtual cores runs on a physical
> > > core.
> > >
> > > IMO, The best solution would be keeping the ring API same and have a
> > > different flavor in "compile-time". Something like liburcu did for
> > > accommodating different flavors.
> > >
> > > i.e urcu-qsbr.h and urcu-bp.h will identical definition of API. The
> > > application can simply include ONE header file in a C file based on
> > > the flavor.
> > > If need both at runtime. Need to have function pointer or so in the
> > > application and define the function in different c file by including
> > > the approaite flavor in C file.
> >
> > This would also be a good time to consider the tradeoffs of the heavy use of
> > inlining that is done in rte_ring vs the impact that has on API/ABI stability.
> >
> I was working on few requirements in rte_ring library for RCU defer APIs. RFC is at https://patchwork.dpdk.org/cover/66020/.

Yep, I noticed your patch; it seems we sort of collided here.
As I understand it, your patch aims to provide functionality similar to my HTS one.
I will try to look at yours in the next few days; hopefully we can end up with
some common denominator.
Konstantin
  
Ananyev, Konstantin Feb. 25, 2020, 1:41 p.m. UTC | #7
> > > > Upfront note - that RFC is not a complete patch.
> > > > It introduces an ABI breakage, plus it doesn't update ring_elem
> > > > code properly, etc.
> > > > I plan to deal with all these things in later versions.
> > > > Right now I seek an initial feedback about proposed ideas.
> > > > Would also ask people to repeat performance tests (see below)
> > > > on their platforms to confirm the impact.
> > > >
> > > > More and more customers use(/try to use) DPDK based apps within
> > > > overcommitted systems (multiple acttive threads over same pysical cores):
> > > > VM, container deployments, etc.
> > > > One quite common problem they hit: Lock-Holder-Preemption with rte_ring.
> > > > LHP is quite a common problem for spin-based sync primitives
> > > > (spin-locks, etc.) on overcommitted systems.
> > > > The situation gets much worse when some sort of
> > > > fair-locking technique is used (ticket-lock, etc.).
> > > > As now not only lock-owner but also lock-waiters scheduling
> > > > order matters a lot.
> > > > This is a well-known problem for kernel within VMs:
> > > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > > The problem with rte_ring is that while head accusion is sort of
> > > > un-fair locking, waiting on tail is very similar to ticket lock schema -
> > > > tail has to be updated in particular order.
> > > > That makes current rte_ring implementation to perform
> > > > really pure on some overcommited scenarios.
> > >
> > > Rather than reform rte_ring to fit this scenario, it would make
> > > more sense to me to introduce another primitive. 

I don't see many advantages it would bring us.
As disadvantages: for developers and maintainers - code duplication;
for end users - extra code churn and the removed ability to mix and match
different sync modes in one ring.

> The current lockless
> > > ring performs very well for the isolated thread model that DPDK
> > > was built around. This looks like a case of customers violating
> > > the usage model of the DPDK and then being surprised at the fallout.

For customers using the isolated thread model, nothing should change
(both in terms of API and performance).
The existing sync modes MP/MC and SP/SC are kept untouched, set up in the same
way (via flags and _init_), and MP/MC remains the default one.
On the other hand, I don't see why we should ignore customers that want to use
their DPDK apps in different deployment scenarios.

> >
> > I agree with Stephen here.
> >
> > I think, adding more runtime check in the enqueue() and dequeue() will
> > have a bad effect on the low-end cores too.

We do have a run-time check in our current enqueue()/dequeue() implementation.
In fact we support both modes: we have the generic rte_ring_enqueue(/dequeue)_bulk(/burst),
where the sync behaviour is determined at runtime by the value of prod(/cons).single,
or the user can call the rte_ring_(mp/sp)_enqueue_* functions directly.
This RFC follows exactly the same paradigm:
rte_ring_enqueue(/dequeue)_bulk(/burst) is kept generic and its
behaviour is determined at runtime by the value of prod(/cons).sync_type,
or the user can call enqueue/dequeue with a particular sync mode directly:
rte_ring_(mp/sp/rts/hts)_enqueue_(bulk/burst)*.
The only thing that changed:
 the format of prod/cons could now differ depending on the mode selected at _init_.
 So you can't create a ring for, let's say, SP mode and then in the middle of the data-path
 change your mind and start using MP_RTS mode.
 For the existing modes (SP/MP, SC/MC) the format remains the same and the user can still
 use them interchangeably, though of course that is an error-prone practice.
 A usage sketch follows.
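
Roughly, under that paradigm usage could look like the sketch below. The flag
and function names for the new modes are assumptions based on the naming
pattern above (rte_ring_(mp/sp/rts/hts)_enqueue_*), not confirmed spellings
from the patch:

  #include <rte_lcore.h>
  #include <rte_ring.h>

  static struct rte_ring *r;

  int setup(void)
  {
          /* the sync mode is fixed when the ring is created ... */
          r = rte_ring_create("overcommit_ring", 1024, rte_socket_id(),
                              RING_F_MP_RTS_ENQ | RING_F_MC_RTS_DEQ); /* assumed flag names */
          return (r == NULL) ? -1 : 0;
  }

  int producer(void * const *objs, unsigned int n)
  {
          /* ... the generic call dispatches on prod.sync_type at runtime ... */
          return rte_ring_enqueue_bulk(r, objs, n, NULL);
          /* ... or the mode-specific entry point, e.g.
           * rte_ring_rts_enqueue_bulk() (name assumed), can be called directly. */
  }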

> > But I agree with the problem statement that in the virtualization use
> > case, It may be possible to have N virtual cores runs on a physical
> > core.
> >
> > IMO, The best solution would be keeping the ring API same and have a
> > different flavor in "compile-time". Something like
> > liburcu did for accommodating different flavors.
> >
> > i.e urcu-qsbr.h and urcu-bp.h will identical definition of API. The
> > application can simply include ONE header file in a C file based on
> > the flavor.

I don't think that is a flexible enough approach.
In one app a user might need several rings with different sync modes,
or even a ring with different sync modes for enqueue and dequeue.

> > If need both at runtime. Need to have function pointer or so in the
> > application and define the function in different c file by including
> > the approaite flavor in C file.

A big issue with function pointers here would be the DPDK multi-process model.
AFAIK, rte_ring is quite a popular mechanism for IPC between DPDK apps.
To support such a model, we would need to split rte_ring data into 'shared'
and 'private' parts and initialize the private one for every process that is going to use it.
That sounds like a massive change, and I am not sure the required effort would be worth it.
BTW, if the user just calls API functions without trying to access structure internals directly,
I don't think it would make a big difference to them what is inside:
an indirect function call or an inlined switch(...) {}.
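
To illustrate what the "inlined switch" alternative would look like behind the
generic API (a sketch only, assuming the patched headers from this series; the
enum names are assumptions following the sync_type naming above, not confirmed
spellings from the patch):

  #include <rte_ring.h>   /* patched version with prod/cons.sync_type */

  static inline unsigned int
  generic_enqueue_bulk(struct rte_ring *r, void * const *objs,
                       unsigned int n, unsigned int *free_space)
  {
          switch (r->prod.sync_type) {    /* field name taken from the text above */
          case RTE_RING_SYNC_MT:          /* classic MP; enum names assumed */
                  return rte_ring_mp_enqueue_bulk(r, objs, n, free_space);
          case RTE_RING_SYNC_ST:          /* SP; assumed */
                  return rte_ring_sp_enqueue_bulk(r, objs, n, free_space);
          case RTE_RING_SYNC_MT_RTS:      /* assumed */
                  return rte_ring_rts_enqueue_bulk(r, objs, n, free_space); /* assumed name */
          case RTE_RING_SYNC_MT_HTS:      /* assumed */
                  return rte_ring_hts_enqueue_bulk(r, objs, n, free_space); /* assumed name */
          }
          return 0;
  }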

> This would also be a good time to consider the tradeoffs of the
> heavy use of inlining that is done in rte_ring vs the impact that
> has on API/ABI stability.

Yes, hiding the rte_ring implementation inside a .c file would help a lot
in terms of ABI maintenance and would make our future life easier.
The question is what the price for it is in terms of performance,
and whether we are ready to pay it. Not to mention that it would cause
changes in many other libs/apps...
So I think it should be a subject for a separate discussion.
But I agree it would be good at least to measure the performance
impact of such a change.
If I have some spare cycles, I will give it a try.
Meanwhile, can I ask Jerin and the other guys to repeat the tests from this RFC
on their HW? Before continuing the discussion it would probably be good to know
whether the suggested patch works as expected across different platforms.
Thanks
Konstantin
  
Ananyev, Konstantin Feb. 25, 2020, 3:14 p.m. UTC | #8
> > > > Upfront note - that RFC is not a complete patch.
> > > > It introduces an ABI breakage, plus it doesn't update ring_elem code
> > > > properly, etc.
> > > > I plan to deal with all these things in later versions.
> > > > Right now I seek an initial feedback about proposed ideas.
> > > > Would also ask people to repeat performance tests (see below) on
> > > > their platforms to confirm the impact.
> > > >
> > > > More and more customers use(/try to use) DPDK based apps within
> > > > overcommitted systems (multiple acttive threads over same pysical cores):
> > > > VM, container deployments, etc.
> > > > One quite common problem they hit: Lock-Holder-Preemption with
> > rte_ring.
> > > > LHP is quite a common problem for spin-based sync primitives
> > > > (spin-locks, etc.) on overcommitted systems.
> > > > The situation gets much worse when some sort of fair-locking
> > > > technique is used (ticket-lock, etc.).
> > > > As now not only lock-owner but also lock-waiters scheduling order
> > > > matters a lot.
> > > > This is a well-known problem for kernel within VMs:
> > > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> These slides seem to indicate that the problems are mitigated through the Hypervisor configuration. Do we still need to address the issues?

I am not really an expert here, but AFAIK the current mitigations deal mostly with the guest kernel:
Linux implements PV versions of spinlocks (unfair and/or based on hypercall availability),
and the hypervisor might make decisions itself based on whether the guest is in user or kernel mode,
plus on some special CPU instructions.
We spin in user-space mode.
Hypervisors might have become smarter these days, but so far
I have heard about a few different customers that hit such a problem.
As an example, this NA DPDK summit presentation:
https://dpdkna2019.sched.com/event/WYBG/dpdk-containers-challenges-solutions-wang-yong-zte
page 16 (problem #4) describes the same issue.

> 
> > > > The problem with rte_ring is that while head accusion is sort of
> > > > un-fair locking, waiting on tail is very similar to ticket lock
> > > > schema - tail has to be updated in particular order.
> > > > That makes current rte_ring implementation to perform really pure on
> > > > some overcommited scenarios.
> > >
> > > Rather than reform rte_ring to fit this scenario, it would make more
> > > sense to me to introduce another primitive. The current lockless ring
> > > performs very well for the isolated thread model that DPDK was built
> > > around. This looks like a case of customers violating the usage model
> > > of the DPDK and then being surprised at the fallout.
> >
> > I agree with Stephen here.
> >
> > I think, adding more runtime check in the enqueue() and dequeue() will have a
> > bad effect on the low-end cores too.
> > But I agree with the problem statement that in the virtualization use case, It
> > may be possible to have N virtual cores runs on a physical core.
> It is hard to imagine that there are data plane applications deployed in such environments. Wouldn't this affect the performance terribly?

It wouldn't reach the same performance as isolated threads,
but for some tasks it might be enough.
AFAIK, one quite common scenario is a few isolated threads (or processes) doing
the actual IO and then spreading packets over dozens (or hundreds) of non-isolated
consumers.

> 
> >
> > IMO, The best solution would be keeping the ring API same and have a
> > different flavor in "compile-time". Something like liburcu did for
> > accommodating different flavors.
> >
> > i.e urcu-qsbr.h and urcu-bp.h will identical definition of API. The application
> > can simply include ONE header file in a C file based on the flavor.
> > If need both at runtime. Need to have function pointer or so in the application
> > and define the function in different c file by including the approaite flavor in C
> > file.
> >
> > #include <urcu-qsbr.h> /* QSBR RCU flavor */ #include <urcu-bp.h> /*
> > Bulletproof RCU flavor */
  
Morten Brørup Feb. 26, 2020, 4:53 p.m. UTC | #9
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> Konstantin
> 

<snip>

> > > > > More and more customers use(/try to use) DPDK based apps within
> > > > > overcommitted systems (multiple acttive threads over same
> pysical cores):
> > > > > VM, container deployments, etc.

<snip>

> > > > > That makes current rte_ring implementation to perform
> > > > > really pure on some overcommited scenarios.
> > > >
> > > > Rather than reform rte_ring to fit this scenario, it would make
> > > > more sense to me to introduce another primitive.
> 
> I don't see much advantages it will bring us.
> As a disadvantages, for developers and maintainers - code duplication,
> for end users - extra code churn and removed ability to mix and match
> different sync modes in one ring.
> 

I strongly agree with Konstantin on this.

Please consider this discussion at a higher abstraction level:

As DPDK applications grow in number and popularity, people will deploy them in overcommitted systems. In this scenario, I consider it extremely unlikely that the user is able to dedicate N physical cores to N specific lcores of the M total lcores of the DPDK application. In the typical hypervisor scenario, the user will be able to assign a number of virtual CPUs to the virtual machine running the DPDK application, and these vCPUs will either be dedicated to the virtual machine or shared with other virtual machines.

DPDK is currently designed for dedicated CPUs only. If a user runs it on shared vCPUs, the user is violating an important DPDK precondition, and DPDK's behavior is undefined!

Adding the ability to run DPDK applications on shared vCPUs would be a great improvement.

I prefer that support for this is as painless as possible for the DPDK application developer.

Perhaps run-time detection during the EAL initialization could be a solution. The EAL would then configure relevant libraries to use the appropriate synchronization primitives.


Med venlig hilsen / kind regards
- Morten Brørup
  
Jerin Jacob Feb. 27, 2020, 10:31 a.m. UTC | #10
On Tue, Feb 25, 2020 at 7:11 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:

> We do have a run-time check in our current enqueue()/dequeue implementation.
> In fact we support both modes: we have generic rte_ring_enqueue(/dequeue)_bulk(/burst)
> where sync behaviour is determined at runtime by value of prod(/cons).single.
> Or user can call  rte_ring_(mp/sp)_enqueue_* functions directly.
> This RFC follows exactly the same paradigm:
> rte_ring_enqueue(/dequeue)_bulk(/burst) kept generic and it's
> behaviour is determined at runtime, by value of prod(/cons).sync_type.
> Or user can call enqueue/dequeue with particular sync mode directly:
> rte_ring_(mp/sp/rts/hts)_enqueue_(bulk/burst)*.
> The only thing that changed:
>  Format of prod/cons now could differ depending on mode selected at _init_.
>  So you can't create a ring for let say SP mode and then in the middle of data-path
>  change your mind and start using MP_RTS mode.
>  For existing modes (SP/MP, SC/MC)  format remains the same and user can still
>  use them interchangeably, though of course that is an error prone practice.

Makes sense.


>
> > > But I agree with the problem statement that in the virtualization use
> > > case, It may be possible to have N virtual cores runs on a physical
> > > core.
> > >
> > > IMO, The best solution would be keeping the ring API same and have a
> > > different flavor in "compile-time". Something like
> > > liburcu did for accommodating different flavors.
> > >
> > > i.e urcu-qsbr.h and urcu-bp.h will identical definition of API. The
> > > application can simply include ONE header file in a C file based on
> > > the flavor.
>
> I don't think it is a flexible enough approach.
> In one app user might need to have several rings with different sync modes.
> Or even user might need a ring with different sync modes for enqueue/dequeue.

Ack.


> Yes, hiding rte_ring implementation inside .c would help a lot
> in terms of ABI maintenance and would make our future life easier.
> The question is what is the price for it in terms of performance,
> and are we ready to pay it. Not to mention that it would cause
> changes in many other libs/apps...
> So I think it should be a subject for a separate discussion.
> But, agree it would be good at least to measure the performance
> impact of such change.
> If I'll have some spare cycles, will give it a try.
> Meanwhile, can I ask Jerin and other guys to repeat tests from this RFC
> on their HW? Before continuing discussion would probably be good to know
> does the suggested patch work as expected across different platforms.


I tested on arm64 HW. The first section below is without the
patch (20.02) and the latter one is with this patch.
I agree with Konstantin that getting more platform tests early will be good,
so that we can settle on the approach
and avoid back and forth later.


RTE>>ring_perf_autotest // without path

### Testing single element enq/deq ###
legacy APIs: SP/SC: single: 289.78
legacy APIs: MP/MC: single: 516.20

### Testing burst enq/deq ###
legacy APIs: SP/SC: burst (size: 8): 312.88
legacy APIs: SP/SC: burst (size: 32): 426.72
legacy APIs: MP/MC: burst (size: 8): 510.95
legacy APIs: MP/MC: burst (size: 32): 702.01

### Testing bulk enq/deq ###
legacy APIs: SP/SC: bulk (size: 8): 306.74
legacy APIs: SP/SC: bulk (size: 32): 411.56
legacy APIs: MP/MC: bulk (size: 8): 501.32
legacy APIs: MP/MC: bulk (size: 32): 693.07

### Testing empty bulk deq ###
legacy APIs: SP/SC: bulk (size: 8): 7.00
legacy APIs: MP/MC: bulk (size: 8): 7.00

### Testing using two physical cores ###
legacy APIs: SP/SC: bulk (size: 8): 74.36
legacy APIs: MP/MC: bulk (size: 8): 110.18
legacy APIs: SP/SC: bulk (size: 32): 23.04
legacy APIs: MP/MC: bulk (size: 32): 32.29

### Testing using all slave nodes ##
Bulk enq/dequeue count on size 8
Core [8] count = 293741
Core [9] count = 293741
Total count (size: 8): 587482

Bulk enq/dequeue count on size 32
Core [8] count = 244909
Core [9] count = 244909
Total count (size: 32): 1077300

### Testing single element enq/deq ###
elem APIs: element size 16B: SP/SC: single: 255.37
elem APIs: element size 16B: MP/MC: single: 456.68

### Testing burst enq/deq ###
elem APIs: element size 16B: SP/SC: burst (size: 8): 291.99
elem APIs: element size 16B: SP/SC: burst (size: 32): 456.25
elem APIs: element size 16B: MP/MC: burst (size: 8): 497.77
elem APIs: element size 16B: MP/MC: burst (size: 32): 680.87

### Testing bulk enq/deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 284.40
elem APIs: element size 16B: SP/SC: bulk (size: 32): 453.17
elem APIs: element size 16B: MP/MC: bulk (size: 8): 485.77
elem APIs: element size 16B: MP/MC: bulk (size: 32): 675.08

### Testing empty bulk deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 8.00
elem APIs: element size 16B: MP/MC: bulk (size: 8): 7.00

### Testing using two physical cores ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 74.45
elem APIs: element size 16B: MP/MC: bulk (size: 8): 105.91
elem APIs: element size 16B: SP/SC: bulk (size: 32): 22.92
elem APIs: element size 16B: MP/MC: bulk (size: 32): 31.55

### Testing using all slave nodes ###

Bulk enq/dequeue count on size 8
Core [8] count = 308724
Core [9] count = 308723
Total count (size: 8): 617447

Bulk enq/dequeue count on size 32
Core [8] count = 214269
Core [9] count = 214269
Total count (size: 32): 1045985

RTE>>ring_perf_autotest // with patch

### Testing single element enq/deq ###
legacy APIs: SP/SC: single: 289.78
legacy APIs: MP/MC: single: 475.76

### Testing burst enq/deq ###
legacy APIs: SP/SC: burst (size: 8): 323.91
legacy APIs: SP/SC: burst (size: 32): 424.60
legacy APIs: MP/MC: burst (size: 8): 523.00
legacy APIs: MP/MC: burst (size: 32): 717.09

### Testing bulk enq/deq ###
legacy APIs: SP/SC: bulk (size: 8): 317.74
legacy APIs: SP/SC: bulk (size: 32): 413.57
legacy APIs: MP/MC: bulk (size: 8): 512.89
legacy APIs: MP/MC: bulk (size: 32): 712.45

### Testing empty bulk deq ###
legacy APIs: SP/SC: bulk (size: 8): 7.00
legacy APIs: MP/MC: bulk (size: 8): 7.00

### Testing using two physical cores ###
legacy APIs: SP/SC: bulk (size: 8): 74.82
legacy APIs: MP/MC: bulk (size: 8): 96.45
legacy APIs: SP/SC: bulk (size: 32): 22.97
legacy APIs: MP/MC: bulk (size: 32): 32.52

### Testing using all slave nodes ###

Bulk enq/dequeue count on size 8
Core [8] count = 283928
Core [9] count = 283927
Total count (size: 8): 567855

Bulk enq/dequeue count on size 32
Core [8] count = 223916
Core [9] count = 223915
Total count (size: 32): 1015686

### Testing single element enq/deq ###
elem APIs: element size 16B: SP/SC: single: 267.65
elem APIs: element size 16B: MP/MC: single: 439.06

### Testing burst enq/deq ###
elem APIs: element size 16B: SP/SC: burst (size: 8): 302.44
elem APIs: element size 16B: SP/SC: burst (size: 32): 466.31
elem APIs: element size 16B: MP/MC: burst (size: 8): 502.51
elem APIs: element size 16B: MP/MC: burst (size: 32): 695.81

### Testing bulk enq/deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 295.15
elem APIs: element size 16B: SP/SC: bulk (size: 32): 462.77
elem APIs: element size 16B: MP/MC: bulk (size: 8): 496.89
elem APIs: element size 16B: MP/MC: bulk (size: 32): 690.46

### Testing empty bulk deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 7.50
elem APIs: element size 16B: MP/MC: bulk (size: 8): 7.44

### Testing using two physical cores ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 65.85
elem APIs: element size 16B: MP/MC: bulk (size: 8): 103.80
elem APIs: element size 16B: SP/SC: bulk (size: 32): 23.27
elem APIs: element size 16B: MP/MC: bulk (size: 32): 31.17

### Testing using all slave nodes ###

Bulk enq/dequeue count on size 8
Core [8] count = 304223
Core [9] count = 304221
Total count (size: 8): 608444

Bulk enq/dequeue count on size 32
Core [8] count = 214856
Core [9] count = 214855
Total count (size: 32): 1038155
Test OK
RTE>>quit

> Thanks
> Konstantin
  
David Christensen Feb. 28, 2020, 12:17 a.m. UTC | #11
> On Tue, Feb 25, 2020 at 7:11 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> 
>> We do have a run-time check in our current enqueue()/dequeue implementation.
>> In fact we support both modes: we have generic rte_ring_enqueue(/dequeue)_bulk(/burst)
>> where sync behaviour is determined at runtime by value of prod(/cons).single.
>> Or user can call  rte_ring_(mp/sp)_enqueue_* functions directly.
>> This RFC follows exactly the same paradigm:
>> rte_ring_enqueue(/dequeue)_bulk(/burst) kept generic and it's
>> behaviour is determined at runtime, by value of prod(/cons).sync_type.
>> Or user can call enqueue/dequeue with particular sync mode directly:
>> rte_ring_(mp/sp/rts/hts)_enqueue_(bulk/burst)*.
>> The only thing that changed:
>>   Format of prod/cons now could differ depending on mode selected at _init_.
>>   So you can't create a ring for let say SP mode and then in the middle of data-path
>>   change your mind and start using MP_RTS mode.
>>   For existing modes (SP/MP, SC/MC)  format remains the same and user can still
>>   use them interchangeably, though of course that is an error prone practice.
> 
> Makes sense.
> 
> 
>>
>>>> But I agree with the problem statement that in the virtualization use
>>>> case, It may be possible to have N virtual cores runs on a physical
>>>> core.
>>>>
>>>> IMO, The best solution would be keeping the ring API same and have a
>>>> different flavor in "compile-time". Something like
>>>> liburcu did for accommodating different flavors.
>>>>
>>>> i.e urcu-qsbr.h and urcu-bp.h will identical definition of API. The
>>>> application can simply include ONE header file in a C file based on
>>>> the flavor.
>>
>> I don't think it is a flexible enough approach.
>> In one app user might need to have several rings with different sync modes.
>> Or even user might need a ring with different sync modes for enqueue/dequeue.
> 
> Ack.
> 
> 
>> Yes, hiding rte_ring implementation inside .c would help a lot
>> in terms of ABI maintenance and would make our future life easier.
>> The question is what is the price for it in terms of performance,
>> and are we ready to pay it. Not to mention that it would cause
>> changes in many other libs/apps...
>> So I think it should be a subject for a separate discussion.
>> But, agree it would be good at least to measure the performance
>> impact of such change.
>> If I'll have some spare cycles, will give it a try.
>> Meanwhile, can I ask Jerin and other guys to repeat tests from this RFC
>> on their HW? Before continuing discussion would probably be good to know
>> does the suggested patch work as expected across different platforms.
> 
> 
> I tested on an arm64 HW. The former section is without the
> patch(20.02) and later one with this patch.
> I agree with Konstantin that getting more platform tests will be good
> early so that we can focus on the approach
> to avoid back and forth latter.
> 
> 
> RTE>>ring_perf_autotest // without path
> 
> ### Testing single element enq/deq ###
> legacy APIs: SP/SC: single: 289.78
> legacy APIs: MP/MC: single: 516.20
> 
> ### Testing burst enq/deq ###
> legacy APIs: SP/SC: burst (size: 8): 312.88
> legacy APIs: SP/SC: burst (size: 32): 426.72
> legacy APIs: MP/MC: burst (size: 8): 510.95
> legacy APIs: MP/MC: burst (size: 32): 702.01
> 
> ### Testing bulk enq/deq ###
> legacy APIs: SP/SC: bulk (size: 8): 306.74
> legacy APIs: SP/SC: bulk (size: 32): 411.56
> legacy APIs: MP/MC: bulk (size: 8): 501.32
> legacy APIs: MP/MC: bulk (size: 32): 693.07
> 
> ### Testing empty bulk deq ###
> legacy APIs: SP/SC: bulk (size: 8): 7.00
> legacy APIs: MP/MC: bulk (size: 8): 7.00
> 
> ### Testing using two physical cores ###
> legacy APIs: SP/SC: bulk (size: 8): 74.36
> legacy APIs: MP/MC: bulk (size: 8): 110.18
> legacy APIs: SP/SC: bulk (size: 32): 23.04
> legacy APIs: MP/MC: bulk (size: 32): 32.29
> 
> ### Testing using all slave nodes ##
> Bulk enq/dequeue count on size 8
> Core [8] count = 293741
> Core [9] count = 293741
> Total count (size: 8): 587482
> 
> Bulk enq/dequeue count on size 32
> Core [8] count = 244909
> Core [9] count = 244909
> Total count (size: 32): 1077300
> 
> ### Testing single element enq/deq ###
> elem APIs: element size 16B: SP/SC: single: 255.37
> elem APIs: element size 16B: MP/MC: single: 456.68
> 
> ### Testing burst enq/deq ###
> elem APIs: element size 16B: SP/SC: burst (size: 8): 291.99
> elem APIs: element size 16B: SP/SC: burst (size: 32): 456.25
> elem APIs: element size 16B: MP/MC: burst (size: 8): 497.77
> elem APIs: element size 16B: MP/MC: burst (size: 32): 680.87
> 
> ### Testing bulk enq/deq ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 284.40
> elem APIs: element size 16B: SP/SC: bulk (size: 32): 453.17
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 485.77
> elem APIs: element size 16B: MP/MC: bulk (size: 32): 675.08
> 
> ### Testing empty bulk deq ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 8.00
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 7.00
> 
> ### Testing using two physical cores ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 74.45
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 105.91
> elem APIs: element size 16B: SP/SC: bulk (size: 32): 22.92
> elem APIs: element size 16B: MP/MC: bulk (size: 32): 31.55
> 
> ### Testing using all slave nodes ###
> 
> Bulk enq/dequeue count on size 8
> Core [8] count = 308724
> Core [9] count = 308723
> Total count (size: 8): 617447
> 
> Bulk enq/dequeue count on size 32
> Core [8] count = 214269
> Core [9] count = 214269
> Total count (size: 32): 1045985
> 
> RTE>>ring_perf_autotest // with patch
> 
> ### Testing single element enq/deq ###
> legacy APIs: SP/SC: single: 289.78
> legacy APIs: MP/MC: single: 475.76
> 
> ### Testing burst enq/deq ###
> legacy APIs: SP/SC: burst (size: 8): 323.91
> legacy APIs: SP/SC: burst (size: 32): 424.60
> legacy APIs: MP/MC: burst (size: 8): 523.00
> legacy APIs: MP/MC: burst (size: 32): 717.09
> 
> ### Testing bulk enq/deq ###
> legacy APIs: SP/SC: bulk (size: 8): 317.74
> legacy APIs: SP/SC: bulk (size: 32): 413.57
> legacy APIs: MP/MC: bulk (size: 8): 512.89
> legacy APIs: MP/MC: bulk (size: 32): 712.45
> 
> ### Testing empty bulk deq ###
> legacy APIs: SP/SC: bulk (size: 8): 7.00
> legacy APIs: MP/MC: bulk (size: 8): 7.00
> 
> ### Testing using two physical cores ###
> legacy APIs: SP/SC: bulk (size: 8): 74.82
> legacy APIs: MP/MC: bulk (size: 8): 96.45
> legacy APIs: SP/SC: bulk (size: 32): 22.97
> legacy APIs: MP/MC: bulk (size: 32): 32.52
> 
> ### Testing using all slave nodes ###
> 
> Bulk enq/dequeue count on size 8
> Core [8] count = 283928
> Core [9] count = 283927
> Total count (size: 8): 567855
> 
> Bulk enq/dequeue count on size 32
> Core [8] count = 223916
> Core [9] count = 223915
> Total count (size: 32): 1015686
> 
> ### Testing single element enq/deq ###
> elem APIs: element size 16B: SP/SC: single: 267.65
> elem APIs: element size 16B: MP/MC: single: 439.06
> 
> ### Testing burst enq/deq ###
> elem APIs: element size 16B: SP/SC: burst (size: 8): 302.44
> elem APIs: element size 16B: SP/SC: burst (size: 32): 466.31
> elem APIs: element size 16B: MP/MC: burst (size: 8): 502.51
> elem APIs: element size 16B: MP/MC: burst (size: 32): 695.81
> 
> ### Testing bulk enq/deq ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 295.15
> elem APIs: element size 16B: SP/SC: bulk (size: 32): 462.77
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 496.89
> elem APIs: element size 16B: MP/MC: bulk (size: 32): 690.46
> 
> ### Testing empty bulk deq ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 7.50
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 7.44
> 
> ### Testing using two physical cores ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 65.85
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 103.80
> elem APIs: element size 16B: SP/SC: bulk (size: 32): 23.27
> elem APIs: element size 16B: MP/MC: bulk (size: 32): 31.17
> 
> ### Testing using all slave nodes ###
> 
> Bulk enq/dequeue count on size 8
> Core [8] count = 304223
> Core [9] count = 304221
> Total count (size: 8): 608444
> 
> Bulk enq/dequeue count on size 32
> Core [8] count = 214856
> Core [9] count = 214855
> Total count (size: 32): 1038155
> Test OK
> RTE>>quit
> 
> 

Encountered a couple of different build errors with these patches on my 
Power 9 system:

In file included from ../lib/librte_ring/rte_ring.h:534,
                  from ../drivers/mempool/ring/rte_mempool_ring.c:9:
../lib/librte_ring/rte_ring_hts_generic.h: In function 
‘__rte_ring_hts_update_tail’:
../lib/librte_ring/rte_ring_hts_generic.h:61:2: warning: implicit 
declaration of function ‘RTE_ASSERT’; did you mean ‘RTE_STR’? 
[-Wimplicit-function-declaration]
   RTE_ASSERT(n >= num);
   ^~~~~~~~~~
   RTE_STR

Fixed by adding "#include <rte_debug.h>" to rte_ring.h.

Also encountered:

In file included from ../app/test/test_ring_hts_stress.c:5:
../app/test/test_ring_stress.h: In function ‘check_updt_elem’:
../app/test/test_ring_stress.h:162:9: error: unknown type name 
‘rte_spinlock_t’
   static rte_spinlock_t dump_lock;
          ^~~~~~~~~~~~~~
../app/test/test_ring_stress.h:166:4: warning: implicit declaration of 
function ‘rte_spinlock_lock’; did you mean ‘rte_calloc_socket’? 
[-Wimplicit-function-declaration]
     rte_spinlock_lock(&dump_lock);
     ^~~~~~~~~~~~~~~~~
     rte_calloc_socket
../app/test/test_ring_stress.h:166:4: warning: nested extern declaration 
of ‘rte_spinlock_lock’ [-Wnested-externs]
../app/test/test_ring_stress.h:172:4: warning: implicit declaration of 
function ‘rte_spinlock_unlock’; did you mean ‘pthread_rwlock_unlock’? 
[-Wimplicit-function-declaration]
     rte_spinlock_unlock(&dump_lock);
     ^~~~~~~~~~~~~~~~~~~
     pthread_rwlock_unlock
../app/test/test_ring_stress.h:172:4: warning: nested extern declaration 
of ‘rte_spinlock_unlock’ [-Wnested-externs]

Fixed by adding "#include <rte_spinlock.h>" to test_ring_stress.h.

Autoperf test results
---------------------
RTE>>ring_perf_autotest // DPDK 20.02, without patch

### Testing single element enq/deq ###
legacy APIs: SP/SC: single: 42.14
legacy APIs: MP/MC: single: 56.26

### Testing burst enq/deq ###
legacy APIs: SP/SC: burst (size: 8): 43.59
legacy APIs: SP/SC: burst (size: 32): 49.87
legacy APIs: MP/MC: burst (size: 8): 58.43
legacy APIs: MP/MC: burst (size: 32): 65.68

### Testing bulk enq/deq ###
legacy APIs: SP/SC: bulk (size: 8): 43.59
legacy APIs: SP/SC: bulk (size: 32): 49.85
legacy APIs: MP/MC: bulk (size: 8): 58.43
legacy APIs: MP/MC: bulk (size: 32): 65.60

### Testing empty bulk deq ###
legacy APIs: SP/SC: bulk (size: 8): 7.16
legacy APIs: MP/MC: bulk (size: 8): 7.16

### Testing using two hyperthreads ###
legacy APIs: SP/SC: bulk (size: 8): 12.46
legacy APIs: MP/MC: bulk (size: 8): 16.20
legacy APIs: SP/SC: bulk (size: 32): 3.21
legacy APIs: MP/MC: bulk (size: 32): 3.73

### Testing using two physical cores ###
legacy APIs: SP/SC: bulk (size: 8): 33.34
legacy APIs: MP/MC: bulk (size: 8): 37.99
legacy APIs: SP/SC: bulk (size: 32): 10.19
legacy APIs: MP/MC: bulk (size: 32): 11.90

### Testing using two NUMA nodes ###
legacy APIs: SP/SC: bulk (size: 8): 49.50
legacy APIs: MP/MC: bulk (size: 8): 63.65
legacy APIs: SP/SC: bulk (size: 32): 12.49
legacy APIs: MP/MC: bulk (size: 32): 23.53

### Testing using all slave nodes ###

Bulk enq/dequeue count on size 8
Core [4] count = 5604
Core [5] count = 5563
Core [6] count = 5576
Core [7] count = 5630
Core [8] count = 5643
Core [9] count = 5727
Core [10] count = 5698
Core [11] count = 5711
Core [64] count = 5259
Core [65] count = 5322
Core [66] count = 5321
Core [67] count = 5310
Core [68] count = 4350
Core [69] count = 4455
Core [70] count = 4546
Core [71] count = 4475
Total count (size: 8): 84190

Bulk enq/dequeue count on size 32
Core [4] count = 5543
Core [5] count = 5555
Core [6] count = 5596
Core [7] count = 5584
Core [8] count = 5613
Core [9] count = 5686
Core [10] count = 5689
Core [11] count = 5677
Core [64] count = 5228
Core [65] count = 5389
Core [66] count = 5406
Core [67] count = 5359
Core [68] count = 4554
Core [69] count = 4673
Core [70] count = 4675
Core [71] count = 4644
Total count (size: 32): 169061

### Testing single element enq/deq ###
elem APIs: element size 16B: SP/SC: single: 42.84
elem APIs: element size 16B: MP/MC: single: 56.77

### Testing burst enq/deq ###
elem APIs: element size 16B: SP/SC: burst (size: 8): 44.98
elem APIs: element size 16B: SP/SC: burst (size: 32): 59.16
elem APIs: element size 16B: MP/MC: burst (size: 8): 60.58
elem APIs: element size 16B: MP/MC: burst (size: 32): 74.75

### Testing bulk enq/deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 45.01
elem APIs: element size 16B: SP/SC: bulk (size: 32): 59.08
elem APIs: element size 16B: MP/MC: bulk (size: 8): 60.58
elem APIs: element size 16B: MP/MC: bulk (size: 32): 74.76

### Testing empty bulk deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 7.16
elem APIs: element size 16B: MP/MC: bulk (size: 8): 7.16

### Testing using two hyperthreads ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 12.18
elem APIs: element size 16B: MP/MC: bulk (size: 8): 15.44
elem APIs: element size 16B: SP/SC: bulk (size: 32): 3.22
elem APIs: element size 16B: MP/MC: bulk (size: 32): 3.97

### Testing using two physical cores ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 42.07
elem APIs: element size 16B: MP/MC: bulk (size: 8): 44.50
elem APIs: element size 16B: SP/SC: bulk (size: 32): 10.73
elem APIs: element size 16B: MP/MC: bulk (size: 32): 11.73

### Testing using two NUMA nodes ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 49.55
elem APIs: element size 16B: MP/MC: bulk (size: 8): 93.10
elem APIs: element size 16B: SP/SC: bulk (size: 32): 12.33
elem APIs: element size 16B: MP/MC: bulk (size: 32): 27.10

### Testing using all slave nodes ###

Bulk enq/dequeue count on size 8
Core [4] count = 5489
Core [5] count = 5559
Core [6] count = 5566
Core [7] count = 5577
Core [8] count = 5645
Core [9] count = 5699
Core [10] count = 5695
Core [11] count = 5733
Core [64] count = 5202
Core [65] count = 5284
Core [66] count = 5319
Core [67] count = 5349
Core [68] count = 4331
Core [69] count = 4465
Core [70] count = 4484
Core [71] count = 4439
Total count (size: 8): 83836

Bulk enq/dequeue count on size 32
Core [4] count = 5567
Core [5] count = 5492
Core [6] count = 5517
Core [7] count = 5515
Core [8] count = 5593
Core [9] count = 5650
Core [10] count = 5694
Core [11] count = 5665
Core [64] count = 5236
Core [65] count = 5319
Core [66] count = 5333
Core [67] count = 5304
Core [68] count = 4608
Core [69] count = 4669
Core [70] count = 4690
Core [71] count = 4654
Total count (size: 32): 168342
Test OK
---------------------
RTE>>ring_perf_autotest // DPDK 20.02, with patch applied

### Testing single element enq/deq ###
legacy APIs: SP/SC: single: 42.18
legacy APIs: MP/MC: single: 56.26

### Testing burst enq/deq ###
legacy APIs: SP/SC: burst (size: 8): 43.60
legacy APIs: SP/SC: burst (size: 32): 49.86
legacy APIs: MP/MC: burst (size: 8): 58.43
legacy APIs: MP/MC: burst (size: 32): 65.67

### Testing bulk enq/deq ###
legacy APIs: SP/SC: bulk (size: 8): 43.59
legacy APIs: SP/SC: bulk (size: 32): 49.86
legacy APIs: MP/MC: bulk (size: 8): 58.43
legacy APIs: MP/MC: bulk (size: 32): 65.63

### Testing empty bulk deq ###
legacy APIs: SP/SC: bulk (size: 8): 7.16
legacy APIs: MP/MC: bulk (size: 8): 7.16

### Testing using two hyperthreads ###
legacy APIs: SP/SC: bulk (size: 8): 12.07
legacy APIs: MP/MC: bulk (size: 8): 16.24
legacy APIs: SP/SC: bulk (size: 32): 3.20
legacy APIs: MP/MC: bulk (size: 32): 3.72

### Testing using two physical cores ###
legacy APIs: SP/SC: bulk (size: 8): 33.41
legacy APIs: MP/MC: bulk (size: 8): 38.01
legacy APIs: SP/SC: bulk (size: 32): 10.23
legacy APIs: MP/MC: bulk (size: 32): 11.90

### Testing using two NUMA nodes ###
legacy APIs: SP/SC: bulk (size: 8): 49.27
legacy APIs: MP/MC: bulk (size: 8): 64.80
legacy APIs: SP/SC: bulk (size: 32): 12.45
legacy APIs: MP/MC: bulk (size: 32): 23.11

### Testing using all slave nodes ###

Bulk enq/dequeue count on size 8
Core [4] count = 5637
Core [5] count = 5599
Core [6] count = 5623
Core [7] count = 5627
Core [8] count = 5723
Core [9] count = 5758
Core [10] count = 5714
Core [11] count = 5724
Core [64] count = 5310
Core [65] count = 5438
Core [66] count = 5448
Core [67] count = 5374
Core [68] count = 4441
Core [69] count = 4550
Core [70] count = 4550
Core [71] count = 4558
Total count (size: 8): 85074

Bulk enq/dequeue count on size 32
Core [4] count = 5608
Core [5] count = 5623
Core [6] count = 5590
Core [7] count = 5658
Core [8] count = 5680
Core [9] count = 5738
Core [10] count = 5692
Core [11] count = 5712
Core [64] count = 5273
Core [65] count = 5363
Core [66] count = 5341
Core [67] count = 5349
Core [68] count = 4591
Core [69] count = 4673
Core [70] count = 4698
Core [71] count = 4687
Total count (size: 32): 170350

### Testing single element enq/deq ###
elem APIs: element size 16B: SP/SC: single: 42.82
elem APIs: element size 16B: MP/MC: single: 56.79

### Testing burst enq/deq ###
elem APIs: element size 16B: SP/SC: burst (size: 8): 44.99
elem APIs: element size 16B: SP/SC: burst (size: 32): 59.00
elem APIs: element size 16B: MP/MC: burst (size: 8): 60.59
elem APIs: element size 16B: MP/MC: burst (size: 32): 74.78

### Testing bulk enq/deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 44.97
elem APIs: element size 16B: SP/SC: bulk (size: 32): 58.91
elem APIs: element size 16B: MP/MC: bulk (size: 8): 60.60
elem APIs: element size 16B: MP/MC: bulk (size: 32): 74.61

### Testing empty bulk deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 7.16
elem APIs: element size 16B: MP/MC: bulk (size: 8): 7.16

### Testing using two hyperthreads ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 12.18
elem APIs: element size 16B: MP/MC: bulk (size: 8): 15.41
elem APIs: element size 16B: SP/SC: bulk (size: 32): 3.19
elem APIs: element size 16B: MP/MC: bulk (size: 32): 4.06

### Testing using two physical cores ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 42.08
elem APIs: element size 16B: MP/MC: bulk (size: 8): 44.52
elem APIs: element size 16B: SP/SC: bulk (size: 32): 10.73
elem APIs: element size 16B: MP/MC: bulk (size: 32): 12.39

### Testing using two NUMA nodes ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 49.65
elem APIs: element size 16B: MP/MC: bulk (size: 8): 93.27
elem APIs: element size 16B: SP/SC: bulk (size: 32): 12.38
elem APIs: element size 16B: MP/MC: bulk (size: 32): 27.19

### Testing using all slave nodes ###

Bulk enq/dequeue count on size 8
Core [4] count = 5629
Core [5] count = 5585
Core [6] count = 5676
Core [7] count = 5604
Core [8] count = 5639
Core [9] count = 5731
Core [10] count = 5694
Core [11] count = 5707
Core [64] count = 5254
Core [65] count = 5331
Core [66] count = 5340
Core [67] count = 5355
Core [68] count = 4339
Core [69] count = 4481
Core [70] count = 4504
Core [71] count = 4507
Total count (size: 8): 84376

Bulk enq/dequeue count on size 32
Core [4] count = 5518
Core [5] count = 5493
Core [6] count = 5559
Core [7] count = 5484
Core [8] count = 5623
Core [9] count = 5669
Core [10] count = 5661
Core [11] count = 5658
Core [64] count = 5207
Core [65] count = 5305
Core [66] count = 5273
Core [67] count = 5303
Core [68] count = 4542
Core [69] count = 4682
Core [70] count = 4672
Core [71] count = 4609
Total count (size: 32): 168634
Test OK


Dave
  
Ananyev, Konstantin March 20, 2020, 4:45 p.m. UTC | #12
> >
> >
> > I tested on an arm64 HW. The former section is without the
> > patch(20.02) and later one with this patch.
> > I agree with Konstantin that getting more platform tests will be good
> > early so that we can focus on the approach
> > to avoid back and forth latter.
> >
> >
> > RTE>>ring_perf_autotest // without path
> >
> > ### Testing single element enq/deq ###
> > legacy APIs: SP/SC: single: 289.78
> > legacy APIs: MP/MC: single: 516.20
> >
> > ### Testing burst enq/deq ###
> > legacy APIs: SP/SC: burst (size: 8): 312.88
> > legacy APIs: SP/SC: burst (size: 32): 426.72
> > legacy APIs: MP/MC: burst (size: 8): 510.95
> > legacy APIs: MP/MC: burst (size: 32): 702.01
> >
> > ### Testing bulk enq/deq ###
> > legacy APIs: SP/SC: bulk (size: 8): 306.74
> > legacy APIs: SP/SC: bulk (size: 32): 411.56
> > legacy APIs: MP/MC: bulk (size: 8): 501.32
> > legacy APIs: MP/MC: bulk (size: 32): 693.07
> >
> > ### Testing empty bulk deq ###
> > legacy APIs: SP/SC: bulk (size: 8): 7.00
> > legacy APIs: MP/MC: bulk (size: 8): 7.00
> >
> > ### Testing using two physical cores ###
> > legacy APIs: SP/SC: bulk (size: 8): 74.36
> > legacy APIs: MP/MC: bulk (size: 8): 110.18
> > legacy APIs: SP/SC: bulk (size: 32): 23.04
> > legacy APIs: MP/MC: bulk (size: 32): 32.29
> >
> > ### Testing using all slave nodes ##
> > Bulk enq/dequeue count on size 8
> > Core [8] count = 293741
> > Core [9] count = 293741
> > Total count (size: 8): 587482
> >
> > Bulk enq/dequeue count on size 32
> > Core [8] count = 244909
> > Core [9] count = 244909
> > Total count (size: 32): 1077300
> >
> > ### Testing single element enq/deq ###
> > elem APIs: element size 16B: SP/SC: single: 255.37
> > elem APIs: element size 16B: MP/MC: single: 456.68
> >
> > ### Testing burst enq/deq ###
> > elem APIs: element size 16B: SP/SC: burst (size: 8): 291.99
> > elem APIs: element size 16B: SP/SC: burst (size: 32): 456.25
> > elem APIs: element size 16B: MP/MC: burst (size: 8): 497.77
> > elem APIs: element size 16B: MP/MC: burst (size: 32): 680.87
> >
> > ### Testing bulk enq/deq ###
> > elem APIs: element size 16B: SP/SC: bulk (size: 8): 284.40
> > elem APIs: element size 16B: SP/SC: bulk (size: 32): 453.17
> > elem APIs: element size 16B: MP/MC: bulk (size: 8): 485.77
> > elem APIs: element size 16B: MP/MC: bulk (size: 32): 675.08
> >
> > ### Testing empty bulk deq ###
> > elem APIs: element size 16B: SP/SC: bulk (size: 8): 8.00
> > elem APIs: element size 16B: MP/MC: bulk (size: 8): 7.00
> >
> > ### Testing using two physical cores ###
> > elem APIs: element size 16B: SP/SC: bulk (size: 8): 74.45
> > elem APIs: element size 16B: MP/MC: bulk (size: 8): 105.91
> > elem APIs: element size 16B: SP/SC: bulk (size: 32): 22.92
> > elem APIs: element size 16B: MP/MC: bulk (size: 32): 31.55
> >
> > ### Testing using all slave nodes ###
> >
> > Bulk enq/dequeue count on size 8
> > Core [8] count = 308724
> > Core [9] count = 308723
> > Total count (size: 8): 617447
> >
> > Bulk enq/dequeue count on size 32
> > Core [8] count = 214269
> > Core [9] count = 214269
> > Total count (size: 32): 1045985
> >
> > RTE>>ring_perf_autotest // with patch
> >
> > ### Testing single element enq/deq ###
> > legacy APIs: SP/SC: single: 289.78
> > legacy APIs: MP/MC: single: 475.76
> >
> > ### Testing burst enq/deq ###
> > legacy APIs: SP/SC: burst (size: 8): 323.91
> > legacy APIs: SP/SC: burst (size: 32): 424.60
> > legacy APIs: MP/MC: burst (size: 8): 523.00
> > legacy APIs: MP/MC: burst (size: 32): 717.09
> >
> > ### Testing bulk enq/deq ###
> > legacy APIs: SP/SC: bulk (size: 8): 317.74
> > legacy APIs: SP/SC: bulk (size: 32): 413.57
> > legacy APIs: MP/MC: bulk (size: 8): 512.89
> > legacy APIs: MP/MC: bulk (size: 32): 712.45
> >
> > ### Testing empty bulk deq ###
> > legacy APIs: SP/SC: bulk (size: 8): 7.00
> > legacy APIs: MP/MC: bulk (size: 8): 7.00
> >
> > ### Testing using two physical cores ###
> > legacy APIs: SP/SC: bulk (size: 8): 74.82
> > legacy APIs: MP/MC: bulk (size: 8): 96.45
> > legacy APIs: SP/SC: bulk (size: 32): 22.97
> > legacy APIs: MP/MC: bulk (size: 32): 32.52
> >
> > ### Testing using all slave nodes ###
> >
> > Bulk enq/dequeue count on size 8
> > Core [8] count = 283928
> > Core [9] count = 283927
> > Total count (size: 8): 567855
> >
> > Bulk enq/dequeue count on size 32
> > Core [8] count = 223916
> > Core [9] count = 223915
> > Total count (size: 32): 1015686
> >
> > ### Testing single element enq/deq ###
> > elem APIs: element size 16B: SP/SC: single: 267.65
> > elem APIs: element size 16B: MP/MC: single: 439.06
> >
> > ### Testing burst enq/deq ###
> > elem APIs: element size 16B: SP/SC: burst (size: 8): 302.44
> > elem APIs: element size 16B: SP/SC: burst (size: 32): 466.31
> > elem APIs: element size 16B: MP/MC: burst (size: 8): 502.51
> > elem APIs: element size 16B: MP/MC: burst (size: 32): 695.81
> >
> > ### Testing bulk enq/deq ###
> > elem APIs: element size 16B: SP/SC: bulk (size: 8): 295.15
> > elem APIs: element size 16B: SP/SC: bulk (size: 32): 462.77
> > elem APIs: element size 16B: MP/MC: bulk (size: 8): 496.89
> > elem APIs: element size 16B: MP/MC: bulk (size: 32): 690.46
> >
> > ### Testing empty bulk deq ###
> > elem APIs: element size 16B: SP/SC: bulk (size: 8): 7.50
> > elem APIs: element size 16B: MP/MC: bulk (size: 8): 7.44
> >
> > ### Testing using two physical cores ###
> > elem APIs: element size 16B: SP/SC: bulk (size: 8): 65.85
> > elem APIs: element size 16B: MP/MC: bulk (size: 8): 103.80
> > elem APIs: element size 16B: SP/SC: bulk (size: 32): 23.27
> > elem APIs: element size 16B: MP/MC: bulk (size: 32): 31.17
> >
> > ### Testing using all slave nodes ###
> >
> > Bulk enq/dequeue count on size 8
> > Core [8] count = 304223
> > Core [9] count = 304221
> > Total count (size: 8): 608444
> >
> > Bulk enq/dequeue count on size 32
> > Core [8] count = 214856
> > Core [9] count = 214855
> > Total count (size: 32): 1038155
> > Test OK
> > RTE>>quit
> >
> >
> 
> Encountered a couple of different build errors with these patches on my
> Power 9 system:
> 
> In file included from ../lib/librte_ring/rte_ring.h:534,
>                   from ../drivers/mempool/ring/rte_mempool_ring.c:9:
> ../lib/librte_ring/rte_ring_hts_generic.h: In function
> ‘__rte_ring_hts_update_tail’:
> ../lib/librte_ring/rte_ring_hts_generic.h:61:2: warning: implicit
> declaration of function ‘RTE_ASSERT’; did you mean ‘RTE_STR’?
> [-Wimplicit-function-declaration]
>    RTE_ASSERT(n >= num);
>    ^~~~~~~~~~
>    RTE_STR
> 
> Fixed by adding "#include <rte_debug.h>" to rte_ring.h.
> 
> Also encountered:
> 
> In file included from ../app/test/test_ring_hts_stress.c:5:
> ../app/test/test_ring_stress.h: In function ‘check_updt_elem’:
> ../app/test/test_ring_stress.h:162:9: error: unknown type name
> ‘rte_spinlock_t’
>    static rte_spinlock_t dump_lock;
>           ^~~~~~~~~~~~~~
> ../app/test/test_ring_stress.h:166:4: warning: implicit declaration of
> function ‘rte_spinlock_lock’; did you mean ‘rte_calloc_socket’?
> [-Wimplicit-function-declaration]
>      rte_spinlock_lock(&dump_lock);
>      ^~~~~~~~~~~~~~~~~
>      rte_calloc_socket
> ../app/test/test_ring_stress.h:166:4: warning: nested extern declaration
> of ‘rte_spinlock_lock’ [-Wnested-externs]
> ../app/test/test_ring_stress.h:172:4: warning: implicit declaration of
> function ‘rte_spinlock_unlock’; did you mean ‘pthread_rwlock_unlock’?
> [-Wimplicit-function-declaration]
>      rte_spinlock_unlock(&dump_lock);
>      ^~~~~~~~~~~~~~~~~~~
>      pthread_rwlock_unlock
> ../app/test/test_ring_stress.h:172:4: warning: nested extern declaration
> of ‘rte_spinlock_unlock’ [-Wnested-externs]
> 
> Fixed by adding "#include <rte_spinlock.h>" to test_ring_stress.h.

Thanks a lot for trying it, guys.
About the compilation issues - will address them in V1.
Konstantin
  
Honnappa Nagarahalli March 25, 2020, 8:43 p.m. UTC | #13
<snip>

> Subject: [dpdk-dev] [RFC 0/6] New sync modes for ring
> 
> Upfront note - that RFC is not a complete patch.
> It introduces an ABI breakage, plus it doesn't update ring_elem code properly,
As per the current rules, these changes (in the current form) will be accepted only for 20.11 release. How do we address this for immediate requirements like RCU defer APIs? 
I suggest that we move forward with my RFC (taking into consideration your feedback) to make progress on RCU APIs.

> etc.
> I plan to deal with all these things in later versions.
> Right now I seek an initial feedback about proposed ideas.
> Would also ask people to repeat performance tests (see below) on their
> platforms to confirm the impact.
> 
> More and more customers use(/try to use) DPDK based apps within
> overcommitted systems (multiple acttive threads over same pysical cores):
> VM, container deployments, etc.
> One quite common problem they hit: Lock-Holder-Preemption with rte_ring.
> LHP is quite a common problem for spin-based sync primitives (spin-locks, etc.)
> on overcommitted systems.
> The situation gets much worse when some sort of fair-locking technique is
> used (ticket-lock, etc.).
> As now not only lock-owner but also lock-waiters scheduling order matters a
> lot.
> This is a well-known problem for kernel within VMs:
> http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> The problem with rte_ring is that while head accusion is sort of un-fair locking,
> waiting on tail is very similar to ticket lock schema - tail has to be updated in
> particular order.
> That makes current rte_ring implementation to perform really pure on some
> overcommited scenarios.
> While it is probably not possible to completely resolve this problem in
> userspace only (without some kernel communication/intervention), removing
> fairness in tail update can mitigate it significantly.
> So this RFC proposes two new optional ring synchronization modes:
> 1) Head/Tail Sync (HTS) mode
> In that mode enqueue/dequeue operation is fully serialized:
>     only one thread at a time is allowed to perform given op.
>     As another enhancement provide ability to split enqueue/dequeue
>     operation into two phases:
>       - enqueue/dequeue start
>       - enqueue/dequeue finish
>     That allows user to inspect objects in the ring without removing
>     them from it (aka MT safe peek).
IMO, this will not address the problem described above. For example, when a producer updates the head and gets scheduled out, other producers have to spin. The problem is arguably worse, since in the non-HTS case moving of the head and copying of the ring elements can happen in parallel between the producers (similarly for consumers).
IMO, HTS should not be a configurable flag. For the RCU requirement, an MP enqueue and an HTS dequeue are required.

> 2) Relaxed Tail Sync (RTS)
> The main difference from original MP/MC algorithm is that tail value is
> increased not by every thread that finished enqueue/dequeue, but only by the
> last one.
> That allows threads to avoid spinning on ring tail value, leaving actual tail value
> change to the last thread in the update queue.
This can be a configurable flag on the ring.
I am not sure this completely solves the problem you have stated above. Updating the count from all intermediate threads is still required to update the value of the head. But yes, it reduces the severity of the problem by not enforcing the order in which the tail is updated.
I also think it introduces a problem on the other side of the ring, because the tail is not updated soon enough (the other side has to wait longer for the elements to become available). It also introduces another configuration parameter (HTD_MAX_DEF) which users have to deal with.
Users still have to implement the current hypervisor-related solutions.
IMO, we should run the benchmark for this on an overcommitted setup to understand the benefits.

> 
> Test results on IA (see below) show significant improvements for average
> enqueue/dequeue op times on overcommitted systems.
> For 'classic' DPDK deployments (one thread per core) original MP/MC
> algorithm still shows best numbers, though for 64-bit target RTS numbers are
> not that far away.
> Numbers were produced by ring_stress_*autotest (first patch in these series).
> 
> X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> DEQ+ENQ average cycles/obj
> 
>                                                 MP/MC      HTS     RTS
> 1thread@1core(--lcores=6-7)                     8.00       8.15    8.99
> 2thread@2core(--lcores=6-8)                     19.14      19.61   20.35
> 4thread@4core(--lcores=6-10)                    29.43      29.79   31.82
> 8thread@8core(--lcores=6-14)                    110.59     192.81  119.50
> 16thread@16core(--lcores=6-22)                  461.03     813.12  495.59
> 32thread/@32core(--lcores='6-22,55-70')         982.90     1972.38 1160.51
> 
> 2thread@1core(--lcores='6,(10-11)@7'            20140.50   23.58   25.14
> 4thread@2core(--lcores='6,(10-11)@7,(20-21)@8'  153680.60  76.88   80.05
> 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  280314.32  294.72  318.79
> 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 643176.59  1144.02 1175.14
> 32thread@2core(--lcores='6,(10-25)@7,(30-45)@8' 4264238.80 4627.48 4892.68
> 
> 8thread@2core(--lcores='6,(10-17)@(7,8))'       321085.98  298.59  307.47
> 16thread@4core(--lcores='6,(20-35)@(7-10))'     1900705.61 575.35  678.29
> 32thread@4core(--lcores='6,(20-51)@(7-10))'     5510445.85 2164.36 2714.12
> 
> i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> DEQ+ENQ average cycles/obj
> 
>                                                 MP/MC      HTS     RTS
> 1thread@1core(--lcores=6-7)                     7.85       12.13   11.31
> 2thread@2core(--lcores=6-8)                     17.89      24.52   21.86
> 8thread@8core(--lcores=6-14)                    32.58      354.20  54.58
> 32thread/@32core(--lcores='6-22,55-70')         813.77     6072.41 2169.91
> 
> 2thread@1core(--lcores='6,(10-11)@7'            16095.00   36.06   34.74
> 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  1140354.54 346.61  361.57
> 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 1920417.86 1314.90 1416.65
> 
> 8thread@2core(--lcores='6,(10-17)@(7,8))'       594358.61  332.70  357.74
> 32thread@4core(--lcores='6,(20-51)@(7-10))'     5319896.86 2836.44 3028.87
> 
> Konstantin Ananyev (6):
>   test/ring: add contention stress test
>   ring: rework ring layout to allow new sync schemes
>   ring: introduce RTS ring mode
>   test/ring: add contention stress test for RTS ring
>   ring: introduce HTS ring mode
>   test/ring: add contention stress test for HTS ring
> 
>  app/test/Makefile                      |   3 +
>  app/test/meson.build                   |   3 +
>  app/test/test_pdump.c                  |   6 +-
>  app/test/test_ring_hts_stress.c        |  28 ++
>  app/test/test_ring_rts_stress.c        |  28 ++
>  app/test/test_ring_stress.c            |  27 ++
>  app/test/test_ring_stress.h            | 477 +++++++++++++++++++
>  lib/librte_pdump/rte_pdump.c           |   2 +-
>  lib/librte_port/rte_port_ring.c        |  12 +-
>  lib/librte_ring/Makefile               |   4 +-
>  lib/librte_ring/meson.build            |   4 +-
>  lib/librte_ring/rte_ring.c             |  84 +++-
>  lib/librte_ring/rte_ring.h             | 619 +++++++++++++++++++++++--
>  lib/librte_ring/rte_ring_elem.h        |   8 +-
>  lib/librte_ring/rte_ring_hts_generic.h | 228 +++++++++
> lib/librte_ring/rte_ring_rts_generic.h | 240 ++++++++++
>  16 files changed, 1721 insertions(+), 52 deletions(-)
>  create mode 100644 app/test/test_ring_hts_stress.c
>  create mode 100644 app/test/test_ring_rts_stress.c
>  create mode 100644 app/test/test_ring_stress.c
>  create mode 100644 app/test/test_ring_stress.h
>  create mode 100644 lib/librte_ring/rte_ring_hts_generic.h
>  create mode 100644 lib/librte_ring/rte_ring_rts_generic.h
> 
> --
> 2.17.1
  
Ananyev, Konstantin March 26, 2020, 1:50 a.m. UTC | #14
> 
> <snip>
> 
> > Subject: [dpdk-dev] [RFC 0/6] New sync modes for ring
> >
> > Upfront note - that RFC is not a complete patch.
> > It introduces an ABI breakage, plus it doesn't update ring_elem code properly,
> As per the current rules, these changes (in the current form) will be accepted only for 20.11 release. How do we address this for immediate
> requirements like RCU defer APIs?

I think I found a way to introduce these new modes without API/ABI breakage.
Working on v1 right now. Plan to submit it by the end of this week/start of next one.

> I suggest that we move forward with my RFC (taking into consideration your feedback) to make progress on RCU APIs.
> 
> > etc.
> > I plan to deal with all these things in later versions.
> > Right now I seek an initial feedback about proposed ideas.
> > Would also ask people to repeat performance tests (see below) on their
> > platforms to confirm the impact.
> >
> > More and more customers use(/try to use) DPDK based apps within
> > overcommitted systems (multiple acttive threads over same pysical cores):
> > VM, container deployments, etc.
> > One quite common problem they hit: Lock-Holder-Preemption with rte_ring.
> > LHP is quite a common problem for spin-based sync primitives (spin-locks, etc.)
> > on overcommitted systems.
> > The situation gets much worse when some sort of fair-locking technique is
> > used (ticket-lock, etc.).
> > As now not only lock-owner but also lock-waiters scheduling order matters a
> > lot.
> > This is a well-known problem for kernel within VMs:
> > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > The problem with rte_ring is that while head accusion is sort of un-fair locking,
> > waiting on tail is very similar to ticket lock schema - tail has to be updated in
> > particular order.
> > That makes current rte_ring implementation to perform really pure on some
> > overcommited scenarios.
> > While it is probably not possible to completely resolve this problem in
> > userspace only (without some kernel communication/intervention), removing
> > fairness in tail update can mitigate it significantly.
> > So this RFC proposes two new optional ring synchronization modes:
> > 1) Head/Tail Sync (HTS) mode
> > In that mode enqueue/dequeue operation is fully serialized:
> >     only one thread at a time is allowed to perform given op.
> >     As another enhancement provide ability to split enqueue/dequeue
> >     operation into two phases:
> >       - enqueue/dequeue start
> >       - enqueue/dequeue finish
> >     That allows user to inspect objects in the ring without removing
> >     them from it (aka MT safe peek).
> IMO, this will not address the problem described above.

It does, please see the results produced by ring_stress_*autotest below.
Let's say for test-case 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8' it shows the
avg number of cycles per object for enqueue/dequeue:
MP/MC: 280314.32
HTS:         294.72
RTS:         318.79

A customer who tried it reported a similar level of improvement.
Actually, if you have time, it would be very interesting to see what the numbers are on ARM boxes.
To reproduce, just:
$cat ring_tests_u4
ring_stress_autotest
ring_stress_hts_autotest
ring_stress_rts_autotest

/app/test/dpdk-test --lcores='6,(10-13)@7,(20-23)@8'  -n 4 < ring_tests_u4 2>&1 | tee res1

Then look at the ' AGGREGATE' stats.
Right now it is a bit too verbose, so probably the easiest way to extract the same numbers quickly is:
grep 'cycles/obj'  res1 | grep 'cycles/obj' | cat -n | awk '{if ($(1)%9==0) print $(NF);}'
280314.32
1057833.55
294.72
480.10
318.79
461.52

First 2 numbers will be for MP/MC, next 2 for HTS, last 2 for RTS.

> For ex: when a producer updates the head and gets scheduled out, other producers
> have to spin.

Sure, as I wrote in original cover letter:
" While it is probably not possible to completely resolve this problem in
userspace only (without some kernel communication/intervention),
removing fairness in tail update can mitigate it significantly."
Results from the ring_stress_*_autotest confirm that.

> The problem is probably worse as with non-HTS case moving of the head and copying of the ring elements can happen in
> parallel between the producers (similarly for consumers).

Yes, as we serialize the ring, we remove the possibility of simultaneous copies.
That's why for 'normal' cases (one thread per core) the original MP/MC is usually faster.
Though in overcommitted scenarios current MP/MC performance degrades dramatically.
The main problem with the current MP/MC implementation is that the tail update
has to be done in strict order (a sort of fair locking scheme).
Which means that we get a much worse LHP manifestation
than when we use unfair schemes.
With a serialized ring (HTS) we remove that ordering completely
(same idea as the switch from fair to unfair locking for PV spin-locks).
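To illustrate the idea only (a rough sketch, not the patch code - the type and
function names below are made up for the example): the producer head and tail
can live in one 64-bit word, and a new enqueue may move the head only when
head == tail, i.e. when no other enqueue is in flight:

#include <stdint.h>
#include <stdbool.h>

union hts_pos {
        uint64_t raw;
        struct {
                uint32_t head;
                uint32_t tail;
        } pos;
};

/* try to reserve n slots; fails while another op is still in flight
 * (a real implementation would wait and retry instead of returning) */
static bool
hts_try_move_head(union hts_pos *p, uint32_t n)
{
        union hts_pos op, np;

        op.raw = __atomic_load_n(&p->raw, __ATOMIC_ACQUIRE);
        if (op.pos.head != op.pos.tail)
                return false;

        np = op;
        np.pos.head += n;

        return __atomic_compare_exchange_n(&p->raw, &op.raw, np.raw,
                        0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* the matching 'finish' simply publishes tail = head, re-opening the ring */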

> IMO, HTS should not be a configurable flag. 

Why?

> In RCU requirement, a MP enqueue and HTS dequeue are required.

This is supported; the user can specify different modes for consumer and producer:
(0 | RING_F_MC_HTS_DEQ).
Then it is up to the user either to call the generic rte_ring_enqueue/rte_ring_dequeue,
or to select the mode explicitly by function name:
rte_ring_mp_enqueue_bulk/rte_ring_hts_dequeue_bulk.
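For example, something along these lines (a rough usage sketch only, based on the
flag and function names mentioned in this thread - exact signatures may differ in
the final version):

#include <rte_ring.h>

static struct rte_ring *
make_ring(void)
{
        /* MP producer (default) + HTS consumer, as discussed above */
        return rte_ring_create("rcu_defer", 1024, SOCKET_ID_ANY,
                        0 | RING_F_MC_HTS_DEQ);
}

static unsigned int
producer(struct rte_ring *r, void *objs[], unsigned int n)
{
        /* the generic call resolves to the MP enqueue for this ring */
        return rte_ring_enqueue_bulk(r, objs, n, NULL);
}

static unsigned int
consumer(struct rte_ring *r, void *objs[], unsigned int n)
{
        /* either the generic rte_ring_dequeue_bulk() or the HTS-specific one */
        return rte_ring_hts_dequeue_bulk(r, objs, n, NULL);
}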

> 
> > 2) Relaxed Tail Sync (RTS)
> > The main difference from original MP/MC algorithm is that tail value is
> > increased not by every thread that finished enqueue/dequeue, but only by the
> > last one.
> > That allows threads to avoid spinning on ring tail value, leaving actual tail value
> > change to the last thread in the update queue.
> This can be a configurable flag on the ring.
> I am not sure how this solves the problem you have stated above completely. Updating the count from all intermediate threads is still
> required to update the value of the head. But yes, it reduces the severity of the problem by not enforcing the order in which the tail is
> updated.

As I said above, the main source of slowdown here is
that we have to update the tail in a particular order.
So the main objective (same as for HTS) is to remove
that ordering.
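Roughly, the scheme looks like this (a simplified sketch of the idea as described
in the cover letter, not the actual patch code): head and tail each carry a ring
position plus a counter of started/completed operations; a finishing thread only
bumps the completed counter, and the tail position itself is moved forward only
when every started operation has completed:

#include <stdint.h>

union rts_poscnt {
        uint64_t raw;
        struct {
                uint32_t cnt;   /* number of started/completed ops */
                uint32_t pos;   /* ring position */
        } val;
};

static void
rts_update_tail(union rts_poscnt *tail, const union rts_poscnt *head)
{
        union rts_poscnt h, ot, nt;

        ot.raw = __atomic_load_n(&tail->raw, __ATOMIC_ACQUIRE);

        do {
                h.raw = __atomic_load_n(&head->raw, __ATOMIC_RELAXED);

                nt.raw = ot.raw;
                nt.val.cnt++;   /* this operation has completed */

                /* last outstanding op: publish the new tail position */
                if (nt.val.cnt == h.val.cnt)
                        nt.val.pos = h.val.pos;

        } while (__atomic_compare_exchange_n(&tail->raw, &ot.raw, nt.raw,
                        0, __ATOMIC_RELEASE, __ATOMIC_ACQUIRE) == 0);
}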

> I also think it introduces the problem on the other side of the ring because the tail is not updated soon enough (the other side has to wait
> longer for the elements to become available).

Yes, producer/consumer starvation.
That's why we need a maximum allowed Head-Tail-Distance (htd_max) -
to limit how far the head can go away from the tail.

> It also introduces another configuration parameter (HTD_MAX_DEF) which they have to deal
> with.

If the user doesn't provide any value, it will be set by default to ring.capacity / 8.
From my measurements that works quite well.
Though there is the possibility for the user to set another value, if needed.
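In other words, something like this (an illustration only - the names and exact
placement are assumptions, not the patch code):

#include <stdint.h>

/* default limit when the user does not supply one */
#define HTD_MAX_DEF(capacity)   ((capacity) / 8)

/* wrap-safe check before moving the head another n slots ahead of the tail */
static inline int
htd_within_limit(uint32_t head, uint32_t tail, uint32_t n, uint32_t htd_max)
{
        return (uint32_t)(head + n - tail) <= htd_max;
}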

> Users have to still implement the current hypervisor related solutions.

Didn't get what you're trying to say with that phrase.

> IMO, we should run the benchmark for this on an over committed setup to understand the benefits.

That's why I created the ring_stress_*autotest test-cases and collected the numbers provided below.
I suppose they clearly show the problem in overcommitted scenarios,
and how RTS/HTS improve the situation.
Would appreciate it if you could repeat these tests on your machines.

> 
> >
> > Test results on IA (see below) show significant improvements for average
> > enqueue/dequeue op times on overcommitted systems.
> > For 'classic' DPDK deployments (one thread per core) original MP/MC
> > algorithm still shows best numbers, though for 64-bit target RTS numbers are
> > not that far away.
> > Numbers were produced by ring_stress_*autotest (first patch in these series).
> >
> > X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > DEQ+ENQ average cycles/obj
> >
> >                                                 MP/MC      HTS     RTS
> > 1thread@1core(--lcores=6-7)                     8.00       8.15    8.99
> > 2thread@2core(--lcores=6-8)                     19.14      19.61   20.35
> > 4thread@4core(--lcores=6-10)                    29.43      29.79   31.82
> > 8thread@8core(--lcores=6-14)                    110.59     192.81  119.50
> > 16thread@16core(--lcores=6-22)                  461.03     813.12  495.59
> > 32thread/@32core(--lcores='6-22,55-70')         982.90     1972.38 1160.51
> >
> > 2thread@1core(--lcores='6,(10-11)@7'            20140.50   23.58   25.14
> > 4thread@2core(--lcores='6,(10-11)@7,(20-21)@8'  153680.60  76.88   80.05
> > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  280314.32  294.72  318.79
> > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 643176.59  1144.02 1175.14
> > 32thread@2core(--lcores='6,(10-25)@7,(30-45)@8' 4264238.80 4627.48 4892.68
> >
> > 8thread@2core(--lcores='6,(10-17)@(7,8))'       321085.98  298.59  307.47
> > 16thread@4core(--lcores='6,(20-35)@(7-10))'     1900705.61 575.35  678.29
> > 32thread@4core(--lcores='6,(20-51)@(7-10))'     5510445.85 2164.36 2714.12
> >
> > i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > DEQ+ENQ average cycles/obj
> >
> >                                                 MP/MC      HTS     RTS
> > 1thread@1core(--lcores=6-7)                     7.85       12.13   11.31
> > 2thread@2core(--lcores=6-8)                     17.89      24.52   21.86
> > 8thread@8core(--lcores=6-14)                    32.58      354.20  54.58
> > 32thread/@32core(--lcores='6-22,55-70')         813.77     6072.41 2169.91
> >
> > 2thread@1core(--lcores='6,(10-11)@7'            16095.00   36.06   34.74
> > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  1140354.54 346.61  361.57
> > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 1920417.86 1314.90 1416.65
> >
> > 8thread@2core(--lcores='6,(10-17)@(7,8))'       594358.61  332.70  357.74
> > 32thread@4core(--lcores='6,(20-51)@(7-10))'     5319896.86 2836.44 3028.87
> >
> > Konstantin Ananyev (6):
> >   test/ring: add contention stress test
> >   ring: rework ring layout to allow new sync schemes
> >   ring: introduce RTS ring mode
> >   test/ring: add contention stress test for RTS ring
> >   ring: introduce HTS ring mode
> >   test/ring: add contention stress test for HTS ring
> >
> >  app/test/Makefile                      |   3 +
> >  app/test/meson.build                   |   3 +
> >  app/test/test_pdump.c                  |   6 +-
> >  app/test/test_ring_hts_stress.c        |  28 ++
> >  app/test/test_ring_rts_stress.c        |  28 ++
> >  app/test/test_ring_stress.c            |  27 ++
> >  app/test/test_ring_stress.h            | 477 +++++++++++++++++++
> >  lib/librte_pdump/rte_pdump.c           |   2 +-
> >  lib/librte_port/rte_port_ring.c        |  12 +-
> >  lib/librte_ring/Makefile               |   4 +-
> >  lib/librte_ring/meson.build            |   4 +-
> >  lib/librte_ring/rte_ring.c             |  84 +++-
> >  lib/librte_ring/rte_ring.h             | 619 +++++++++++++++++++++++--
> >  lib/librte_ring/rte_ring_elem.h        |   8 +-
> >  lib/librte_ring/rte_ring_hts_generic.h | 228 +++++++++
> > lib/librte_ring/rte_ring_rts_generic.h | 240 ++++++++++
> >  16 files changed, 1721 insertions(+), 52 deletions(-)
> >  create mode 100644 app/test/test_ring_hts_stress.c
> >  create mode 100644 app/test/test_ring_rts_stress.c
> >  create mode 100644 app/test/test_ring_stress.c
> >  create mode 100644 app/test/test_ring_stress.h
> >  create mode 100644 lib/librte_ring/rte_ring_hts_generic.h
> >  create mode 100644 lib/librte_ring/rte_ring_rts_generic.h
> >
> > --
> > 2.17.1
  
Honnappa Nagarahalli March 30, 2020, 9:29 p.m. UTC | #15
<snip>

> >
> > > Subject: [dpdk-dev] [RFC 0/6] New sync modes for ring
> > >
> > > Upfront note - that RFC is not a complete patch.
> > > It introduces an ABI breakage, plus it doesn't update ring_elem code
> > > properly,
> > As per the current rules, these changes (in the current form) will be
> > accepted only for 20.11 release. How do we address this for immediate
> requirements like RCU defer APIs?
> 
> I think I found a way to introduce these new modes without API/ABI breakage.
> Working on v1 right now. Plan to submit it by end of that week/start of next
> one.
ok

> 
> > I suggest that we move forward with my RFC (taking into consideration your
> feedback) to make progress on RCU APIs.
> >
> > > etc.
> > > I plan to deal with all these things in later versions.
> > > Right now I seek an initial feedback about proposed ideas.
> > > Would also ask people to repeat performance tests (see below) on
> > > their platforms to confirm the impact.
> > >
> > > More and more customers use(/try to use) DPDK based apps within
> > > overcommitted systems (multiple acttive threads over same pysical cores):
> > > VM, container deployments, etc.
> > > One quite common problem they hit: Lock-Holder-Preemption with
> rte_ring.
> > > LHP is quite a common problem for spin-based sync primitives
> > > (spin-locks, etc.) on overcommitted systems.
> > > The situation gets much worse when some sort of fair-locking
> > > technique is used (ticket-lock, etc.).
> > > As now not only lock-owner but also lock-waiters scheduling order
> > > matters a lot.
> > > This is a well-known problem for kernel within VMs:
> > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > The problem with rte_ring is that while head accusion is sort of
> > > un-fair locking, waiting on tail is very similar to ticket lock
> > > schema - tail has to be updated in particular order.
> > > That makes current rte_ring implementation to perform really pure on
> > > some overcommited scenarios.
> > > While it is probably not possible to completely resolve this problem
> > > in userspace only (without some kernel communication/intervention),
> > > removing fairness in tail update can mitigate it significantly.
> > > So this RFC proposes two new optional ring synchronization modes:
> > > 1) Head/Tail Sync (HTS) mode
> > > In that mode enqueue/dequeue operation is fully serialized:
> > >     only one thread at a time is allowed to perform given op.
> > >     As another enhancement provide ability to split enqueue/dequeue
> > >     operation into two phases:
> > >       - enqueue/dequeue start
> > >       - enqueue/dequeue finish
> > >     That allows user to inspect objects in the ring without removing
> > >     them from it (aka MT safe peek).
> > IMO, this will not address the problem described above.
> 
> It does, please see the results produced by ring_stress_*autotest below.
> Let say for test-case: 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8' it
Had not looked at these tests. Please see the numbers below.

> shows:
> avg number of cycles per object for enqueue /dequeue:
> MP/MC: 280314.32
> HTS:         294.72
> RTS:         318.79
> 
> Customer who tried it reported similar level of improvement.
Is this tested with the VM/Container setup described in the slides you referred to?

> Actually if you have time - would be very interesting to see what numbers will
> be on ARM boxes.
> To reproduce, just:
> $cat ring_tests_u4
> ring_stress_autotest
> ring_stress_hts_autotest
> ring_stress_rts_autotest
> 
> /app/test/dpdk-test --lcores='6,(10-13)@7,(20-23)@8'  -n 4 < ring_tests_u4
> 2>&1 | tee res1
> 
> Then look at the ' AGGREGATE' stats.
> Right now it is a bit too verbose, so probably the easiest thing to extract same
> numbers quickly:
> grep 'cycles/obj'  res1 | grep 'cycles/obj' | cat -n | awk '{if ($(1)%9==0) print
> $(NF);}'
> 280314.32
> 1057833.55
> 294.72
> 480.10
> 318.79
> 461.52
> 
> First 2 numbers will be for MP/MC, next 2 for HTS, last 2 for RTS.
12305.05
12027.09
3.59
7.37
4.41
7.98

> 
> > For ex: when a producer updates the head and gets scheduled out, other
> > producers have to spin.
> 
> Sure, as I wrote in original cover letter:
> " While it is probably not possible to completely resolve this problem in
> userspace only (without some kernel communication/intervention), removing
> fairness in tail update can mitigate it significantly."
> Results from the ring_stress_*_autotest confirm that.
> 
> > The problem is probably worse as with non-HTS case moving of the head
> > and copying of the ring elements can happen in parallel between the
> producers (similarly for consumers).
> 
> Yes as we serialize the ring, we remove possibility of simultaneous copy.
> That's why for 'normal' cases (one thread per core) original MP/MC is usually
> faster.
> Though on overcommitted scenarios current MP/MC performance degrades
> dramatically.
> The main problem with current MP/MC implementation is in that tail update
> have to be done in strict order (sort of fair locking scheme).
> Which means that we have much-much worse LHP manifestation, then when
> we use unfair schemes.
> With serialized ring (HTS) we remove that ordering completely (same idea as
> switch from fair to unfair locking for PV spin-locks).
> 
> > IMO, HTS should not be a configurable flag.
> 
> Why?
> 
> > In RCU requirement, a MP enqueue and HTS dequeue are required.
> 
> This is supported, user can specify different modes for consumer and
> producer:
> (0 | RING_F_MC_HTS_DEQ).
> Then it is up to the user either to call generic
> rte_ring_enqueue/rte_ring_dequeue,
> or specify mode manually by function name:
> rte_ring_mp_enqueue_bulk/ rte_ring_hts_dequeue_bulk.
Ok, that should be good.

> 
> >
> > > 2) Relaxed Tail Sync (RTS)
> > > The main difference from original MP/MC algorithm is that tail value
> > > is increased not by every thread that finished enqueue/dequeue, but
> > > only by the last one.
> > > That allows threads to avoid spinning on ring tail value, leaving
> > > actual tail value change to the last thread in the update queue.
> > This can be a configurable flag on the ring.
> > I am not sure how this solves the problem you have stated above
> > completely. Updating the count from all intermediate threads is still
> > required to update the value of the head. But yes, it reduces the severity of
> the problem by not enforcing the order in which the tail is updated.
> 
> As I said above, main source of slowdown here - that we have to  update tail in
> particular order.
> So the main objective (same as for HTS) is to remove that ordering.
> 
> > I also think it introduces the problem on the other side of the ring
> > because the tail is not updated soon enough (the other side has to wait
> longer for the elements to become available).
> 
> Yes, producer/consumer starvation.
> That's why we need max allowed Head-Tail-Distance (htd_max) - to limit how
> far head can go away from tail.
> 
> > It also introduces another configuration parameter (HTD_MAX_DEF) which
> > they have to deal with.
> 
> If user doesn't provide any value, it will be set by default to ring.capacity / 8.
> From my measurements works quite well.
> Though there possibility for the user to set another value, if needed.
> 
> > Users have to still implement the current hypervisor related solutions.
> 
> Didn't get what you trying to say with that phrase.
The references you provided talked about resolving LHP by doing co-scheduling of vCPUs (which I think could be applied to DPDK applications). I am saying that we still need such mechanisms along with these solutions.

> 
> > IMO, we should run the benchmark for this on an over committed setup to
> understand the benefits.
> 
> That's why I created ring_stress_*autotest test-cases and collected numbers
> provided below.
> I suppose they clearly show the problem on overcommitted scenarios, and
> how RTS/HTS improve that situation.
> Would appreciate if you repeat these tests on your machines.
> 
> >
> > >
> > > Test results on IA (see below) show significant improvements for
> > > average enqueue/dequeue op times on overcommitted systems.
> > > For 'classic' DPDK deployments (one thread per core) original MP/MC
> > > algorithm still shows best numbers, though for 64-bit target RTS
> > > numbers are not that far away.
> > > Numbers were produced by ring_stress_*autotest (first patch in these
> series).
> > >
> > > X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > > DEQ+ENQ average cycles/obj
> > >
> > >                                                 MP/MC      HTS     RTS
> > > 1thread@1core(--lcores=6-7)                     8.00       8.15    8.99
> > > 2thread@2core(--lcores=6-8)                     19.14      19.61   20.35
> > > 4thread@4core(--lcores=6-10)                    29.43      29.79   31.82
> > > 8thread@8core(--lcores=6-14)                    110.59     192.81  119.50
> > > 16thread@16core(--lcores=6-22)                  461.03     813.12  495.59
> > > 32thread/@32core(--lcores='6-22,55-70')         982.90     1972.38 1160.51
> > >
> > > 2thread@1core(--lcores='6,(10-11)@7'            20140.50   23.58   25.14
> > > 4thread@2core(--lcores='6,(10-11)@7,(20-21)@8'  153680.60  76.88   80.05
> > > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  280314.32  294.72  318.79
> > > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 643176.59  1144.02 1175.14
> > > 32thread@2core(--lcores='6,(10-25)@7,(30-45)@8' 4264238.80 4627.48 4892.68
> > >
> > > 8thread@2core(--lcores='6,(10-17)@(7,8))'       321085.98  298.59  307.47
> > > 16thread@4core(--lcores='6,(20-35)@(7-10))'     1900705.61 575.35  678.29
> > > 32thread@4core(--lcores='6,(20-51)@(7-10))'     5510445.85 2164.36 2714.12
> > >
> > > i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > > DEQ+ENQ average cycles/obj
> > >
> > >                                                 MP/MC      HTS     RTS
> > > 1thread@1core(--lcores=6-7)                     7.85       12.13   11.31
> > > 2thread@2core(--lcores=6-8)                     17.89      24.52   21.86
> > > 8thread@8core(--lcores=6-14)                    32.58      354.20  54.58
> > > 32thread/@32core(--lcores='6-22,55-70')         813.77     6072.41 2169.91
> > >
> > > 2thread@1core(--lcores='6,(10-11)@7'            16095.00   36.06   34.74
> > > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  1140354.54 346.61  361.57
> > > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 1920417.86 1314.90 1416.65
> > >
> > > 8thread@2core(--lcores='6,(10-17)@(7,8))'       594358.61  332.70  357.74
> > > 32thread@4core(--lcores='6,(20-51)@(7-10))'     5319896.86 2836.44 3028.87
> > >
> > > Konstantin Ananyev (6):
> > >   test/ring: add contention stress test
> > >   ring: rework ring layout to allow new sync schemes
> > >   ring: introduce RTS ring mode
> > >   test/ring: add contention stress test for RTS ring
> > >   ring: introduce HTS ring mode
> > >   test/ring: add contention stress test for HTS ring
> > >
> > >  app/test/Makefile                      |   3 +
> > >  app/test/meson.build                   |   3 +
> > >  app/test/test_pdump.c                  |   6 +-
> > >  app/test/test_ring_hts_stress.c        |  28 ++
> > >  app/test/test_ring_rts_stress.c        |  28 ++
> > >  app/test/test_ring_stress.c            |  27 ++
> > >  app/test/test_ring_stress.h            | 477 +++++++++++++++++++
> > >  lib/librte_pdump/rte_pdump.c           |   2 +-
> > >  lib/librte_port/rte_port_ring.c        |  12 +-
> > >  lib/librte_ring/Makefile               |   4 +-
> > >  lib/librte_ring/meson.build            |   4 +-
> > >  lib/librte_ring/rte_ring.c             |  84 +++-
> > >  lib/librte_ring/rte_ring.h             | 619 +++++++++++++++++++++++--
> > >  lib/librte_ring/rte_ring_elem.h        |   8 +-
> > >  lib/librte_ring/rte_ring_hts_generic.h | 228 +++++++++
> > > lib/librte_ring/rte_ring_rts_generic.h | 240 ++++++++++
> > >  16 files changed, 1721 insertions(+), 52 deletions(-)
> > >  create mode 100644 app/test/test_ring_hts_stress.c
> > >  create mode 100644 app/test/test_ring_rts_stress.c
> > >  create mode 100644 app/test/test_ring_stress.c
> > >  create mode 100644 app/test/test_ring_stress.h
> > >  create mode 100644 lib/librte_ring/rte_ring_hts_generic.h
> > >  create mode 100644 lib/librte_ring/rte_ring_rts_generic.h
> > >
> > > --
> > > 2.17.1
  
Honnappa Nagarahalli March 30, 2020, 11:37 p.m. UTC | #16
<snip>
> 
> > >
> > > > Subject: [dpdk-dev] [RFC 0/6] New sync modes for ring
> > > >
> > > > Upfront note - that RFC is not a complete patch.
> > > > It introduces an ABI breakage, plus it doesn't update ring_elem
> > > > code properly,
> > > As per the current rules, these changes (in the current form) will
> > > be accepted only for 20.11 release. How do we address this for
> > > immediate
> > requirements like RCU defer APIs?
> >
> > I think I found a way to introduce these new modes without API/ABI
> breakage.
> > Working on v1 right now. Plan to submit it by end of that week/start
> > of next one.
> ok
RCU defer APIs require the rte_ring_xxx_elem versions. I guess you are adding those as well.

> 
<snip>
  
Ananyev, Konstantin March 31, 2020, 5:21 p.m. UTC | #17
> <snip>
> >
> > > >
> > > > > Subject: [dpdk-dev] [RFC 0/6] New sync modes for ring
> > > > >
> > > > > Upfront note - that RFC is not a complete patch.
> > > > > It introduces an ABI breakage, plus it doesn't update ring_elem
> > > > > code properly,
> > > > As per the current rules, these changes (in the current form) will
> > > > be accepted only for 20.11 release. How do we address this for
> > > > immediate
> > > requirements like RCU defer APIs?
> > >
> > > I think I found a way to introduce these new modes without API/ABI
> > breakage.
> > > Working on v1 right now. Plan to submit it by end of that week/start
> > > of next one.
> > ok
> RCU defer APIs require the rte_ring_xxx_elem versions. I guess you are adding those as well.

Yes, I added it in V1, please have a look.
Also I made the 'legacy' peek API (enqueue/dequeue_start/finish) call
the 'elem' peek API (enqueue/dequeue_elem_start/finish).
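Just to illustrate how the peek flow is meant to be used (a rough sketch only;
the exact function names below are assumed from the start/finish naming above and
may differ in the actual patches, and all_expired() is a made-up helper):

#include <rte_ring.h>   /* plus whatever header ends up exposing the peek API */

/* application-specific check (hypothetical placeholder) */
static int
all_expired(void * const objs[], unsigned int n)
{
        (void)objs;
        (void)n;
        return 1;
}

/* assumes the ring was created with the HTS (or single-consumer) dequeue mode */
static void
drain_if_expired(struct rte_ring *r)
{
        void *objs[8];
        unsigned int n;

        /* phase 1: look at the objects without removing them from the ring */
        n = rte_ring_dequeue_bulk_start(r, objs, RTE_DIM(objs), NULL);
        if (n == 0)
                return;

        if (!all_expired(objs, n))
                n = 0;          /* leave everything in the ring */

        /* phase 2: commit removal of n objects (n == 0 aborts the dequeue) */
        rte_ring_dequeue_finish(r, n);
}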
About naming: I thought about changing start/finish to reserve/commit,
but decided to leave it as is for now - in case you would like to go ahead
with the SG API and use reserve/commit there.
Konstantin