mbox series

[v9,0/4] lib/rcu: add RCU library supporting QSBR mechanism

Message ID 20190501035419.33524-1-honnappa.nagarahalli@arm.com (mailing list archive)
Headers show
Series lib/rcu: add RCU library supporting QSBR mechanism | expand

Message

Honnappa Nagarahalli May 1, 2019, 3:54 a.m. UTC
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for ex: real-time applications).

In the following paras, the term 'memory' refers to memory allocated
by typical APIs like malloc or anything that is representative of
memory, for ex: an index of a free element array.

Since these data structures are lock less, the writers and readers
are accessing the data structures concurrently. Hence, while removing
an element from a data structure, the writers cannot return the memory
to the allocator, without knowing that the readers are not
referencing that element/memory anymore. Hence, it is required to
separate the operation of removing an element into 2 steps:

Delete: in this step, the writer removes the reference to the element from
the data structure but does not return the associated memory to the
allocator. This will ensure that new readers will not get a reference to
the removed element. Removing the reference is an atomic operation.

Free(Reclaim): in this step, the writer returns the memory to the
memory allocator, only after knowing that all the readers have stopped
referencing the deleted element.

This library helps the writer determine when it is safe to free the
memory.

This library makes use of thread Quiescent State (QS). QS can be
defined as 'any point in the thread execution where the thread does
not hold a reference to shared memory'. It is upto the application to
determine its quiescent state. Let us consider the following diagram:

    Time -------------------------------------------------->

                                |     |
  RT1   $++++****D1****+++***D2*|**+++|+++**D3*****++++$
                                |     |
  RT2          $++++****D1****++|+**D2|***++++++**D3*****++++$
                                |     |
  RT3      $++++****D1****+++***|D2***|++++++**D2*****++++$
                                |     |
                                |<--->|
                               Del | Free
                                   |
                              Cannot free memory
                              during this period
                              (Grace Period)

RTx - Reader thread
< and > - Start and end of while(1) loop
***Dx*** - Reader thread is accessing the shared data structure Dx.
           i.e. critical section.
+++ - Reader thread is not accessing any shared data structure.
      i.e. non critical section or quiescent state.
Del - Point in time when the reference to the entry is removed using
      atomic operation.
Free - Point in time when the writer can free the entry.
Grace Period - Time duration between Del and Free, during which memory cannot
               be freed.

As shown, thread RT1 accesses data structures D1, D2 and D3. When it is
accessing D2, if the writer has to remove an element from D2, the
writer cannot free the memory associated with that element immediately.
The writer can return the memory to the allocator only after the reader
stops referencing D2. In other words, reader thread RT1 has to enter
a quiescent state.

Similarly, since thread RT3 is also accessing D2, writer has to wait till
RT3 enters quiescent state as well.

However, the writer does not need to wait for RT2 to enter quiescent state.
Thread RT2 was not accessing D2 when the delete operation happened.
So, RT2 will not get a reference to the deleted entry.

It can be noted that, the critical sections for D2 and D3 are quiescent states
for D1. i.e. for a given data structure Dx, any point in the thread execution
that does not reference Dx is a quiescent state.

Since memory is not freed immediately, there might be a need for
provisioning of additional memory, depending on the application requirements.

It is important to make sure that this library keeps the overhead of
identifying the end of grace period and subsequent freeing of memory,
to a minimum. The following paras explain how grace period and critical
section affect this overhead.

The writer has to poll the readers to identify the end of grace period.
Polling introduces memory accesses and wastes CPU cycles. The memory
is not available for reuse during grace period. Longer grace periods
exasperate these conditions.

The length of the critical section and the number of reader threads
is proportional to the duration of the grace period. Keeping the critical
sections smaller will keep the grace period smaller. However, keeping the
critical sections smaller requires additional CPU cycles(due to additional
reporting) in the readers.

Hence, we need the characteristics of small grace period and large critical
section. This library addresses this by allowing the writer to do
other work without having to block till the readers report their quiescent
state.

For DPDK applications, the start and end of while(1) loop (where no
references to shared data structures are kept) act as perfect quiescent
states. This will combine all the shared data structure accesses into a
single, large critical section which helps keep the overhead on the
reader side to a minimum.

DPDK supports pipeline model of packet processing and service cores.
In these use cases, a given data structure may not be used by all the
workers in the application. The writer does not have to wait for all
the workers to report their quiescent state. To provide the required
flexibility, this library has a concept of QS variable. The application
can create one QS variable per data structure to help it track the
end of grace period for each data structure. This helps keep the grace
period to a minimum.

The application has to allocate memory and initialize a QS variable.

Application can call rte_rcu_qsbr_get_memsize to calculate the size
of memory to allocate. This API takes maximum number of reader threads,
using this variable, as a parameter. Currently, a maximum of 1024 threads
are supported.

Further, the application can initialize a QS variable using the API
rte_rcu_qsbr_init.

Each reader thread is assumed to have a unique thread ID. Currently, the
management of the thread ID (for ex: allocation/free) is left to the
application. The thread ID should be in the range of 0 to
maximum number of threads provided while creating the QS variable.
The application could also use lcore_id as the thread ID where applicable.

rte_rcu_qsbr_thread_register API will register a reader thread
to report its quiescent state. This can be called from a reader thread.
A control plane thread can also call this on behalf of a reader thread.
The reader thread must call rte_rcu_qsbr_thread_online API to start reporting
its quiescent state.

Some of the use cases might require the reader threads to make
blocking API calls (for ex: while using eventdev APIs). The writer thread
should not wait for such reader threads to enter quiescent state.
The reader thread must call rte_rcu_qsbr_thread_offline API, before calling
blocking APIs. It can call rte_rcu_qsbr_thread_online API once the blocking
API call returns.

The writer thread can trigger the reader threads to report their quiescent
state by calling the API rte_rcu_qsbr_start. It is possible for multiple
writer threads to query the quiescent state status simultaneously. Hence,
rte_rcu_qsbr_start returns a token to each caller.

The writer thread has to call rte_rcu_qsbr_check API with the token to get the
current quiescent state status. Option to block till all the reader threads
enter the quiescent state is provided. If this API indicates that all the
reader threads have entered the quiescent state, the application can free the
deleted entry.

The APIs rte_rcu_qsbr_start and rte_rcu_qsbr_check are lock free. Hence, they
can be called concurrently from multiple writers even while running
as worker threads.

The separation of triggering the reporting from querying the status provides
the writer threads flexibility to do useful work instead of blocking for the
reader threads to enter the quiescent state or go offline. This reduces the
memory accesses due to continuous polling for the status.

rte_rcu_qsbr_synchronize API combines the functionality of rte_rcu_qsbr_start
and blocking rte_rcu_qsbr_check into a single API. This API triggers the reader
threads to report their quiescent state and polls till all the readers enter
the quiescent state or go offline. This API does not allow the writer to
do useful work while waiting and also introduces additional memory accesses
due to continuous polling.

The reader thread must call rte_rcu_qsbr_thread_offline and
rte_rcu_qsbr_thread_unregister APIs to remove itself from reporting its
quiescent state. The rte_rcu_qsbr_check API will not wait for this reader
thread to report the quiescent state status anymore.

The reader threads should call rte_rcu_qsbr_update API to indicate that they
entered a quiescent state. This API checks if a writer has triggered a
quiescent state query and update the state accordingly.

Patch v9:
1) Files/test cases/function names listed alphabetically where required (Thomas)

Patch v8:
1) Library changes
   a) Symbols prefixed with '__RTE' or 'rte_' as required (Thomas)
   b) Used PRI?64 macros to support 32b compilation (Thomas)
   c) Fixed shared library compilation (Thomas)
2) Test cases
   a) Fixed segmentation fault when more than 20 cores are used for testing (Jerin)
   b) Used PRI?64 macros to support 32b compilation (Thomas)
   c) Testing done on x86, ThunderX2, Octeon TX, BlueField for 32b(x86 only)/64b,
      debug/non-debug, shared/static linking, meson/makefile with various
      number of cores

Patch v7:
1) Library changes
   a) Added macro RCU_IS_LOCK_CNT_ZERO
   b) Added lock counter validation to rte_rcu_qsbr_thread_online/
      rte_rcu_qsbr_thread_offline/rte_rcu_qsbr_thread_register/
      rte_rcu_qsbr_thread_unregister APIs (Paul)

Patch v6:
1) Library changes
   a) Fixed and tested meson build on Arm and x86 (Konstantin)
   b) Moved rte_rcu_qsbr_synchronize API to rte_rcu_qsbr.c

Patch v5:
1) Library changes
   a) Removed extra alignment in rte_rcu_qsbr_get_memsize API (Paul)
   b) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock APIs (Paul)
   c) Clarified the need for 64b counters (Paul)
2) Test cases
   a) Added additional performance test cases to benchmark
      __rcu_qsbr_check_all
   b) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock calls in various test cases
3) Documentation
   a) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock usage description

Patch v4:
1) Library changes
   a) Fixed the compilation issue on x86 (Konstantin)
   b) Rebased with latest master

Patch v3:
1) Library changes
   a) Moved the registered thread ID array to the end of the
      structure (Konstantin)
   b) Removed the compile time constant RTE_RCU_MAX_THREADS
   c) Added code to keep track of registered number of threads

Patch v2:
1) Library changes
   a) Corrected the RTE_ASSERT checks (Konstantin)
   b) Replaced RTE_ASSERT with 'if' checks for non-datapath APIs (Konstantin)
   c) Made rte_rcu_qsbr_thread_register/unregister non-datapath critical APIs
   d) Renamed rte_rcu_qsbr_update to rte_rcu_qsbr_quiescent (Ola)
   e) Used rte_smp_mb() in rte_rcu_qsbr_thread_online API for x86 (Konstantin)
   f) Removed the macro to access the thread QS counters (Konstantin)
2) Test cases
   a) Added additional test cases for removing RTE_ASSERT
3) Documentation
   a) Changed the figure to make it bigger (Marko)
   b) Spelling and format corrections (Marko)

Patch v1:
1) Library changes
   a) Changed the maximum number of reader threads to 1024
   b) Renamed rte_rcu_qsbr_register/unregister_thread to
      rte_rcu_qsbr_thread_register/unregister
   c) Added rte_rcu_qsbr_thread_online/offline API. These are optimized
      version of rte_rcu_qsbr_thread_register/unregister API. These
      also provide the flexibility for performance when the requested
      maximum number of threads is higher than the current number of
      threads.
   d) Corrected memory orderings in rte_rcu_qsbr_update
   e) Changed the signature of rte_rcu_qsbr_start API to return the token
   f) Changed the signature of rte_rcu_qsbr_start API to not take the
      expected number of QS states to wait.
   g) Added debug logs
   h) Added API and programmer guide documentation.

RFC v3:
1) Library changes
   a) Rebased to latest master
   b) Added new API rte_rcu_qsbr_get_memsize
   c) Add support for memory allocation for QSBR variable (Konstantin)
   d) Fixed a bug in rte_rcu_qsbr_check (Konstantin)
2) Testcase changes
   a) Separated stress tests into a performance test case file
   b) Added performance statistics

RFC v2:
1) Cover letter changes
   a) Explian the parameters that affect the overhead of using RCU
      and their effect
   b) Explain how this library addresses these effects to keep
      the overhead to minimum
2) Library changes
   a) Rename the library to avoid confusion (Matias, Bruce, Konstantin)
   b) Simplify the code/remove APIs to keep this library inline with
      other synchronisation mechanisms like locks (Konstantin)
   c) Change the design to support more than 64 threads (Konstantin)
   d) Fixed version map to remove static inline functions
3) Testcase changes
   a) Add boundary and additional functional test cases
   b) Add stress test cases (Paul E. McKenney)

Dharmik Thakkar (1):
  test/rcu_qsbr: add API and functional tests

Honnappa Nagarahalli (3):
  rcu: add RCU library supporting QSBR mechanism
  doc/rcu: add lib_rcu documentation
  doc: added RCU to the release notes

 MAINTAINERS                                   |    5 +
 app/test/Makefile                             |    2 +
 app/test/autotest_data.py                     |   12 +
 app/test/meson.build                          |    5 +
 app/test/test_rcu_qsbr.c                      | 1014 +++++++++++++++++
 app/test/test_rcu_qsbr_perf.c                 |  704 ++++++++++++
 config/common_base                            |    6 +
 doc/api/doxy-api-index.md                     |    3 +-
 doc/api/doxy-api.conf.in                      |    1 +
 .../prog_guide/img/rcu_general_info.svg       |  509 +++++++++
 doc/guides/prog_guide/index.rst               |    1 +
 doc/guides/prog_guide/rcu_lib.rst             |  185 +++
 doc/guides/rel_notes/release_19_05.rst        |    8 +
 lib/Makefile                                  |    2 +
 lib/librte_rcu/Makefile                       |   23 +
 lib/librte_rcu/meson.build                    |    7 +
 lib/librte_rcu/rte_rcu_qsbr.c                 |  277 +++++
 lib/librte_rcu/rte_rcu_qsbr.h                 |  641 +++++++++++
 lib/librte_rcu/rte_rcu_version.map            |   13 +
 lib/meson.build                               |    2 +-
 mk/rte.app.mk                                 |    1 +
 21 files changed, 3419 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_rcu_qsbr.c
 create mode 100644 app/test/test_rcu_qsbr_perf.c
 create mode 100644 doc/guides/prog_guide/img/rcu_general_info.svg
 create mode 100644 doc/guides/prog_guide/rcu_lib.rst
 create mode 100644 lib/librte_rcu/Makefile
 create mode 100644 lib/librte_rcu/meson.build
 create mode 100644 lib/librte_rcu/rte_rcu_qsbr.c
 create mode 100644 lib/librte_rcu/rte_rcu_qsbr.h
 create mode 100644 lib/librte_rcu/rte_rcu_version.map

Comments

Neil Horman May 1, 2019, 12:15 p.m. UTC | #1
On Tue, Apr 30, 2019 at 10:54:15PM -0500, Honnappa Nagarahalli wrote:
> Lock-less data structures provide scalability and determinism.
> They enable use cases where locking may not be allowed
> (for ex: real-time applications).
> 
I know this is version 9 of the patch, so I'm sorry for the late comment, but I
have to ask: Why re-invent this wheel?  There are already several Userspace RCU
libraries that are mature and carried by Linux and BSD distributions.  Why would
we throw another one into DPDK instead of just using whats already available,
mature and stable?

Neil
Honnappa Nagarahalli May 1, 2019, 2:56 p.m. UTC | #2
> 
> On Tue, Apr 30, 2019 at 10:54:15PM -0500, Honnappa Nagarahalli wrote:
> > Lock-less data structures provide scalability and determinism.
> > They enable use cases where locking may not be allowed (for ex:
> > real-time applications).
> >
> I know this is version 9 of the patch, so I'm sorry for the late comment, but I
> have to ask: Why re-invent this wheel?  There are already several Userspace
Thanks Neil, for asking the question. This has been debated before. Please refer to [2] for more details.

liburcu [1] was explored as it seemed to be familiar to others in the community . I am not aware of any other library.

There are unique requirements in DPDK and there is still scope for improvement from what is available. I have explained this in the cover letter without making a direct comparison to liburcu. May be it is worth tweaking the documentation to call this out explicitly.

[1] https://liburcu.org/
[2] http://mails.dpdk.org/archives/dev/2018-November/119875.html

> RCU libraries that are mature and carried by Linux and BSD distributions.
> Why would we throw another one into DPDK instead of just using whats
> already available, mature and stable?
> 
> Neil
Neil Horman May 1, 2019, 6:05 p.m. UTC | #3
On Wed, May 01, 2019 at 02:56:48PM +0000, Honnappa Nagarahalli wrote:
> > 
> > On Tue, Apr 30, 2019 at 10:54:15PM -0500, Honnappa Nagarahalli wrote:
> > > Lock-less data structures provide scalability and determinism.
> > > They enable use cases where locking may not be allowed (for ex:
> > > real-time applications).
> > >
> > I know this is version 9 of the patch, so I'm sorry for the late comment, but I
> > have to ask: Why re-invent this wheel?  There are already several Userspace
> Thanks Neil, for asking the question. This has been debated before. Please refer to [2] for more details.
> 
> liburcu [1] was explored as it seemed to be familiar to others in the community . I am not aware of any other library.
> 
> There are unique requirements in DPDK and there is still scope for improvement from what is available. I have explained this in the cover letter without making a direct comparison to liburcu. May be it is worth tweaking the documentation to call this out explicitly.
> 
I think what you're referring to here is the need for multiple QSBR variables,
yes?  I'm not sure thats, strictly speaking, a requirement.  It seems like its a
performance improvement, but I'm not sure thats the case (see performance
numbers below).

Regarding performance, we can't keep using raw performance as a trump card for
all other aspects of the DPDK.  This entire patch is meant to improve
performance, it seems like it would be worthwhile to gain the code consolidation
and reuse benefits for the minor performance hit.

Further to performance, I may be misreading this, but I ran the integrated
performance test you provided in this patch, as well as the benchmark tests for
liburcw (trimmed for easier reading here)

liburcw:
[nhorman@hmswarspite benchmark]$ ./test_urcu 7 1 1 -v -a 0 -a 1 -a 2 -a 3 -a 4 -a 5 -a 6 -a 7 -a 0
Adding CPU 0 affinity
Adding CPU 1 affinity
Adding CPU 2 affinity
Adding CPU 3 affinity
Adding CPU 4 affinity
Adding CPU 5 affinity
Adding CPU 6 affinity
Adding CPU 7 affinity
Adding CPU 0 affinity
running test for 1 seconds, 7 readers, 1 writers.
Writer delay : 0 loops.
Reader duration : 0 loops.
thread main  , tid 22712
thread_begin reader, tid 22726
thread_begin reader, tid 22729
thread_begin reader, tid 22728
thread_begin reader, tid 22727
thread_begin reader, tid 22731
thread_begin reader, tid 22730
thread_begin reader, tid 22732
thread_begin writer, tid 22733
thread_end reader, tid 22729
thread_end reader, tid 22731
thread_end reader, tid 22730
thread_end reader, tid 22728
thread_end reader, tid 22727
thread_end writer, tid 22733
thread_end reader, tid 22726
thread_end reader, tid 22732
total number of reads : 1185640774, writes 264444
SUMMARY /home/nhorman/git/userspace-rcu/tests/benchmark/.libs/lt-test_urcu testdur    1 nr_readers   7 rdur      0 wdur      0 nr_writers   1 wdelay      0 nr_reads   1185640774 nr_writes       264444 nr_ops   1185905218

DPDK:
Perf test: 1 writer, 7 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking QSBR check
Following numbers include calls to rte_hash functions
Cycles per 1 update(online/update/offline): 813407
Cycles per 1 check(start, check): 859679


Both of these tests qsbr rcu in each library using 7 readers and 1 writer.  Its
a little bit of an apples to oranges comparison, as the tests run using slightly
different parameters, and produce different output statistics, but I think they
can be somewhat normalized.  Primarily the stat that stuck out to me was the
DPDK Cycles per 1 update statistic, which I believe is effectively the number of
cycles spent in the test / the number of writer updates.  On DPDK that number in
this test run works out to 813407.  In the liburcw test, it reports the total
number of ops (cycles), and the number of writes completed within those cycles.
If we do the same division there we get 185905218 / 264444 = 4484. I may be
misreading something here, but that seems like a pretty significant write side
performance improvement over this implementation.

Neil
 
> [1] https://liburcu.org/
> [2] http://mails.dpdk.org/archives/dev/2018-November/119875.html
> 
> > RCU libraries that are mature and carried by Linux and BSD distributions.
> > Why would we throw another one into DPDK instead of just using whats
> > already available, mature and stable?
> > 
> > Neil
> 
>
Honnappa Nagarahalli May 1, 2019, 9:18 p.m. UTC | #4
> Subject: Re: [dpdk-dev] [PATCH v9 0/4] lib/rcu: add RCU library supporting
> QSBR mechanism
> 
> On Wed, May 01, 2019 at 02:56:48PM +0000, Honnappa Nagarahalli wrote:
> > >
> > > On Tue, Apr 30, 2019 at 10:54:15PM -0500, Honnappa Nagarahalli wrote:
> > > > Lock-less data structures provide scalability and determinism.
> > > > They enable use cases where locking may not be allowed (for ex:
> > > > real-time applications).
> > > >
> > > I know this is version 9 of the patch, so I'm sorry for the late
> > > comment, but I have to ask: Why re-invent this wheel?  There are
> > > already several Userspace
> > Thanks Neil, for asking the question. This has been debated before. Please
> refer to [2] for more details.
> >
> > liburcu [1] was explored as it seemed to be familiar to others in the
> community . I am not aware of any other library.
> >
> > There are unique requirements in DPDK and there is still scope for
> improvement from what is available. I have explained this in the cover letter
> without making a direct comparison to liburcu. May be it is worth tweaking the
> documentation to call this out explicitly.
> >
> I think what you're referring to here is the need for multiple QSBR variables,
> yes?  I'm not sure thats, strictly speaking, a requirement.  It seems like its a
> performance improvement, but I'm not sure thats the case (see performance
> numbers below).
DPDK supports service cores feature and pipeline mode where a particular data structure is used by a subset of readers. These use cases affect the writer and reader (which are on the data plane) in the following ways:

1) The writer does not need to wait for all the readers to complete the quiescent state. Writer does not need to spend CPU cycles and add to memory bandwidth polling the unwanted readers. DPDK has uses cases where the writer is on the data plane as well.
2) The readers that do not use the data structure do not have to spend cycles reporting their quiescent state. Note that these are data plane cycles

Other than this, please read about how grace period and critical section affect the over head introduced by QSBR mechanism in the cover letter. It also explains how this library solves this issue.

This is discussed in the discussion thread I provided earlier.

> 
> Regarding performance, we can't keep using raw performance as a trump card
IMO, performance is NOT a 'trump card'. The whole essence of DPDK is performance. If not for performance, would DPDK exist?

> for all other aspects of the DPDK.  This entire patch is meant to improve
> performance, it seems like it would be worthwhile to gain the code
> consolidation and reuse benefits for the minor performance hit.
Apologies, I did not understand this. Can you please elaborate code consolidation part?

> 
> Further to performance, I may be misreading this, but I ran the integrated
> performance test you provided in this patch, as well as the benchmark tests for
> liburcw (trimmed for easier reading here)
Just to be sure, I believe you are referring to *liburcu*

> 
> liburcw:
> [nhorman@hmswarspite benchmark]$ ./test_urcu 7 1 1 -v -a 0 -a 1 -a 2 -a 3 -a
> 4 -a 5 -a 6 -a 7 -a 0 Adding CPU 0 affinity Adding CPU 1 affinity Adding CPU 2
> affinity Adding CPU 3 affinity Adding CPU 4 affinity Adding CPU 5 affinity
> Adding CPU 6 affinity Adding CPU 7 affinity Adding CPU 0 affinity running test
> for 1 seconds, 7 readers, 1 writers.
> Writer delay : 0 loops.
> Reader duration : 0 loops.
> thread main  , tid 22712
> thread_begin reader, tid 22726
> thread_begin reader, tid 22729
> thread_begin reader, tid 22728
> thread_begin reader, tid 22727
> thread_begin reader, tid 22731
> thread_begin reader, tid 22730
> thread_begin reader, tid 22732
> thread_begin writer, tid 22733
> thread_end reader, tid 22729
> thread_end reader, tid 22731
> thread_end reader, tid 22730
> thread_end reader, tid 22728
> thread_end reader, tid 22727
> thread_end writer, tid 22733
> thread_end reader, tid 22726
> thread_end reader, tid 22732
> total number of reads : 1185640774, writes 264444
> SUMMARY /home/nhorman/git/userspace-rcu/tests/benchmark/.libs/lt-
> test_urcu testdur    1 nr_readers   7 rdur      0 wdur      0 nr_writers   1 wdelay
> 0 nr_reads   1185640774 nr_writes       264444 nr_ops   1185905218
> 
> DPDK:
> Perf test: 1 writer, 7 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking
> QSBR check Following numbers include calls to rte_hash functions Cycles per 1
> update(online/update/offline): 813407 Cycles per 1 check(start, check):
> 859679
> 
> 
> Both of these tests qsbr rcu in each library using 7 readers and 1 writer.  Its a
> little bit of an apples to oranges comparison, as the tests run using slightly
Thanks for running the test. Yes, it is apples to oranges comparison:
1) The test you are running is not the correct test assuming the code for this test is [3]
2) This is not QSBR

I suggest you use [4] for your testing. It also need further changes to match the test case in this patch. The function 'thr_reader' reports quiescent state every 1024 iterations, please change it to report every iteration.

After this you need to compare these results with the first test case in this patch.

[3] https://github.com/urcu/userspace-rcu/blob/master/tests/benchmark/test_urcu.c
[4] https://github.com/urcu/userspace-rcu/blob/master/tests/benchmark/test_urcu_qsbr.c

> different parameters, and produce different output statistics, but I think they
> can be somewhat normalized.  Primarily the stat that stuck out to me was the
> DPDK Cycles per 1 update statistic, which I believe is effectively the number of
> cycles spent in the test / the number of writer updates.  On DPDK that number
> in this test run works out to 813407.  In the liburcw test, it reports the total
> number of ops (cycles), and the number of writes completed within those
> cycles.
> If we do the same division there we get 185905218 / 264444 = 4484. I may be
> misreading something here, but that seems like a pretty significant write side
Yes, you are misreading. 'number of ops' is not cycles. It is sum of 'nr_writes' and 'nr_reads'. The test runs for 1 sec (uses 'sleep'), so these are number of operations done in 1 sec. You need to normalize to number of cycles using this data.

> performance improvement over this implementation.
> 
> Neil
> 
> > [1] https://liburcu.org/
> > [2] http://mails.dpdk.org/archives/dev/2018-November/119875.html
> >
> > > RCU libraries that are mature and carried by Linux and BSD distributions.
> > > Why would we throw another one into DPDK instead of just using whats
> > > already available, mature and stable?
> > >
> > > Neil
> >
> >
Thomas Monjalon May 1, 2019, 11:36 p.m. UTC | #5
01/05/2019 05:54, Honnappa Nagarahalli:
> Dharmik Thakkar (1):
>   test/rcu_qsbr: add API and functional tests
> 
> Honnappa Nagarahalli (3):
>   rcu: add RCU library supporting QSBR mechanism
>   doc/rcu: add lib_rcu documentation
>   doc: added RCU to the release notes

Applied with discussed doc changes, thanks.