
[v2,0/5] lib/stack: improve lockfree C11 implementation

Message ID 20200925174340.10014-1-steven.lariau@arm.com (mailing list archive)

Message

Steven Lariau Sept. 25, 2020, 5:43 p.m. UTC
  One implementation of the DPDK stack library is lock-free and is
based on the C11 memory model for atomics.
Some of its atomic operations use stronger memory orders than
necessary, and these can be relaxed.
This series relaxes some of these operations in order to improve
the performance of the stack library.
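
As general background on what relaxing a memory order means, here is a
minimal sketch in C11. It is not the rte_stack code: the types and the
names lf_list, lf_push, lf_pop_all and n_pushed are hypothetical, and
the design (push plus pop-everything) is chosen only because it is easy
to reason about. The CAS that publishes a node keeps release ordering,
while the initial head load and a statistics counter that orders
nothing can be relaxed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the rte_stack structures. */
struct node {
	struct node *next;
	int value;
};

struct lf_list {
	_Atomic(struct node *) head;
	atomic_size_t n_pushed; /* statistics only, orders nothing */
};

static void
lf_init(struct lf_list *l)
{
	atomic_init(&l->head, (struct node *)NULL);
	atomic_init(&l->n_pushed, 0);
}

static void
lf_push(struct lf_list *l, struct node *n)
{
	/* Relaxed load: the CAS below validates the value it returned. */
	struct node *old = atomic_load_explicit(&l->head,
			memory_order_relaxed);

	assert(n != NULL);
	do {
		n->next = old;
		/* Release on success publishes n->value and n->next.
		 * On failure, old is overwritten with the current head,
		 * so no ordering is needed on the failure path.
		 */
	} while (!atomic_compare_exchange_weak_explicit(&l->head, &old, n,
			memory_order_release, memory_order_relaxed));

	/* No other data depends on this counter, so relaxed is enough. */
	atomic_fetch_add_explicit(&l->n_pushed, 1, memory_order_relaxed);
}

static struct node *
lf_pop_all(struct lf_list *l)
{
	/* Acquire pairs with the release CAS in lf_push(). */
	return atomic_exchange_explicit(&l->head, (struct node *)NULL,
			memory_order_acquire);
}
```

Detaching the whole list with one exchange avoids the ABA problem that
a single-element pop has to handle, which is why the sketch does not
attempt a per-element pop.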

The series was tested on several architectures, both to check that
the implementation is correct and to measure performance.
Below are the results of the multi-threaded lock-free stack test
on a few architectures.
The cycle count is the average number of cycles per item needed to
perform a bulk push / pop.

$ sudo ./builddir/app/dpdk-test
RTE>>stack_lf_perf_autotest
                              difference compared to main
Cycles count on ThunderX2
 2 cores, bulk size =  8:           -15.85%
 2 cores, bulk size = 32:           -04.56%
 4 cores, bulk size =  8:           -05.00%
 4 cores, bulk size = 32:           -04.35%
16 cores, bulk size =  8:           -02.38%
16 cores, bulk size = 32:           -01.88%

                              difference compared to main
Cycles count on N1SDP
 2 cores, bulk size =  8:           +00.77%
 2 cores, bulk size = 32:           -16.00%

                              difference compared to main
Cycles count on Skylake
 2 cores, bulk size =  8:           -00.18%
 2 cores, bulk size = 32:           -00.95%
 4 cores, bulk size =  8:           -01.19%
 4 cores, bulk size = 32:           +00.64%
16 cores, bulk size =  8:           +01.20%
16 cores, bulk size = 32:           +00.48%
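
For readers who want to reproduce percentages of this kind, the
arithmetic can be sketched as below. This assumes the difference is
the relative change in cycles per item against main, so a negative
value means fewer cycles. The helper names cycles_per_item and
relative_diff_pct are hypothetical and are not taken from the test
application.

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles spent per item over a run of bulk operations. */
static double
cycles_per_item(uint64_t total_cycles, uint64_t iterations,
		unsigned int bulk_size)
{
	assert(iterations > 0 && bulk_size > 0);
	return (double)total_cycles / (double)(iterations * bulk_size);
}

/* Relative change of the patched build against the baseline, in percent.
 * Negative means the patched build needs fewer cycles per item.
 */
static double
relative_diff_pct(double base, double patched)
{
	assert(base > 0.0);
	return (patched - base) * 100.0 / base;
}
```

With made-up inputs, a baseline of 100 cycles per item and a patched
value of 84 give relative_diff_pct(100.0, 84.0) == -16.0.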

v2: added a comment explaining why a relaxed pop head CAS is valid
    added Fixes information

Steven Lariau (5):
  lib/stack: fix inconsistent weak / strong cas
  lib/stack: remove push acquire fence
  lib/stack: remove redundant orderings for list->len
  lib/stack: reload head when pop fails
  lib/stack: remove pop cas release ordering

 lib/librte_stack/rte_stack_lf_c11.h | 32 +++++++++++++++++++----------
 1 file changed, 21 insertions(+), 11 deletions(-)
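
The first patch title mentions weak and strong CAS. As C11 background
only, and not as a description of what the patches change, the sketch
below shows the usual rule of thumb: a strong CAS suits a single
attempt because it fails only when the value really differs, while a
weak CAS may fail spuriously and therefore belongs in a retry loop. In
both cases a failed CAS writes the observed value back into the
expected argument. The functions try_claim and add_capped are
hypothetical examples.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Single attempt: a strong CAS fails only if *flag is not 0. */
static bool
try_claim(atomic_int *flag)
{
	int expected = 0;

	assert(flag != NULL);
	return atomic_compare_exchange_strong_explicit(flag, &expected, 1,
			memory_order_acquire, memory_order_relaxed);
}

/* Retry loop: a weak CAS may fail spuriously, the loop absorbs that.
 * Increments *v unless it already reached cap; returns the value seen
 * before the increment (or cap if nothing was added).
 */
static int
add_capped(atomic_int *v, int cap)
{
	int cur = atomic_load_explicit(v, memory_order_relaxed);

	while (cur < cap &&
	       !atomic_compare_exchange_weak_explicit(v, &cur, cur + 1,
			memory_order_relaxed, memory_order_relaxed))
		; /* cur now holds the value observed by the failed CAS */

	return cur;
}
```

Because the failed CAS refreshes cur, the loop in add_capped() does not
need a separate reload before retrying.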
  

Comments

David Marchand Sept. 30, 2020, 7:14 p.m. UTC | #1
Series applied, thanks for those optimisations.