[v8,1/3] doc: add optimizations using C11 atomic built-ins

Message ID 1594875225-5850-2-git-send-email-phil.yang@arm.com (mailing list archive)
State Superseded, archived
Delegated to: David Marchand
Headers
Series generic rte atomic APIs deprecate proposal |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-testing success Testing PASS

Commit Message

Phil Yang July 16, 2020, 4:53 a.m. UTC
  Add information about possible optimizations using C11 atomic built-ins.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 doc/guides/prog_guide/writing_efficient_code.rst | 59 +++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 1 deletion(-)
  

Comments

David Marchand July 16, 2020, 10:35 a.m. UTC | #1
Hello,

On Thu, Jul 16, 2020 at 6:58 AM Phil Yang <phil.yang@arm.com> wrote:
>
> Add information about possible optimizations using C11 atomic built-ins.

We are missing a review on this doc update.

Thanks.
  
Honnappa Nagarahalli July 16, 2020, 6:22 p.m. UTC | #2
<snip>

> Subject: [PATCH v8 1/3] doc: add optimizations using C11 atomic built-ins
> 
> Add information about possible optimizations using C11 atomic built-ins.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Thanks for the changes, they look good now.

David wanted to change 'built-ins' to 'builtins', otherwise
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> ---
>  doc/guides/prog_guide/writing_efficient_code.rst | 59
> +++++++++++++++++++++++-
>  1 file changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/guides/prog_guide/writing_efficient_code.rst
> b/doc/guides/prog_guide/writing_efficient_code.rst
> index 849f63e..53a1ca1 100644
> --- a/doc/guides/prog_guide/writing_efficient_code.rst
> +++ b/doc/guides/prog_guide/writing_efficient_code.rst
> @@ -167,7 +167,13 @@ but with the added cost of lower throughput.
>  Locks and Atomic Operations
>  ---------------------------
> 
> -Atomic operations imply a lock prefix before the instruction,
> +This section describes some key considerations when using locks and
> +atomic operations in the DPDK environment.
> +
> +Locks
> +~~~~~
> +
> +On x86, atomic operations imply a lock prefix before the instruction,
>  causing the processor's LOCK# signal to be asserted during execution of the
> following instruction.
>  This has a big impact on performance in a multicore environment.
> 
> @@ -176,6 +182,57 @@ It can often be replaced by other solutions like per-
> lcore variables.
>  Also, some locking techniques are more efficient than others.
>  For instance, the Read-Copy-Update (RCU) algorithm can frequently replace
> simple rwlocks.
> 
> +Atomic Operations: Use C11 Atomic Built-ins
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +DPDK generic rte_atomic operations are implemented by __sync built-ins.
> +These __sync built-ins result in full barriers on aarch64, which are
> +unnecessary in many use cases. They can be replaced by __atomic
> +built-ins that conform to the C11 memory model and provide finer memory
> order control.
> +
> +So replacing the rte_atomic operations with __atomic built-ins might
> +improve performance for aarch64 machines.
> +
> +Some typical optimization cases are listed below:
> +
> +Atomicity
> +^^^^^^^^^
> +
> +Some use cases require atomicity alone, the ordering of the memory
> +operations does not matter. For example, the packet statistics counters
> +need to be incremented atomically but do not need any particular memory
> ordering.
> +So, RELAXED memory ordering is sufficient.
> +
> +One-way Barrier
> +^^^^^^^^^^^^^^^
> +
> +Some use cases allow for memory reordering in one way while requiring
> +memory ordering in the other direction.
> +
> +For example, the memory operations before the spinlock lock are allowed
> +to move to the critical section, but the memory operations in the
> +critical section are not allowed to move above the lock. In this case,
> +the full memory barrier in the compare-and-swap operation can be replaced
> with ACQUIRE memory order.
> +On the other hand, the memory operations after the spinlock unlock are
> +allowed to move to the critical section, but the memory operations in
> +the critical section are not allowed to move below the unlock. So the
> +full barrier in the store operation can use RELEASE memory order.
> +
> +Reader-Writer Concurrency
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Lock-free reader-writer concurrency is one of the common use cases in DPDK.
> +
> +The payload or the data that the writer wants to communicate to the
> +reader, can be written with RELAXED memory order. However, the guard
> +variable should be written with RELEASE memory order. This ensures that
> +the store to guard variable is observable only after the store to payload is
> observable.
> +
> +Correspondingly, on the reader side, the guard variable should be read
> +with ACQUIRE memory order. The payload or the data the writer
> +communicated, can be read with RELAXED memory order. This ensures that,
> +if the store to guard variable is observable, the store to payload is also
> observable.
> +
>  Coding Considerations
>  ---------------------
> 
> --
> 2.7.4
  
Phil Yang July 17, 2020, 4:44 a.m. UTC | #3
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> writes:

<snip>
> 
> > Subject: [PATCH v8 1/3] doc: add optimizations using C11 atomic built-ins
> >
> > Add information about possible optimizations using C11 atomic built-ins.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Thanks for the changes, they look good now.
> 
> David wanted to change 'built-ins' to 'builtins', otherwise
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Will do in the next version.
Thanks.

> 
> > ---
> >  doc/guides/prog_guide/writing_efficient_code.rst | 59
> > +++++++++++++++++++++++-
> >  1 file changed, 58 insertions(+), 1 deletion(-)
> >
> > diff --git a/doc/guides/prog_guide/writing_efficient_code.rst
> > b/doc/guides/prog_guide/writing_efficient_code.rst
> > index 849f63e..53a1ca1 100644
> > --- a/doc/guides/prog_guide/writing_efficient_code.rst
> > +++ b/doc/guides/prog_guide/writing_efficient_code.rst
> > @@ -167,7 +167,13 @@ but with the added cost of lower throughput.
> >  Locks and Atomic Operations
> >  ---------------------------
> >
> > -Atomic operations imply a lock prefix before the instruction,
> > +This section describes some key considerations when using locks and
> > +atomic operations in the DPDK environment.
> > +
> > +Locks
> > +~~~~~
> > +
> > +On x86, atomic operations imply a lock prefix before the instruction,
> >  causing the processor's LOCK# signal to be asserted during execution of
> the
> > following instruction.
> >  This has a big impact on performance in a multicore environment.
> >
> > @@ -176,6 +182,57 @@ It can often be replaced by other solutions like per-
> > lcore variables.
> >  Also, some locking techniques are more efficient than others.
> >  For instance, the Read-Copy-Update (RCU) algorithm can frequently
> replace
> > simple rwlocks.
> >
> > +Atomic Operations: Use C11 Atomic Built-ins
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +DPDK generic rte_atomic operations are implemented by __sync built-ins.
> > +These __sync built-ins result in full barriers on aarch64, which are
> > +unnecessary in many use cases. They can be replaced by __atomic
> > +built-ins that conform to the C11 memory model and provide finer
> memory
> > order control.
> > +
> > +So replacing the rte_atomic operations with __atomic built-ins might
> > +improve performance for aarch64 machines.
> > +
> > +Some typical optimization cases are listed below:
> > +
> > +Atomicity
> > +^^^^^^^^^
> > +
> > +Some use cases require atomicity alone, the ordering of the memory
> > +operations does not matter. For example, the packet statistics counters
> > +need to be incremented atomically but do not need any particular
> memory
> > ordering.
> > +So, RELAXED memory ordering is sufficient.
> > +
> > +One-way Barrier
> > +^^^^^^^^^^^^^^^
> > +
> > +Some use cases allow for memory reordering in one way while requiring
> > +memory ordering in the other direction.
> > +
> > +For example, the memory operations before the spinlock lock are allowed
> > +to move to the critical section, but the memory operations in the
> > +critical section are not allowed to move above the lock. In this case,
> > +the full memory barrier in the compare-and-swap operation can be
> replaced
> > with ACQUIRE memory order.
> > +On the other hand, the memory operations after the spinlock unlock are
> > +allowed to move to the critical section, but the memory operations in
> > +the critical section are not allowed to move below the unlock. So the
> > +full barrier in the store operation can use RELEASE memory order.
> > +
> > +Reader-Writer Concurrency
> > +^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Lock-free reader-writer concurrency is one of the common use cases in
> DPDK.
> > +
> > +The payload or the data that the writer wants to communicate to the
> > +reader, can be written with RELAXED memory order. However, the guard
> > +variable should be written with RELEASE memory order. This ensures that
> > +the store to guard variable is observable only after the store to payload is
> > observable.
> > +
> > +Correspondingly, on the reader side, the guard variable should be read
> > +with ACQUIRE memory order. The payload or the data the writer
> > +communicated, can be read with RELAXED memory order. This ensures
> that,
> > +if the store to guard variable is observable, the store to payload is also
> > observable.
> > +
> >  Coding Considerations
> >  ---------------------
> >
> > --
> > 2.7.4
  

Patch

diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..53a1ca1 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@  but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,57 @@  It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK generic rte_atomic operations are implemented by __sync built-ins. These
+__sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by __atomic built-ins that conform to
+the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone, the ordering of the memory operations
+does not matter. For example, the packet statistics counters need to be
+incremented atomically but do not need any particular memory ordering.
+So, RELAXED memory ordering is sufficient.
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the spinlock lock are allowed to
+move to the critical section, but the memory operations in the critical section
+are not allowed to move above the lock. In this case, the full memory barrier
+in the compare-and-swap operation can be replaced with ACQUIRE memory order.
+On the other hand, the memory operations after the spinlock unlock are allowed
+to move to the critical section, but the memory operations in the critical
+section are not allowed to move below the unlock. So the full barrier in the
+store operation can use RELEASE memory order.
+
+Reader-Writer Concurrency
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Lock-free reader-writer concurrency is one of the common use cases in DPDK.
+
+The payload or the data that the writer wants to communicate to the reader,
+can be written with RELAXED memory order. However, the guard variable should
+be written with RELEASE memory order. This ensures that the store to guard
+variable is observable only after the store to payload is observable.
+
+Correspondingly, on the reader side, the guard variable should be read
+with ACQUIRE memory order. The payload or the data the writer communicated,
+can be read with RELAXED memory order. This ensures that, if the store to
+guard variable is observable, the store to payload is also observable.
+
 Coding Considerations
 ---------------------