[v7,1/3] doc: add generic atomic deprecation section

Message ID 1594621423-14796-2-git-send-email-phil.yang@arm.com (mailing list archive)
State Superseded, archived
Delegated to: David Marchand
Headers
Series generic rte atomic APIs deprecate proposal |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Performance-Testing fail build patch failure
ci/Intel-compilation success Compilation OK
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-testing success Testing PASS

Commit Message

Phil Yang July 13, 2020, 6:23 a.m. UTC
  Add deprecating the generic rte_atomic_xx APIs to C11 atomic built-ins
guide and examples.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 doc/guides/prog_guide/writing_efficient_code.rst | 64 +++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)
  

Comments

Honnappa Nagarahalli July 14, 2020, 6:36 p.m. UTC | #1
Hi Phil,
	Few comments.

<snip>

> Subject: [PATCH v7 1/3] doc: add generic atomic deprecation section
I think we should change this as follows:
"doc: add optimizations using C11 atomic built-ins"

> 
> Add deprecating the generic rte_atomic_xx APIs to C11 atomic built-ins guide
> and examples.
I think same here, something like following would help:
"Add information about possible optimizations using C11 atomic built-ins"

> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  doc/guides/prog_guide/writing_efficient_code.rst | 64
> +++++++++++++++++++++++-
>  1 file changed, 63 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/guides/prog_guide/writing_efficient_code.rst
> b/doc/guides/prog_guide/writing_efficient_code.rst
> index 849f63e..16d6188 100644
> --- a/doc/guides/prog_guide/writing_efficient_code.rst
> +++ b/doc/guides/prog_guide/writing_efficient_code.rst
> @@ -167,7 +167,13 @@ but with the added cost of lower throughput.
>  Locks and Atomic Operations
>  ---------------------------
> 
> -Atomic operations imply a lock prefix before the instruction,
> +This section describes some key considerations when using locks and
> +atomic operations in the DPDK environment.
> +
> +Locks
> +~~~~~
> +
> +On x86, atomic operations imply a lock prefix before the instruction,
>  causing the processor's LOCK# signal to be asserted during execution of the
> following instruction.
>  This has a big impact on performance in a multicore environment.
> 
> @@ -176,6 +182,62 @@ It can often be replaced by other solutions like per-
> lcore variables.
>  Also, some locking techniques are more efficient than others.
>  For instance, the Read-Copy-Update (RCU) algorithm can frequently replace
> simple rwlocks.
> 
> +Atomic Operations: Use C11 Atomic Built-ins
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +DPDK generic rte_atomic operations are implemented by `__sync built-ins
> +<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
We can skip the hyper-link. Google search on "__sync built-in functions" leads to this page easily.

> +These __sync built-ins result in full barriers on aarch64, which are
> +unnecessary in many use cases. They can be replaced by `__atomic
> +built-ins

> +<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_
Same here, we can skip the hyper-link.

> +that conform to the C11 memory model and provide finer memory order
> control.
> +
> +So replacing the rte_atomic operations with __atomic built-ins might
> +improve performance for aarch64 machines.
> +
> +Some typical optimization cases are listed below:
> +
> +Atomicity
> +^^^^^^^^^
> +
> +Some use cases require atomicity alone, the ordering of the memory
> +operations does not matter. For example the packets statistics in
> +``virtio_xmit()`` function of ``vhost`` example application. It just
> +updates the number of transmitted packets, no subsequent logic depends
Instead of referring to the code here, it is better to keep it generic. May be something like below?
"For example, the packet statistics counters need to be incremented atomically but do not need any particular memory ordering. So, RELAXED memory ordering is sufficient".

> +on these counters. So the RELAXED memory ordering is sufficient.
> +
> +One-way Barrier
> +^^^^^^^^^^^^^^^
> +
> +Some use cases allow for memory reordering in one way while requiring
> +memory ordering in the other direction.
> +
Spinlock is a good example, making is little more generic would be helpful.

> +For example, the memory operations before the ``rte_spinlock_lock()``
Instead of referring to "rte_spinlock_lock" can you just say spinlock lock
> +can move to the critical section, but the memory operations in the
     ^^^ are allowed to
> +critical section cannot move above the lock. In this case, the full
                                ^^^^^^ are not allowed to
> +memory barrier in the compare-and-swap operation can be replaced to
                                                                                                                                 ^^ with
> +ACQUIRE. On the other hand, the memory operations after the
     ^^^^^^^^ ACQUIRE memory order.
> +``rte_spinlock_unlock()`` can move to the critical section, but the
                                                  ^^^ are allowed to
> +memory operations in the critical section cannot move below the unlock.
                                                                                ^^^^^^ are not allowed to
> +So the full barrier in the STORE operation can be replaced with RELEASE.
I think this above statement can be replaced with something like below:

"So the full barrier in the store operation can use RELEASE memory order."

> +
> +Reader-Writer Concurrency
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Lock-free reader-writer concurrency is one of the common use cases in DPDK.
> +
> +The payload or the data that the writer wants to communicate to the
> +reader, can be written with RELAXED memory order. However, the guard
> +variable should be written with RELEASE memory order. This ensures that
> +the store to guard variable is observable only after the store to payload is
> observable.
> +Refer to ``rte_hash_cuckoo_insert_mw()`` for an example.
We can remove just this line above.

> +
> +Correspondingly, on the reader side, the guard variable should be read
> +with ACQUIRE memory order. The payload or the data the writer
> +communicated, can be read with RELAXED memory order. This ensures that,
> +if the store to guard variable is observable, the store to payload is also
> observable.
> +Refer to rte_hash ``search_one_bucket_lf()`` for an example.
Same here, suggest removing just the above line.

> +
>  Coding Considerations
>  ---------------------
> 
> --
> 2.7.4
  
Phil Yang July 15, 2020, 2:58 a.m. UTC | #2
Hi Honnappa,

Thanks for your comments and suggestions.
I Will update it in the next version.


Hi John,

I would greatly appreciate it if you kindly give me some feedback on this proposal.

Thanks,
Phil

> Hi Phil,
> 	Few comments.
> 
> <snip>
> 
> > Subject: [PATCH v7 1/3] doc: add generic atomic deprecation section
> I think we should change this as follows:
> "doc: add optimizations using C11 atomic built-ins"
> 
> >
> > Add deprecating the generic rte_atomic_xx APIs to C11 atomic built-ins
> guide
> > and examples.
> I think same here, something like following would help:
> "Add information about possible optimizations using C11 atomic built-ins"
> 
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >  doc/guides/prog_guide/writing_efficient_code.rst | 64
> > +++++++++++++++++++++++-
> >  1 file changed, 63 insertions(+), 1 deletion(-)
> >
> > diff --git a/doc/guides/prog_guide/writing_efficient_code.rst
> > b/doc/guides/prog_guide/writing_efficient_code.rst
> > index 849f63e..16d6188 100644
> > --- a/doc/guides/prog_guide/writing_efficient_code.rst
> > +++ b/doc/guides/prog_guide/writing_efficient_code.rst
> > @@ -167,7 +167,13 @@ but with the added cost of lower throughput.
> >  Locks and Atomic Operations
> >  ---------------------------
> >
> > -Atomic operations imply a lock prefix before the instruction,
> > +This section describes some key considerations when using locks and
> > +atomic operations in the DPDK environment.
> > +
> > +Locks
> > +~~~~~
> > +
> > +On x86, atomic operations imply a lock prefix before the instruction,
> >  causing the processor's LOCK# signal to be asserted during execution of
> the
> > following instruction.
> >  This has a big impact on performance in a multicore environment.
> >
> > @@ -176,6 +182,62 @@ It can often be replaced by other solutions like per-
> > lcore variables.
> >  Also, some locking techniques are more efficient than others.
> >  For instance, the Read-Copy-Update (RCU) algorithm can frequently
> replace
> > simple rwlocks.
> >
> > +Atomic Operations: Use C11 Atomic Built-ins
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +DPDK generic rte_atomic operations are implemented by `__sync built-ins
> > +<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
> We can skip the hyper-link. Google search on "__sync built-in functions"
> leads to this page easily.
> 
> > +These __sync built-ins result in full barriers on aarch64, which are
> > +unnecessary in many use cases. They can be replaced by `__atomic
> > +built-ins
> 
> > +<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_
> Same here, we can skip the hyper-link.
> 
> > +that conform to the C11 memory model and provide finer memory order
> > control.
> > +
> > +So replacing the rte_atomic operations with __atomic built-ins might
> > +improve performance for aarch64 machines.
> > +
> > +Some typical optimization cases are listed below:
> > +
> > +Atomicity
> > +^^^^^^^^^
> > +
> > +Some use cases require atomicity alone, the ordering of the memory
> > +operations does not matter. For example the packets statistics in
> > +``virtio_xmit()`` function of ``vhost`` example application. It just
> > +updates the number of transmitted packets, no subsequent logic
> depends
> Instead of referring to the code here, it is better to keep it generic. May be
> something like below?
> "For example, the packet statistics counters need to be incremented
> atomically but do not need any particular memory ordering. So, RELAXED
> memory ordering is sufficient".
> 
> > +on these counters. So the RELAXED memory ordering is sufficient.
> > +
> > +One-way Barrier
> > +^^^^^^^^^^^^^^^
> > +
> > +Some use cases allow for memory reordering in one way while requiring
> > +memory ordering in the other direction.
> > +
> Spinlock is a good example, making is little more generic would be helpful.
> 
> > +For example, the memory operations before the ``rte_spinlock_lock()``
> Instead of referring to "rte_spinlock_lock" can you just say spinlock lock
> > +can move to the critical section, but the memory operations in the
>      ^^^ are allowed to
> > +critical section cannot move above the lock. In this case, the full
>                                 ^^^^^^ are not allowed to
> > +memory barrier in the compare-and-swap operation can be replaced to
>                                                                                                                                  ^^ with
> > +ACQUIRE. On the other hand, the memory operations after the
>      ^^^^^^^^ ACQUIRE memory order.
> > +``rte_spinlock_unlock()`` can move to the critical section, but the
>                                                   ^^^ are allowed to
> > +memory operations in the critical section cannot move below the unlock.
>                                                                                 ^^^^^^ are not allowed to
> > +So the full barrier in the STORE operation can be replaced with RELEASE.
> I think this above statement can be replaced with something like below:
> 
> "So the full barrier in the store operation can use RELEASE memory order."
> 
> > +
> > +Reader-Writer Concurrency
> > +^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Lock-free reader-writer concurrency is one of the common use cases in
> DPDK.
> > +
> > +The payload or the data that the writer wants to communicate to the
> > +reader, can be written with RELAXED memory order. However, the guard
> > +variable should be written with RELEASE memory order. This ensures that
> > +the store to guard variable is observable only after the store to payload is
> > observable.
> > +Refer to ``rte_hash_cuckoo_insert_mw()`` for an example.
> We can remove just this line above.
> 
> > +
> > +Correspondingly, on the reader side, the guard variable should be read
> > +with ACQUIRE memory order. The payload or the data the writer
> > +communicated, can be read with RELAXED memory order. This ensures
> that,
> > +if the store to guard variable is observable, the store to payload is also
> > observable.
> > +Refer to rte_hash ``search_one_bucket_lf()`` for an example.
> Same here, suggest removing just the above line.
> 
> > +
> >  Coding Considerations
> >  ---------------------
> >
> > --
> > 2.7.4
  

Patch

diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..16d6188 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@  but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,62 @@  It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK generic rte_atomic operations are implemented by `__sync built-ins
+<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins
+<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_
+that conform to the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone, the ordering of the memory operations
+does not matter. For example the packets statistics in ``virtio_xmit()``
+function of ``vhost`` example application. It just updates the number of
+transmitted packets, no subsequent logic depends on these counters. So the
+RELAXED memory ordering is sufficient.
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the ``rte_spinlock_lock()`` can move
+to the critical section, but the memory operations in the critical section
+cannot move above the lock. In this case, the full memory barrier in the
+compare-and-swap operation can be replaced to ACQUIRE. On the other hand, the
+memory operations after the ``rte_spinlock_unlock()`` can move to the critical
+section, but the memory operations in the critical section cannot move below
+the unlock. So the full barrier in the STORE operation can be replaced with
+RELEASE.
+
+Reader-Writer Concurrency
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Lock-free reader-writer concurrency is one of the common use cases in DPDK.
+
+The payload or the data that the writer wants to communicate to the reader,
+can be written with RELAXED memory order. However, the guard variable should
+be written with RELEASE memory order. This ensures that the store to guard
+variable is observable only after the store to payload is observable.
+Refer to ``rte_hash_cuckoo_insert_mw()`` for an example.
+
+Correspondingly, on the reader side, the guard variable should be read
+with ACQUIRE memory order. The payload or the data the writer communicated,
+can be read with RELAXED memory order. This ensures that, if the store to
+guard variable is observable, the store to payload is also observable.
+Refer to rte_hash ``search_one_bucket_lf()`` for an example.
+
 Coding Considerations
 ---------------------