[v6,1/4] doc: add generic atomic deprecation section

Message ID: 1594115449-13750-2-git-send-email-phil.yang@arm.com (mailing list archive)
State: Superseded, archived
Delegated to: David Marchand
Series: generic rte atomic APIs deprecate proposal

Checks

Context                      Check    Description
ci/checkpatch                warning  coding style issues
ci/iol-broadcom-Performance  success  Performance Testing PASS
ci/Intel-compilation         success  Compilation OK
ci/iol-testing               success  Testing PASS
ci/iol-mellanox-Performance  success  Performance Testing PASS
ci/iol-intel-Performance     fail     Performance Testing issues

Commit Message

Phil Yang July 7, 2020, 9:50 a.m. UTC
Add a guide and examples for deprecating the generic rte_atomic_xx APIs
in favor of C11 atomic built-ins.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 doc/guides/prog_guide/writing_efficient_code.rst | 139 ++++++++++++++++++++++-
 1 file changed, 138 insertions(+), 1 deletion(-)
  

Comments

Thomas Monjalon July 10, 2020, 4:55 p.m. UTC | #1
Interestingly, John, our doc maintainer is not Cc'ed.
I add him.
Please use --cc-cmd devtools/get-maintainer.sh
I am expecting a review from an x86 maintainer as well.
If no maintainer replies, ping them.

07/07/2020 11:50, Phil Yang:
> Add deprecating the generic rte_atomic_xx APIs to c11 atomic built-ins
> guide and examples.
[...]
> +Atomic Operations: Use C11 Atomic Built-ins
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are

Why this github link on 20.02?

Please try to keep lines small.

> +implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.

Long links should be on their own line to avoid long lines.

> +These __sync built-ins result in full barriers on aarch64, which are unnecessary
> +in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
> +conform to the C11 memory model and provide finer memory order control.
> +
> +So replacing the rte_atomic operations with __atomic built-ins might improve
> +performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.

"More details."
Please make a sentence.

> +
> +Some typical optimization cases are listed below:
> +
> +Atomicity
> +^^^^^^^^^
> +
> +Some use cases require atomicity alone, the ordering of the memory operations
> +does not matter. For example the packets statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.

Again github.
If you really want a web link, use code.dpdk.org or doc.dpdk.org/api

But why give a code example at all?

> +
> +It just updates the number of transmitted packets, no subsequent logic depends
> +on these counters. So the RELAXED memory ordering is sufficient:
> +
> +.. code-block:: c
> +
> +    static __rte_always_inline void
> +    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
> +            struct rte_mbuf *m)
> +    {
> +        ...
> +        ...
> +        if (enable_stats) {
> +            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
> +            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
> +            ...
> +        }
> +    }

I don't see how adding real code helps here.
Why not just mention __atomic_add_fetch and __ATOMIC_RELAXED?
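
For readers following the thread, the relaxed-counter idea under discussion can be shown without quoting application code. The following is only a minimal sketch, assuming the GCC/Clang __atomic built-ins; struct pkt_stats and both helper functions are hypothetical names for illustration and do not exist in DPDK.

```c
#include <stdint.h>

/* Hypothetical statistics block; the names are illustrative only. */
struct pkt_stats {
	uint64_t rx_total;	/* number of bursts seen */
	uint64_t rx_ok;		/* number of packets accepted */
};

/*
 * Account for one burst. Nothing else is ordered against these counters,
 * so only atomicity is needed and RELAXED ordering is sufficient.
 */
static inline void
pkt_stats_update(struct pkt_stats *s, uint64_t ok)
{
	__atomic_add_fetch(&s->rx_total, 1, __ATOMIC_RELAXED);
	__atomic_add_fetch(&s->rx_ok, ok, __ATOMIC_RELAXED);
}

/* Read a counter; a relaxed load avoids a torn read without a barrier. */
static inline uint64_t
pkt_stats_read_total(struct pkt_stats *s)
{
	return __atomic_load_n(&s->rx_total, __ATOMIC_RELAXED);
}
```

A caller would invoke pkt_stats_update() once per burst from any lcore; the counters stay exact, but no ordering against other memory accesses is implied.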

> +
> +One-way Barrier
> +^^^^^^^^^^^^^^^
> +
> +Some use cases allow for memory reordering in one way while requiring memory
> +ordering in the other direction.
> +
> +For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move to the
> +critical section, but the memory operations in the critical section cannot move
> +above the lock. In this case, the full memory barrier in the CAS operation can
> +be replaced to ACQUIRE. On the other hand, the memory operations after the
> +`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move to the critical section, but the memory operations in the
> +critical section cannot move below the unlock. So the full barrier in the STORE
> +operation can be replaced with RELEASE.

Again github links instead of our doxygen.
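
The acquire/release pairing described in the quoted "One-way Barrier" text can be sketched as a tiny spinlock. This is an illustration only, assuming the GCC/Clang __atomic built-ins; struct sketch_spinlock and its functions are hypothetical and are not the rte_spinlock implementation.

```c
#include <stdbool.h>

/* Hypothetical minimal spinlock: 0 means unlocked, 1 means locked. */
struct sketch_spinlock {
	int locked;
};

/*
 * Try to take the lock with a compare-and-swap.
 * ACQUIRE on success: accesses in the critical section cannot move above
 * the lock, while earlier accesses may still sink into it.
 * RELAXED on failure: nothing was acquired, so no ordering is needed.
 */
static inline bool
sketch_spinlock_trylock(struct sketch_spinlock *sl)
{
	int expected = 0;

	return __atomic_compare_exchange_n(&sl->locked, &expected, 1, false,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static inline void
sketch_spinlock_lock(struct sketch_spinlock *sl)
{
	while (!sketch_spinlock_trylock(sl)) {
		/* Spin on relaxed loads until the lock looks free. */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
			;
	}
}

/*
 * RELEASE store: accesses in the critical section cannot move below the
 * unlock, while later accesses may still rise into it.
 */
static inline void
sketch_spinlock_unlock(struct sketch_spinlock *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```

The design choice here is that neither operation needs a full barrier: each side blocks reordering in one direction only, which is the saving the quoted text claims over the __sync built-ins.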

> +
> +Reader-Writer Concurrency
> +^^^^^^^^^^^^^^^^^^^^^^^^^

No blank line here?

> +Lock-free reader-writer concurrency is one of the common use cases in DPDK.
> +
> +The payload or the data that the writer wants to communicate to the reader,
> +can be written with RELAXED memory order. However, the guard variable should
> +be written with RELEASE memory order. This ensures that the store to guard
> +variable is observable only after the store to payload is observable.
> +Refer to `rte_hash insert <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L737>`_ for an example.

Hum...

> +
> +.. code-block:: c
> +
> +    static inline int32_t
> +    rte_hash_cuckoo_insert_mw(const struct rte_hash *h,
> +        ...
> +        int32_t *ret_val)
> +    {
> +        ...
> +        ...
> +
> +        /* Insert new entry if there is room in the primary
> +         * bucket.
> +         */
> +        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> +                /* Check if slot is available */
> +                if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
> +                        prim_bkt->sig_current[i] = sig;
> +                        /* Store to signature and key should not
> +                         * leak after the store to key_idx. i.e.
> +                         * key_idx is the guard variable for signature
> +                         * and key.
> +                         */
> +                        __atomic_store_n(&prim_bkt->key_idx[i],
> +                                         new_idx,
> +                                         __ATOMIC_RELEASE);
> +                        break;
> +                }
> +        }
> +
> +        ...
> +    }
> +
> +Correspondingly, on the reader side, the guard variable should be read
> +with ACQUIRE memory order. The payload or the data the writer communicated,
> +can be read with RELAXED memory order. This ensures that, if the store to
> +guard variable is observable, the store to payload is also observable. Refer to `rte_hash lookup <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L1215>`_ for an example.
> +
> +.. code-block:: c
> +
> +    static inline int32_t
> +    search_one_bucket_lf(const struct rte_hash *h, const void *key, uint16_t sig,
> +        void **data, const struct rte_hash_bucket *bkt)
> +    {
> +        ...
> +
> +        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> +            ....
> +            if (bkt->sig_current[i] == sig) {
> +                key_idx = __atomic_load_n(&bkt->key_idx[i],
> +                                        __ATOMIC_ACQUIRE);
> +                if (key_idx != EMPTY_SLOT) {
> +                    k = (struct rte_hash_key *) ((char *)keys +
> +                        key_idx * h->key_entry_size);
> +
> +                if (rte_hash_cmp_eq(key, k->key, h) == 0) {
> +                    if (data != NULL) {
> +                        *data = __atomic_load_n(&k->pdata,
> +                                        __ATOMIC_ACQUIRE);
> +                    }
> +
> +                    /*
> +                    * Return index where key is stored,
> +                    * subtracting the first dummy index
> +                    */
> +                    return key_idx - 1;
> +                }
> +            ...
> +    }
> +

NACK for the big chunks of real code.
Please use words and avoid code.

If you insist on keeping code in the doc, I will make you responsible
for updating all the code we already have in the doc :)
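
The guard-variable pattern from the quoted "Reader-Writer Concurrency" text can be reduced to a few lines that do not depend on rte_hash internals. This is a minimal sketch, assuming the GCC/Clang __atomic built-ins and a single writer; struct mailbox, mailbox_publish() and mailbox_consume() are hypothetical names for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-slot mailbox: `ready` guards `payload`. */
struct mailbox {
	uint32_t payload;
	uint32_t ready;
};

/* Writer: store the payload RELAXED, then publish the guard with RELEASE. */
static inline void
mailbox_publish(struct mailbox *mb, uint32_t value)
{
	__atomic_store_n(&mb->payload, value, __ATOMIC_RELAXED);
	/* The payload store cannot be reordered after this store. */
	__atomic_store_n(&mb->ready, 1, __ATOMIC_RELEASE);
}

/* Reader: load the guard with ACQUIRE, then the payload RELAXED. */
static inline bool
mailbox_consume(struct mailbox *mb, uint32_t *out)
{
	if (__atomic_load_n(&mb->ready, __ATOMIC_ACQUIRE) == 0)
		return false;	/* nothing published yet */
	/* The guard was observed, so the payload store is visible too. */
	*out = __atomic_load_n(&mb->payload, __ATOMIC_RELAXED);
	return true;
}
```

The RELEASE store and the ACQUIRE load on the same guard variable form the synchronization; the payload accesses themselves carry no ordering cost.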
  
Honnappa Nagarahalli July 10, 2020, 11:47 p.m. UTC | #2
<snip>

> Subject: Re: [dpdk-dev] [PATCH v6 1/4] doc: add generic atomic deprecation
> section
> 
> Interestingly, John, our doc maintainer is not Cc'ed.
> I add him.
> Please use --cc-cmd devtools/get-maintainer.sh
> I am expecting a review from an x86 maintainer as well.
> If no maintainer replies, ping them.
> 
> 07/07/2020 11:50, Phil Yang:
> > Add deprecating the generic rte_atomic_xx APIs to c11 atomic built-ins
> > guide and examples.
> [...]
> > +Atomic Operations: Use C11 Atomic Built-ins
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
> 
> Why this github link on 20.02?
> 
> Please try to keep lines small.
> 
> > +implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
> 
> Long links should be on their own line to avoid long lines.
> 
> > +These __sync built-ins result in full barriers on aarch64, which are unnecessary
> > +in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
> > +conform to the C11 memory model and provide finer memory order control.
> > +
> > +So replacing the rte_atomic operations with __atomic built-ins might improve
> > +performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
> 
> "More details."
> Please make a sentence.
The full stop is after the link. But, I think we will remove this link as well.

> 
> > +
> > +Some typical optimization cases are listed below:
> > +
> > +Atomicity
> > +^^^^^^^^^
> > +
> > +Some use cases require atomicity alone, the ordering of the memory operations
> > +does not matter. For example the packets statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
> 
> Again github.
> If you really want a web link, use code.dpdk.org or doc.dpdk.org/api
> 
> But why give a code example at all?
> 
> > +
> > +It just updates the number of transmitted packets, no subsequent logic depends
> > +on these counters. So the RELAXED memory ordering is sufficient:
> > +
> > +.. code-block:: c
> > +
> > +    static __rte_always_inline void
> > +    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
> > +            struct rte_mbuf *m)
> > +    {
> > +        ...
> > +        ...
> > +        if (enable_stats) {
> > +            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
> > +            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
> > +            ...
> > +        }
> > +    }
> 
> I don't see how adding real code helps here.
> Why not just mention __atomic_add_fetch and __ATOMIC_RELAXED?
> 
> > +
> > +One-way Barrier
> > +^^^^^^^^^^^^^^^
> > +
> > +Some use cases allow for memory reordering in one way while requiring
> > +memory ordering in the other direction.
> > +
> > +For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move to the
> > +critical section, but the memory operations in the critical section cannot move
> > +above the lock. In this case, the full memory barrier in the CAS operation can
> > +be replaced to ACQUIRE. On the other hand, the memory operations after the
> > +`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move to the critical section, but the memory operations in the
> > +critical section cannot move below the unlock. So the full barrier in the STORE
> > +operation can be replaced with RELEASE.
> 
> Again github links instead of our doxygen.
> 
> > +
> > +Reader-Writer Concurrency
> > +^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> No blank line here?
Will fix

> 
> > +Lock-free reader-writer concurrency is one of the common use cases in DPDK.
> > +
> > +The payload or the data that the writer wants to communicate to the reader,
> > +can be written with RELAXED memory order. However, the guard variable should
> > +be written with RELEASE memory order. This ensures that the store to guard
> > +variable is observable only after the store to payload is observable.
> > +Refer to `rte_hash insert <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L737>`_ for an example.
> 
> Hum...
> 
> > +
> > +.. code-block:: c
> > +
> > +    static inline int32_t
> > +    rte_hash_cuckoo_insert_mw(const struct rte_hash *h,
> > +        ...
> > +        int32_t *ret_val)
> > +    {
> > +        ...
> > +        ...
> > +
> > +        /* Insert new entry if there is room in the primary
> > +         * bucket.
> > +         */
> > +        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> > +                /* Check if slot is available */
> > +                if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
> > +                        prim_bkt->sig_current[i] = sig;
> > +                        /* Store to signature and key should not
> > +                         * leak after the store to key_idx. i.e.
> > +                         * key_idx is the guard variable for signature
> > +                         * and key.
> > +                         */
> > +                        __atomic_store_n(&prim_bkt->key_idx[i],
> > +                                         new_idx,
> > +                                         __ATOMIC_RELEASE);
> > +                        break;
> > +                }
> > +        }
> > +
> > +        ...
> > +    }
> > +
> > +Correspondingly, on the reader side, the guard variable should be read
> > +with ACQUIRE memory order. The payload or the data the writer communicated,
> > +can be read with RELAXED memory order. This ensures that, if the store to
> > +guard variable is observable, the store to payload is also observable. Refer to `rte_hash lookup <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L1215>`_ for an example.
> > +
> > +.. code-block:: c
> > +
> > +    static inline int32_t
> > +    search_one_bucket_lf(const struct rte_hash *h, const void *key, uint16_t sig,
> > +        void **data, const struct rte_hash_bucket *bkt)
> > +    {
> > +        ...
> > +
> > +        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> > +            ....
> > +            if (bkt->sig_current[i] == sig) {
> > +                key_idx = __atomic_load_n(&bkt->key_idx[i],
> > +                                        __ATOMIC_ACQUIRE);
> > +                if (key_idx != EMPTY_SLOT) {
> > +                    k = (struct rte_hash_key *) ((char *)keys +
> > +                        key_idx * h->key_entry_size);
> > +
> > +                if (rte_hash_cmp_eq(key, k->key, h) == 0) {
> > +                    if (data != NULL) {
> > +                        *data = __atomic_load_n(&k->pdata,
> > +                                        __ATOMIC_ACQUIRE);
> > +                    }
> > +
> > +                    /*
> > +                    * Return index where key is stored,
> > +                    * subtracting the first dummy index
> > +                    */
> > +                    return key_idx - 1;
> > +                }
> > +            ...
> > +    }
> > +
> 
> NACK for the big chunks of real code.
> Please use words and avoid code.
> 
> If you insist on keeping code in the doc, I will make you responsible
> for updating all the code we already have in the doc :)
Ok, understood, will re-spin.

>
  

Patch

diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..3bd2601 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@  but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,137 @@  It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
+implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
+conform to the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone, the ordering of the memory operations
+does not matter. For example the packets statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
+
+It just updates the number of transmitted packets, no subsequent logic depends
+on these counters. So the RELAXED memory ordering is sufficient:
+
+.. code-block:: c
+
+    static __rte_always_inline void
+    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
+            struct rte_mbuf *m)
+    {
+        ...
+        ...
+        if (enable_stats) {
+            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
+            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
+            ...
+        }
+    }
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move to the
+critical section, but the memory operations in the critical section cannot move
+above the lock. In this case, the full memory barrier in the CAS operation can
+be replaced to ACQUIRE. On the other hand, the memory operations after the
+`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move to the critical section, but the memory operations in the
+critical section cannot move below the unlock. So the full barrier in the STORE
+operation can be replaced with RELEASE.
+
+Reader-Writer Concurrency
+^^^^^^^^^^^^^^^^^^^^^^^^^
+Lock-free reader-writer concurrency is one of the common use cases in DPDK.
+
+The payload or the data that the writer wants to communicate to the reader,
+can be written with RELAXED memory order. However, the guard variable should
+be written with RELEASE memory order. This ensures that the store to guard
+variable is observable only after the store to payload is observable.
+Refer to `rte_hash insert <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L737>`_ for an example.
+
+.. code-block:: c
+
+    static inline int32_t
+    rte_hash_cuckoo_insert_mw(const struct rte_hash *h,
+        ...
+        int32_t *ret_val)
+    {
+        ...
+        ...
+
+        /* Insert new entry if there is room in the primary
+         * bucket.
+         */
+        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+                /* Check if slot is available */
+                if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
+                        prim_bkt->sig_current[i] = sig;
+                        /* Store to signature and key should not
+                         * leak after the store to key_idx. i.e.
+                         * key_idx is the guard variable for signature
+                         * and key.
+                         */
+                        __atomic_store_n(&prim_bkt->key_idx[i],
+                                         new_idx,
+                                         __ATOMIC_RELEASE);
+                        break;
+                }
+        }
+
+        ...
+    }
+
+Correspondingly, on the reader side, the guard variable should be read
+with ACQUIRE memory order. The payload or the data the writer communicated,
+can be read with RELAXED memory order. This ensures that, if the store to
+guard variable is observable, the store to payload is also observable. Refer to `rte_hash lookup <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L1215>`_ for an example.
+
+.. code-block:: c
+
+    static inline int32_t
+    search_one_bucket_lf(const struct rte_hash *h, const void *key, uint16_t sig,
+        void **data, const struct rte_hash_bucket *bkt)
+    {
+        ...
+
+        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+            ....
+            if (bkt->sig_current[i] == sig) {
+                key_idx = __atomic_load_n(&bkt->key_idx[i],
+                                        __ATOMIC_ACQUIRE);
+                if (key_idx != EMPTY_SLOT) {
+                    k = (struct rte_hash_key *) ((char *)keys +
+                        key_idx * h->key_entry_size);
+
+                if (rte_hash_cmp_eq(key, k->key, h) == 0) {
+                    if (data != NULL) {
+                        *data = __atomic_load_n(&k->pdata,
+                                        __ATOMIC_ACQUIRE);
+                    }
+
+                    /*
+                    * Return index where key is stored,
+                    * subtracting the first dummy index
+                    */
+                    return key_idx - 1;
+                }
+            ...
+    }
+
 Coding Considerations
 ---------------------