Message ID | 1594115449-13750-2-git-send-email-phil.yang@arm.com
---|---
State | Superseded, archived
Delegated to | David Marchand
Context | Check | Description
---|---|---
ci/iol-intel-Performance | fail | Performance Testing issues |
ci/iol-mellanox-Performance | success | Performance Testing PASS |
ci/iol-testing | success | Testing PASS |
ci/Intel-compilation | success | Compilation OK |
ci/iol-broadcom-Performance | success | Performance Testing PASS |
ci/checkpatch | warning | coding style issues |
Interestingly, John, our doc maintainer, is not Cc'ed. I have added him.
Please use --cc-cmd devtools/get-maintainer.sh.
I am expecting a review from an x86 maintainer as well.
If no maintainer replies, ping them.

07/07/2020 11:50, Phil Yang:
> Add deprecating the generic rte_atomic_xx APIs to c11 atomic built-ins
> guide and examples.
[...]
> +Atomic Operations: Use C11 Atomic Built-ins
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are

Why this github link on 20.02?

Please try to keep lines small.

> +implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.

Long links should be on their own line to avoid long lines.

> +These __sync built-ins result in full barriers on aarch64, which are unnecessary
> +in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
> +conform to the C11 memory model and provide finer memory order control.
> +
> +So replacing the rte_atomic operations with __atomic built-ins might improve
> +performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.

"More details."
Please make a sentence.

> +
> +Some typical optimization cases are listed below:
> +
> +Atomicity
> +^^^^^^^^^
> +
> +Some use cases require atomicity alone, the ordering of the memory operations
> +does not matter. For example the packets statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.

Again github.
If you really want a web link, use code.dpdk.org or doc.dpdk.org/api

But why give a code example at all?

> +
> +It just updates the number of transmitted packets, no subsequent logic depends
> +on these counters. So the RELAXED memory ordering is sufficient:
> +
> +.. code-block:: c
> +
> +    static __rte_always_inline void
> +    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
> +            struct rte_mbuf *m)
> +    {
> +    ...
> +    ...
> +        if (enable_stats) {
> +            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
> +            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
> +        ...
> +        }
> +    }

I don't see how adding real code helps here.
Why not just mention __atomic_add_fetch and __ATOMIC_RELAXED?

> +
> +One-way Barrier
> +^^^^^^^^^^^^^^^
> +
> +Some use cases allow for memory reordering in one way while requiring memory
> +ordering in the other direction.
> +
> +For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move to the
> +critical section, but the memory operations in the critical section cannot move
> +above the lock. In this case, the full memory barrier in the CAS operation can
> +be replaced to ACQUIRE. On the other hand, the memory operations after the
> +`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move to the critical section, but the memory operations in the
> +critical section cannot move below the unlock. So the full barrier in the STORE
> +operation can be replaced with RELEASE.

Again github links instead of our doxygen.

> +
> +Reader-Writer Concurrency
> +^^^^^^^^^^^^^^^^^^^^^^^^^

No blank line here?

> +Lock-free reader-writer concurrency is one of the common use cases in DPDK.
> +
> +The payload or the data that the writer wants to communicate to the reader,
> +can be written with RELAXED memory order. However, the guard variable should
> +be written with RELEASE memory order. This ensures that the store to guard
> +variable is observable only after the store to payload is observable.
> +Refer to `rte_hash insert <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L737>`_ for an example.

Hum...

> +
> +.. code-block:: c
> +
> +    static inline int32_t
> +    rte_hash_cuckoo_insert_mw(const struct rte_hash *h,
> +    ...
> +            int32_t *ret_val)
> +    {
> +    ...
> +    ...
> +
> +        /* Insert new entry if there is room in the primary
> +         * bucket.
> +         */
> +        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> +            /* Check if slot is available */
> +            if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
> +                prim_bkt->sig_current[i] = sig;
> +                /* Store to signature and key should not
> +                 * leak after the store to key_idx. i.e.
> +                 * key_idx is the guard variable for signature
> +                 * and key.
> +                 */
> +                __atomic_store_n(&prim_bkt->key_idx[i],
> +                         new_idx,
> +                         __ATOMIC_RELEASE);
> +                break;
> +            }
> +        }
> +
> +    ...
> +    }
> +
> +Correspondingly, on the reader side, the guard variable should be read
> +with ACQUIRE memory order. The payload or the data the writer communicated,
> +can be read with RELAXED memory order. This ensures that, if the store to
> +guard variable is observable, the store to payload is also observable.
> +Refer to `rte_hash lookup <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L1215>`_ for an example.
> +
> +.. code-block:: c
> +
> +    static inline int32_t
> +    search_one_bucket_lf(const struct rte_hash *h, const void *key, uint16_t sig,
> +            void **data, const struct rte_hash_bucket *bkt)
> +    {
> +    ...
> +
> +        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> +        ....
> +            if (bkt->sig_current[i] == sig) {
> +                key_idx = __atomic_load_n(&bkt->key_idx[i],
> +                          __ATOMIC_ACQUIRE);
> +                if (key_idx != EMPTY_SLOT) {
> +                    k = (struct rte_hash_key *) ((char *)keys +
> +                        key_idx * h->key_entry_size);
> +
> +                    if (rte_hash_cmp_eq(key, k->key, h) == 0) {
> +                        if (data != NULL) {
> +                            *data = __atomic_load_n(&k->pdata,
> +                                    __ATOMIC_ACQUIRE);
> +                        }
> +
> +                        /*
> +                         * Return index where key is stored,
> +                         * subtracting the first dummy index
> +                         */
> +                        return key_idx - 1;
> +                    }
> +        ...
> +        }
> +

NACK for the big chunks of real code.
Please use words and avoid code.

If you insist on keeping code in doc, I will make you responsible for
updating all the code we have already in the doc :)
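For readers skimming this thread: the one-way barrier pattern under
discussion can be expressed in a few lines with the GCC __atomic built-ins.
What follows is a minimal sketch with invented names (toy_spinlock_t and its
functions are illustrative, not the actual rte_spinlock implementation):

.. code-block:: c

    /* Minimal illustrative spinlock; not the real rte_spinlock code. */
    typedef struct {
        int locked; /* 0 = free, 1 = held */
    } toy_spinlock_t;

    static inline void
    toy_spinlock_lock(toy_spinlock_t *sl)
    {
        int exp = 0;

        /* ACQUIRE is enough on success: accesses inside the critical
         * section cannot be reordered before the CAS, while earlier
         * accesses may still sink into it.
         */
        while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
                __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            exp = 0; /* on failure the CAS wrote the current value here */
    }

    static inline void
    toy_spinlock_unlock(toy_spinlock_t *sl)
    {
        /* RELEASE is enough: accesses inside the critical section
         * cannot be reordered after this store, while later accesses
         * may still hoist into it.
         */
        __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
    }

On aarch64 the ACQUIRE/RELEASE pair compiles to one-way barriers instead of
the two full barriers a __sync-based implementation would emit.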
<snip>
> Subject: Re: [dpdk-dev] [PATCH v6 1/4] doc: add generic atomic deprecation
> section
>
> Interestingly, John, our doc maintainer, is not Cc'ed.
> I have added him.
> Please use --cc-cmd devtools/get-maintainer.sh.
> I am expecting a review from an x86 maintainer as well.
> If no maintainer replies, ping them.

<snip>

> > +So replacing the rte_atomic operations with __atomic built-ins might
> > +improve performance for aarch64 machines. `More details
> > +<https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
>
> "More details."
> Please make a sentence.

The full stop is after the link, but I think we will remove this link as well.

<snip>

> > +Reader-Writer Concurrency
> > +^^^^^^^^^^^^^^^^^^^^^^^^^
>
> No blank line here?

Will fix.

<snip>

> NACK for the big chunks of real code.
> Please use words and avoid code.
>
> If you insist on keeping code in doc, I will make you responsible for
> updating all the code we have already in the doc :)

Ok, understood, will re-spin.
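As an aside, the statistics case debated above really does boil down to a
single built-in call. A hypothetical stand-alone counter, with a made-up
struct and field rather than the vhost example's, might look like this:

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical statistics block; names are illustrative. */
    struct pkt_stats {
        uint64_t tx_total;
    };

    static inline void
    stats_add(struct pkt_stats *s, uint64_t nb_pkts)
    {
        /* Atomicity is needed but ordering is not: no subsequent logic
         * depends on when this update becomes visible, so RELAXED
         * avoids the full barrier a __sync built-in emits on aarch64.
         */
        __atomic_add_fetch(&s->tx_total, nb_pkts, __ATOMIC_RELAXED);
    }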
diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..3bd2601 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of
 the following instruction.
 This has a big impact on performance in a multicore environment.
@@ -176,6 +182,137 @@ It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace
 simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
+implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
+conform to the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone, the ordering of the memory operations
+does not matter. For example the packets statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
+
+It just updates the number of transmitted packets, no subsequent logic depends
+on these counters. So the RELAXED memory ordering is sufficient:
+
+.. code-block:: c
+
+    static __rte_always_inline void
+    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
+            struct rte_mbuf *m)
+    {
+    ...
+    ...
+        if (enable_stats) {
+            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
+            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
+        ...
+        }
+    }
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move to the
+critical section, but the memory operations in the critical section cannot move
+above the lock. In this case, the full memory barrier in the CAS operation can
+be replaced to ACQUIRE. On the other hand, the memory operations after the
+`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move to the critical section, but the memory operations in the
+critical section cannot move below the unlock. So the full barrier in the STORE
+operation can be replaced with RELEASE.
+
+Reader-Writer Concurrency
+^^^^^^^^^^^^^^^^^^^^^^^^^
+Lock-free reader-writer concurrency is one of the common use cases in DPDK.
+
+The payload or the data that the writer wants to communicate to the reader,
+can be written with RELAXED memory order. However, the guard variable should
+be written with RELEASE memory order. This ensures that the store to guard
+variable is observable only after the store to payload is observable.
+Refer to `rte_hash insert <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L737>`_ for an example.
+
+.. code-block:: c
+
+    static inline int32_t
+    rte_hash_cuckoo_insert_mw(const struct rte_hash *h,
+    ...
+            int32_t *ret_val)
+    {
+    ...
+    ...
+
+        /* Insert new entry if there is room in the primary
+         * bucket.
+         */
+        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+            /* Check if slot is available */
+            if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
+                prim_bkt->sig_current[i] = sig;
+                /* Store to signature and key should not
+                 * leak after the store to key_idx. i.e.
+                 * key_idx is the guard variable for signature
+                 * and key.
+                 */
+                __atomic_store_n(&prim_bkt->key_idx[i],
+                         new_idx,
+                         __ATOMIC_RELEASE);
+                break;
+            }
+        }
+
+    ...
+    }
+
+Correspondingly, on the reader side, the guard variable should be read
+with ACQUIRE memory order. The payload or the data the writer communicated,
+can be read with RELAXED memory order. This ensures that, if the store to
+guard variable is observable, the store to payload is also observable.
+Refer to `rte_hash lookup <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L1215>`_ for an example.
+
+.. code-block:: c
+
+    static inline int32_t
+    search_one_bucket_lf(const struct rte_hash *h, const void *key, uint16_t sig,
+            void **data, const struct rte_hash_bucket *bkt)
+    {
+    ...
+
+        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+        ....
+            if (bkt->sig_current[i] == sig) {
+                key_idx = __atomic_load_n(&bkt->key_idx[i],
+                          __ATOMIC_ACQUIRE);
+                if (key_idx != EMPTY_SLOT) {
+                    k = (struct rte_hash_key *) ((char *)keys +
+                        key_idx * h->key_entry_size);
+
+                    if (rte_hash_cmp_eq(key, k->key, h) == 0) {
+                        if (data != NULL) {
+                            *data = __atomic_load_n(&k->pdata,
+                                    __ATOMIC_ACQUIRE);
+                        }
+
+                        /*
+                         * Return index where key is stored,
+                         * subtracting the first dummy index
+                         */
+                        return key_idx - 1;
+                    }
+        ...
+        }
+
 Coding Considerations
 ---------------------
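To close the loop on the reader-writer discussion: the guard-variable
pattern the patch describes can also be conveyed without quoting rte_hash.
Below is a minimal sketch under assumed names (struct slot, guard, payload
are all illustrative, not DPDK APIs):

.. code-block:: c

    #include <stdint.h>

    /* Illustrative slot, not an rte_hash structure; guard == 0 = empty. */
    struct slot {
        uint32_t payload;
        uint32_t guard;
    };

    static inline void
    writer_publish(struct slot *s, uint32_t data, uint32_t key)
    {
        s->payload = data; /* plain store, ordered by the RELEASE below */

        /* The guard must become observable only after the payload. */
        __atomic_store_n(&s->guard, key, __ATOMIC_RELEASE);
    }

    static inline int
    reader_consume(struct slot *s, uint32_t *data)
    {
        /* ACQUIRE pairs with the writer's RELEASE: if the guard is
         * observed, the payload store is guaranteed visible too.
         */
        uint32_t key = __atomic_load_n(&s->guard, __ATOMIC_ACQUIRE);

        if (key == 0)
            return -1; /* nothing published yet */

        *data = s->payload; /* safe: cannot be hoisted above the load */
        return 0;
    }

The writer's payload store and the reader's payload load need no ordering of
their own; only the release store and acquire load on the guard establish
the happens-before relationship.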