[RFC] ring: improve ring performance with C11 atomics

Message ID 20230421191642.217011-1-wathsala.vithanage@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [RFC] ring: improve ring performance with C11 atomics

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS

Commit Message

Wathsala Wathawana Vithanage April 21, 2023, 7:16 p.m. UTC
  The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
can be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED.
To calculate the addresses of the dequeue elements,
__rte_ring_dequeue_elems uses the old_head updated by the
__atomic_compare_exchange_n intrinsic in
__rte_ring_move_cons_head. This creates an address dependency
between the two operations, so __rte_ring_dequeue_elems cannot
be reordered before __rte_ring_move_cons_head.
Similarly, __rte_ring_enqueue_elems and __rte_ring_move_prod_head
won't be reordered either.

Performance on Arm N1
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8 (Arm N1)                         |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 766730 | Total count: 651686  | Total count: 812125  |
|                     |        Gain: -15%    |        Gain: 6%      |
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32 (Arm N1)                        |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 816745 | Total count: 646385  | Total count: 830935  |
|                     |        Gain: -21%    |        Gain: 2%      |
+-------------------------------------------------------------------+

Performance on x86-64 Cascade Lake
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8                                  |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 181640 | Total count: 181995  | Total count: 182791  |
|                     |        Gain: 0.2%    |        Gain: 0.6%    |
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32                                 |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 167495 | Total count: 161536  | Total count: 163190  |
|                     |        Gain: -3.5%   |        Gain: -2.6%   |
+-------------------------------------------------------------------+

Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Feifei Wang <feifei.wang2@arm.com>
---
 .mailmap                    |  1 +
 lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
 2 files changed, 10 insertions(+), 9 deletions(-)
  

Comments

Morten Brørup April 22, 2023, 1:19 p.m. UTC | #1
> From: Wathsala Vithanage [mailto:wathsala.vithanage@arm.com]
> Sent: Friday, 21 April 2023 21.17
> 
> The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
> can be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED.
> To calculate the addresses of the dequeue elements,
> __rte_ring_dequeue_elems uses the old_head updated by the
> __atomic_compare_exchange_n intrinsic in
> __rte_ring_move_cons_head. This creates an address dependency
> between the two operations, so __rte_ring_dequeue_elems cannot
> be reordered before __rte_ring_move_cons_head.
> Similarly, __rte_ring_enqueue_elems and __rte_ring_move_prod_head
> won't be reordered either.

These preconditions should be added as comments in the source code.

> 
> Performance on Arm N1
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8 (Arm N1)                         |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 766730 | Total count: 651686  | Total count: 812125  |
> |                     |        Gain: -15%    |        Gain: 6%      |
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32 (Arm N1)                        |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 816745 | Total count: 646385  | Total count: 830935  |
> |                     |        Gain: -21%    |        Gain: 2%      |
> +-------------------------------------------------------------------+

Big performance gain compared to pre-improved C11 atomics! Excellent.

> 
> Performance on x86-64 Cascade Lake
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8                                  |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 181640 | Total count: 181995  | Total count: 182791  |
> |                     |        Gain: 0.2%    |        Gain: 0.6%    |
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32                                 |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 167495 | Total count: 161536  | Total count: 163190  |
> |                     |        Gain: -3.5%   |        Gain: -2.6%   |
> +-------------------------------------------------------------------+

I noticed that the larger size (32 objects) had a larger relative drop in performance than the smaller size (8 objects), so I am wondering: what are the performance numbers for size 512, the default RTE_MEMPOOL_CACHE_MAX_SIZE? It's probably not going to change anything regarding the patch acceptance, but I'm curious about the numbers.

> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Feifei Wang <feifei.wang2@arm.com>
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com>
>  Wang Sheng-Hui <shhuiw@gmail.com>
>  Wangyu (Eric) <seven.wangyu@huawei.com>
>  Waterman Cao <waterman.cao@intel.com>
> +Wathsala Vithanage <wathsala.vithanage@arm.com>
>  Weichun Chen <weichunx.chen@intel.com>
>  Wei Dai <wei.dai@intel.com>
>  Weifeng Li <liweifeng96@126.com>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index f895950df4..1895f2bb0e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
> 
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore ht->tail should be updated
> +	 * with release semantics to prevent ring data copy phase from sinking
> +	 * below it.
> +	 */

I think this comment should clarified as:

Updating of ht->tail SHOULD NOT happen before elements are added to or
removed from the ring, as it WOULD result in data races between
producer and consumer threads. Therefore ht->tail MUST be updated
with release semantics to prevent ring data copy phase from sinking
below it.

Don't use capitals; I only used them to highlight my modifications.

NB: English is not my native language, so please ignore me if I am mistaken!

>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
>  }
> 
> @@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int
> is_sp,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -		/* load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		cons_tail = __atomic_load_n(&r->cons.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);

The __ATOMIC_ACQUIRE had a comment describing its purpose.

Please don't just remove that comment. Please replace it with a new comment describing why __ATOMIC_RELAXED is valid here.

> 
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> @@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -		/* this load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		prod_tail = __atomic_load_n(&r->prod.tail,
> -					__ATOMIC_ACQUIRE);
> -
> +					__ATOMIC_RELAXED);

Same comment as above: Don't just remove the old comment, please replace it with a new comment.

>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
> --
> 2.25.1

With comments regarding __ATOMIC_RELAXED added to the source code,

Acked-by: Morten Brørup <mb@smartsharesystems.com>
  
Wathsala Wathawana Vithanage June 8, 2023, 1:21 p.m. UTC | #2
The solution presented in this RFC is not C11 compliant.
C11 __atomic_compare_exchange_n updates "expected" only when the CAS operation fails.
Therefore, the assumption that there is an address dependency from the CAS in the producer/consumer head updates to the ring element accesses does not hold under the semantics of the C11 memory model.
As a result, this solution may corrupt ring elements when executed on CPUs with weak memory models, so this RFC will be retracted.

> -----Original Message-----
> From: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Sent: Friday, April 21, 2023 3:17 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> konstantin.v.ananyev@yandex.ru; Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Wathsala Wathawana Vithanage
> <wathsala.vithanage@arm.com>
> Subject: [RFC] ring: improve ring performance with C11 atomics
> 
> The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head can
> be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED.
> To calculate the addresses of the dequeue elements,
> __rte_ring_dequeue_elems uses the old_head updated by the
> __atomic_compare_exchange_n intrinsic in __rte_ring_move_cons_head.
> This creates an address dependency between the two operations, so
> __rte_ring_dequeue_elems cannot be reordered before
> __rte_ring_move_cons_head.
> Similarly, __rte_ring_enqueue_elems and __rte_ring_move_prod_head won't be
> reordered either.
> 
> Performance on Arm N1
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8 (Arm N1)                         |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 766730 | Total count: 651686  | Total count: 812125  |
> |                     |        Gain: -15%    |        Gain: 6%      |
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32 (Arm N1)                        |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 816745 | Total count: 646385  | Total count: 830935  |
> |                     |        Gain: -21%    |        Gain: 2%      |
> +-------------------------------------------------------------------+
> 
> Performance on x86-64 Cascade Lake
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8                                  |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 181640 | Total count: 181995  | Total count: 182791  |
> |                     |        Gain: 0.2%    |        Gain: 0.6%    |
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32                                 |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 167495 | Total count: 161536  | Total count: 163190  |
> |                     |        Gain: -3.5%   |        Gain: -2.6%   |
> +-------------------------------------------------------------------+
> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Feifei Wang <feifei.wang2@arm.com>
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans@corigine.com>
> Wang Sheng-Hui <shhuiw@gmail.com>  Wangyu (Eric)
> <seven.wangyu@huawei.com>  Waterman Cao <waterman.cao@intel.com>
> +Wathsala Vithanage <wathsala.vithanage@arm.com>
>  Weichun Chen <weichunx.chen@intel.com>
>  Wei Dai <wei.dai@intel.com>
>  Weifeng Li <liweifeng96@126.com>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h index
> f895950df4..1895f2bb0e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val,
> __ATOMIC_RELAXED);
> 
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore ht->tail should be
> updated
> +	 * with release semantics to prevent ring data copy phase from sinking
> +	 * below it.
> +	 */
>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);  }
> 
> @@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> unsigned int is_sp,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -		/* load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		cons_tail = __atomic_load_n(&r->cons.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);
> 
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have @@ -
> 145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -		/* this load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		prod_tail = __atomic_load_n(&r->prod.tail,
> -					__ATOMIC_ACQUIRE);
> -
> +					__ATOMIC_RELAXED);
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
> --
> 2.25.1
  

Patch

diff --git a/.mailmap b/.mailmap
index 4018f0fc47..367115d134 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1430,6 +1430,7 @@  Walter Heymans <walter.heymans@corigine.com>
 Wang Sheng-Hui <shhuiw@gmail.com>
 Wangyu (Eric) <seven.wangyu@huawei.com>
 Waterman Cao <waterman.cao@intel.com>
+Wathsala Vithanage <wathsala.vithanage@arm.com>
 Weichun Chen <weichunx.chen@intel.com>
 Wei Dai <wei.dai@intel.com>
 Weifeng Li <liweifeng96@126.com>
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index f895950df4..1895f2bb0e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -24,6 +24,13 @@  __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
 	if (!single)
 		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
+	/*
+	 * Updating of ht->tail cannot happen before elements are added to or
+	 * removed from the ring, as it could result in data races between
+	 * producer and consumer threads. Therefore ht->tail should be updated
+	 * with release semantics to prevent ring data copy phase from sinking
+	 * below it.
+	 */
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
 
@@ -69,11 +76,8 @@  __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		cons_tail = __atomic_load_n(&r->cons.tail,
-					__ATOMIC_ACQUIRE);
+					__ATOMIC_RELAXED);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -145,12 +149,8 @@  __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		prod_tail = __atomic_load_n(&r->prod.tail,
-					__ATOMIC_ACQUIRE);
-
+					__ATOMIC_RELAXED);
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0