[v3,2/3] ring: synchronize the load and store of the tail

Message ID 1537172244-64874-2-git-send-email-gavin.hu@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series [v3,1/3] ring: read tail using atomic load

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Gavin Hu Sept. 17, 2018, 8:17 a.m. UTC
Synchronize the load-acquire of the tail with the store-release
within update_tail. The store-release ensures that all ring operations,
enqueue or dequeue, are seen by observers on the other side as soon
as they see the updated tail. The load-acquire is needed here because
data dependency is not a reliable way to enforce ordering: the compiler
might break it by saving values to temporaries to boost performance.
When computing free_entries and avail_entries, load the heads and tails
with atomic semantics instead.
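
As a minimal sketch of this pairing in plain C11 atomics (an editor's
illustration with simplified names, not the DPDK source itself):

#include <stdatomic.h>
#include <stdint.h>

struct ht { _Atomic uint32_t head; _Atomic uint32_t tail; };

/* Consumer side, end of dequeue -- the role update_tail plays for
 * cons.tail: the release store keeps all prior slot reads before the
 * point where the new tail becomes visible to the producer. */
static void publish_tail(struct ht *cons, uint32_t new_tail)
{
	atomic_store_explicit(&cons->tail, new_tail, memory_order_release);
}

/* Producer side -- the role of the new load-acquire when computing
 * free space: acquire synchronizes with the release store above, so
 * every slot counted as free here really has been vacated. */
static uint32_t free_entries(struct ht *cons, uint32_t old_head,
			     uint32_t capacity)
{
	uint32_t cons_tail = atomic_load_explicit(&cons->tail,
						  memory_order_acquire);
	/* unsigned 32-bit subtraction is modulo 2^32, so the result
	 * stays in [0, capacity] even across index wrap-around */
	return capacity + cons_tail - old_head;
}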

The patch was benchmarked with test/ring_perf_autotest; it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores. The real gain
depends on the number of lcores, the depth of the ring, and SP/SC vs. MP/MC.
With one lcore it also improves slightly, by about 3 ~ 4%.
The largest improvement is the MP/MC case with two lcores and a ring size
of 32, where latency drops by up to (3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix; the performance improvement is a bonus. In our
analysis the improvement comes from cacheline pre-filling after hoisting
the load-acquire above __atomic_compare_exchange_n.
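
The shape of that hoisting, as a rough editor's sketch (a simplified
multi-producer reserve loop in plain C11 atomics under the analysis
above; the names and loop structure are illustrative, not the actual
DPDK code):

#include <stdatomic.h>
#include <stdint.h>

struct ring { _Atomic uint32_t prod_head; _Atomic uint32_t cons_tail;
	      uint32_t capacity; };

/* Reserve n slots for a producer; returns n, or 0 if not enough room. */
static uint32_t move_prod_head_sketch(struct ring *r, uint32_t n)
{
	uint32_t old_head, cons_tail, free_n;

	old_head = atomic_load_explicit(&r->prod_head, memory_order_acquire);
	do {
		/* the acquire load of the opposing tail is issued up
		 * front, so its cache line is already being fetched by
		 * the time the CAS below runs, instead of being demanded
		 * right at the CAS */
		cons_tail = atomic_load_explicit(&r->cons_tail,
						 memory_order_acquire);
		free_n = r->capacity + cons_tail - old_head;
		if (n > free_n)
			return 0;
		/* on failure the CAS refreshes old_head and we retry */
	} while (!atomic_compare_exchange_weak_explicit(&r->prod_head,
			&old_head, old_head + n,
			memory_order_acquire, memory_order_acquire));
	return n;
}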

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test results with this patch (two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

For comparison, the test results without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)
  

Comments

Gavin Hu Sept. 26, 2018, 9:29 a.m. UTC | #1
+Justin He for review.

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Monday, September 17, 2018 4:17 PM
> To: dev@dpdk.org
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org
> Subject: [PATCH v3 2/3] ring: synchronize the load and store of the tail
> 
> Synchronize the load-acquire of the tail with the store-release within
> update_tail. The store-release ensures that all ring operations, enqueue or
> dequeue, are seen by observers on the other side as soon as they see
> the updated tail. The load-acquire is needed here because data dependency is
> not a reliable way to enforce ordering: the compiler might break it by saving
> values to temporaries to boost performance.
> When computing free_entries and avail_entries, load the heads and tails with
> atomic semantics instead.
> 
> The patch was benchmarked with test/ring_perf_autotest; it decreases
> the enqueue/dequeue latency by 5% ~ 27.6% with two lcores. The real gain
> depends on the number of lcores, the depth of the ring, and SP/SC vs. MP/MC.
> With one lcore it also improves slightly, by about 3 ~ 4%.
> The largest improvement is the MP/MC case with two lcores and a ring size
> of 32, where latency drops by up to (3.26-2.36)/3.26 = 27.6%.
> 
> This patch is a bug fix; the performance improvement is a bonus. In our
> analysis the improvement comes from cacheline pre-filling after hoisting
> the load-acquire above __atomic_compare_exchange_n.
> 
> The test command:
> $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
> 1024 -- -i
> 
> Test results with this patch (two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.86
>  MP/MC bulk enq/dequeue (size: 8): 10.15
>  SP/SC bulk enq/dequeue (size: 32): 1.94
>  MP/MC bulk enq/dequeue (size: 32): 2.36
> 
> For comparison, the test results without this patch:
>  SP/SC bulk enq/dequeue (size: 8): 6.67
>  MP/MC bulk enq/dequeue (size: 8): 13.12
>  SP/SC bulk enq/dequeue (size: 32): 2.04
>  MP/MC bulk enq/dequeue (size: 32): 3.26
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> index 234fea0..0eae3b3 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -68,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		*old_head = __atomic_load_n(&r->prod.head,
>  					__ATOMIC_ACQUIRE);
> 
> -		/*
> -		 *  The subtraction is done between two unsigned 32bits value
> +		/* load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> +							__ATOMIC_ACQUIRE);
> +
> +		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * *old_head > cons_tail). So 'free_entries' is always between 0
>  		 * and capacity (which is < size).
>  		 */
> -		*free_entries = (capacity + r->cons.tail - *old_head);
> +		*free_entries = (capacity + cons_tail - *old_head);
> 
>  		/* check that we have enough room in ring */
>  		if (unlikely(n > *free_entries))
> @@ -132,15 +137,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  	do {
>  		/* Restore n as it may change every loop */
>  		n = max;
> +
>  		*old_head = __atomic_load_n(&r->cons.head,
>  					__ATOMIC_ACQUIRE);
> 
> +		/* this load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> +					__ATOMIC_ACQUIRE);
> +
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
>  		 * and size(ring)-1.
>  		 */
> -		*entries = (r->prod.tail - *old_head);
> +		*entries = (prod_tail - *old_head);
> 
>  		/* Set the actual entries for dequeue */
>  		if (n > *entries)
> --
> 2.7.4
  
Justin He Sept. 26, 2018, 9:59 a.m. UTC | #2
Hi Gavin

> -----Original Message-----
> From: Gavin Hu (Arm Technology China)
> Sent: September 26, 2018 17:30
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org; Justin
> He <Justin.He@arm.com>
> Subject: RE: [PATCH v3 2/3] ring: synchronize the load and store of the tail
>
> +Justin He for review.
>
> > -----Original Message-----
> > From: Gavin Hu <gavin.hu@arm.com>
> > Sent: Monday, September 17, 2018 4:17 PM
> > To: dev@dpdk.org
> > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> > jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org
> > Subject: [PATCH v3 2/3] ring: synchronize the load and store of the
> > tail
> >
> > Synchronize the load-acquire of the tail with the store-release within
> > update_tail. The store-release ensures that all ring operations,
> > enqueue or dequeue, are seen by observers on the other side as
> > soon as they see the updated tail. The load-acquire is needed here because
> > data dependency is not a reliable way to enforce ordering: the compiler
> > might break it by saving values to temporaries to boost performance.
> > When computing free_entries and avail_entries, load the heads and
> > tails with atomic semantics instead.
> >
> > The patch was benchmarked with test/ring_perf_autotest; it
> > decreases the enqueue/dequeue latency by 5% ~ 27.6% with two lcores.
> > The real gain depends on the number of lcores, the depth of the ring,
> > and SP/SC vs. MP/MC.
> > With one lcore it also improves slightly, by about 3 ~ 4%.
> > The largest improvement is the MP/MC case with two lcores and a ring
> > size of 32, where latency drops by up to (3.26-2.36)/3.26 = 27.6%.
> >
> > This patch is a bug fix; the performance improvement is a bonus. In our
> > analysis the improvement comes from cacheline pre-filling after
> > hoisting the load-acquire above __atomic_compare_exchange_n.
> >
> > The test command:
> > $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
> > 1024 -- -i
> >
> > Test results with this patch (two cores):
> >  SP/SC bulk enq/dequeue (size: 8): 5.86
> >  MP/MC bulk enq/dequeue (size: 8): 10.15
> >  SP/SC bulk enq/dequeue (size: 32): 1.94
> >  MP/MC bulk enq/dequeue (size: 32): 2.36
> >
> > For comparison, the test results without this patch:
> >  SP/SC bulk enq/dequeue (size: 8): 6.67
> >  MP/MC bulk enq/dequeue (size: 8): 13.12
> >  SP/SC bulk enq/dequeue (size: 32): 2.04
> >  MP/MC bulk enq/dequeue (size: 32): 3.26
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> > b/lib/librte_ring/rte_ring_c11_mem.h
> > index 234fea0..0eae3b3 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -68,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> >  		*old_head = __atomic_load_n(&r->prod.head,
> >  					__ATOMIC_ACQUIRE);
> >
> > -		/*
> > -		 *  The subtraction is done between two unsigned 32bits value
> > +		/* load-acquire synchronize with store-release of ht->tail
> > +		 * in update_tail.
> > +		 */
> > +		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> > +							__ATOMIC_ACQUIRE);
I noticed that FreeBSD also uses the double __atomic_load_n. As we
discussed several months ago [1], the second load-acquire seems
unnecessary, but since you have verified it, it looks good to me.
[1] http://mails.dpdk.org/archives/dev/2017-November/080983.html
So,
Reviewed-by: Jia He <justin.he@arm.com>

Cheers,
Justin (Jia He)
> > +
> > +		/* The subtraction is done between two unsigned 32bits value
> >  		 * (the result is always modulo 32 bits even if we have
> >  		 * *old_head > cons_tail). So 'free_entries' is always between 0
> >  		 * and capacity (which is < size).
> >  		 */
> > -		*free_entries = (capacity + r->cons.tail - *old_head);
> > +		*free_entries = (capacity + cons_tail - *old_head);
> >
> >  		/* check that we have enough room in ring */
> >  		if (unlikely(n > *free_entries))
> > @@ -132,15 +137,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  	do {
> >  		/* Restore n as it may change every loop */
> >  		n = max;
> > +
> >  		*old_head = __atomic_load_n(&r->cons.head,
> >  					__ATOMIC_ACQUIRE);
> >
> > +		/* this load-acquire synchronize with store-release of ht->tail
> > +		 * in update_tail.
> > +		 */
> > +		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> > +					__ATOMIC_ACQUIRE);
> > +
> >  		/* The subtraction is done between two unsigned 32bits value
> >  		 * (the result is always modulo 32 bits even if we have
> >  		 * cons_head > prod_tail). So 'entries' is always between 0
> >  		 * and size(ring)-1.
> >  		 */
> > -		*entries = (r->prod.tail - *old_head);
> > +		*entries = (prod_tail - *old_head);
> >
> >  		/* Set the actual entries for dequeue */
> >  		if (n > *entries)
> > --
> > 2.7.4

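On Justin's point above about the double load-acquire: a consumer-side
sketch of what the acquire on the opposing tail buys (an editor's
illustration in plain C11 atomics with hypothetical names, not the DPDK
source). With a relaxed load instead, a weakly ordered CPU may complete
the slot read before the tail load, so the consumer could observe the
new tail yet read stale slot contents:

#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 32u			/* power of two, for masking */

static _Atomic uint32_t prod_tail;
static uint32_t slots[RING_SIZE];

/* Returns 1 and stores a value if an entry was available, else 0. */
static int dequeue_one(uint32_t cons_head, uint32_t *out)
{
	/* acquire pairs with the producer's release store of prod_tail
	 * in update_tail: seeing the new tail guarantees we also see
	 * the slot contents written before it */
	uint32_t tail = atomic_load_explicit(&prod_tail,
					     memory_order_acquire);
	if (tail == cons_head)
		return 0;		/* ring empty */
	*out = slots[cons_head & (RING_SIZE - 1)];
	return 1;
}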
  
Jerin Jacob Sept. 29, 2018, 10:57 a.m. UTC | #3
-----Original Message-----
> Date: Mon, 17 Sep 2018 16:17:23 +0800
> From: Gavin Hu <gavin.hu@arm.com>
> To: dev@dpdk.org
> CC: gavin.hu@arm.com, Honnappa.Nagarahalli@arm.com, steve.capper@arm.com,
>  Ola.Liljedahl@arm.com, jerin.jacob@caviumnetworks.com, nd@arm.com,
>  stable@dpdk.org
> Subject: [PATCH v3 2/3] ring: synchronize the load and store of the tail
> X-Mailer: git-send-email 2.7.4
> 
> 
> Synchronize the load-acquire of the tail with the store-release
> within update_tail. The store-release ensures that all ring operations,
> enqueue or dequeue, are seen by observers on the other side as soon
> as they see the updated tail. The load-acquire is needed here because
> data dependency is not a reliable way to enforce ordering: the compiler
> might break it by saving values to temporaries to boost performance.
> When computing free_entries and avail_entries, load the heads and tails
> with atomic semantics instead.
> 
> The patch was benchmarked with test/ring_perf_autotest; it decreases
> the enqueue/dequeue latency by 5% ~ 27.6% with two lcores. The real gain
> depends on the number of lcores, the depth of the ring, and SP/SC vs. MP/MC.
> With one lcore it also improves slightly, by about 3 ~ 4%.
> The largest improvement is the MP/MC case with two lcores and a ring size
> of 32, where latency drops by up to (3.26-2.36)/3.26 = 27.6%.
> 
> This patch is a bug fix; the performance improvement is a bonus. In our
> analysis the improvement comes from cacheline pre-filling after hoisting
> the load-acquire above __atomic_compare_exchange_n.
> 
> The test command:
> $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
> 1024 -- -i
> 
> Test results with this patch (two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.86
>  MP/MC bulk enq/dequeue (size: 8): 10.15
>  SP/SC bulk enq/dequeue (size: 32): 1.94
>  MP/MC bulk enq/dequeue (size: 32): 2.36
> 
> For comparison, the test results without this patch:
>  SP/SC bulk enq/dequeue (size: 8): 6.67
>  MP/MC bulk enq/dequeue (size: 8): 13.12
>  SP/SC bulk enq/dequeue (size: 32): 2.04
>  MP/MC bulk enq/dequeue (size: 32): 3.26
> 
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>

Tested with a ThunderX2 server platform. Though it has a minor performance
impact on the non-burst variants, for the burst variants I could see a
similar performance improvement, and it is C11-semantically correct too.

Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
  

Patch

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 234fea0..0eae3b3 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -68,13 +68,18 @@  __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		*old_head = __atomic_load_n(&r->prod.head,
 					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
+							__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -132,15 +137,22 @@  __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
+
 		*old_head = __atomic_load_n(&r->cons.head,
 					__ATOMIC_ACQUIRE);
 
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
+					__ATOMIC_ACQUIRE);
+
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)