[1/6] ring: change head and tail to pointer-width size

Message ID 20190110210122.24889-2-gage.eads@intel.com
State Superseded
Delegated to: Thomas Monjalon
Series
  • Add non-blocking ring

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Eads, Gage Jan. 10, 2019, 9:01 p.m.
For 64-bit architectures, doubling the head and tail index widths greatly
increases the time it takes for them to wrap-around (with current CPU
speeds, it won't happen within the author's lifetime). This is important in
avoiding the ABA problem -- in which a thread mistakes reading the same
tail index in two accesses to mean that the ring was not modified in the
intervening time -- in the upcoming non-blocking ring implementation. Using
a 64-bit index makes the possibility of this occurring effectively zero.
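To make the wrap-around argument concrete, here is a minimal sketch in C. It is an illustration only, not code from the patch: the 8-bit index is deliberately narrow so the wrap is easy to see, the function names are hypothetical, and the update rate passed to seconds_to_wrap() is an assumed figure rather than a measurement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustration only: a deliberately narrow (8-bit) index, not the ring's
 * real head/tail type. Returns true if a reader that saved the index and
 * compares it again after `updates` increments would wrongly conclude
 * that nothing changed in between.
 */
bool index_looks_unchanged(uint8_t start, unsigned int updates)
{
	uint8_t snapshot = start;
	uint8_t idx = start;

	for (unsigned int i = 0; i < updates; i++)
		idx++;	/* stored back into uint8_t, so it wraps modulo 256 */

	return idx == snapshot;
}

/* Seconds until an index of `bits` width wraps at a fixed update rate. */
double seconds_to_wrap(unsigned int bits, double updates_per_sec)
{
	double values = 1.0;

	for (unsigned int i = 0; i < bits; i++)
		values *= 2.0;	/* 2^bits distinct index values */

	return values / updates_per_sec;
}
```

With an assumed rate of one billion index updates per second, seconds_to_wrap(32, 1e9) is about 4.3 seconds, while seconds_to_wrap(64, 1e9) is about 1.8e10 seconds, i.e. several centuries.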

I tested this commit's performance impact with an x86_64 build on a
dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change made
no significant difference -- the few differences appear to be system noise.
(The test ran on isolcpus cores using a tickless scheduler, but some
variation was still observed.) Each test was run three times and the results
were averaged:

                                  | 64b head/tail cycle cost minus
             Test                 |     32b head/tail cycle cost
------------------------------------------------------------------
SP/SC single enq/dequeue          | 0.33
MP/MC single enq/dequeue          | 0.00
SP/SC burst enq/dequeue (size 8)  | 0.00
MP/MC burst enq/dequeue (size 8)  | 1.00
SP/SC burst enq/dequeue (size 32) | 0.00
MP/MC burst enq/dequeue (size 32) | -1.00
SC empty dequeue                  | 0.01
MC empty dequeue                  | 0.00

Single lcore:
SP/SC bulk enq/dequeue (size 8)   | -0.36
MP/MC bulk enq/dequeue (size 8)   | 0.99
SP/SC bulk enq/dequeue (size 32)  | -0.40
MP/MC bulk enq/dequeue (size 32)  | -0.57

Two physical cores:
SP/SC bulk enq/dequeue (size 8)   | -0.49
MP/MC bulk enq/dequeue (size 8)   | 0.19
SP/SC bulk enq/dequeue (size 32)  | -0.28
MP/MC bulk enq/dequeue (size 32)  | -0.62

Two NUMA nodes:
SP/SC bulk enq/dequeue (size 8)   | 3.25
MP/MC bulk enq/dequeue (size 8)   | 1.87
SP/SC bulk enq/dequeue (size 32)  | -0.44
MP/MC bulk enq/dequeue (size 32)  | -1.10

An earlier version of this patch changed the head and tail indexes to
uint64_t, but that caused a performance drop on 32-bit builds. With
uintptr_t, no performance difference is observed on an i686 build.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_eventdev/rte_event_ring.h |  6 +++---
 lib/librte_ring/rte_ring.c           | 10 +++++-----
 lib/librte_ring/rte_ring.h           | 20 ++++++++++----------
 lib/librte_ring/rte_ring_generic.h   | 16 +++++++++-------
 4 files changed, 27 insertions(+), 25 deletions(-)

Comments

Stephen Hemminger Jan. 11, 2019, 4:38 a.m. | #1
On Thu, 10 Jan 2019 15:01:17 -0600
Gage Eads <gage.eads@intel.com> wrote:

> For 64-bit architectures, doubling the head and tail index widths greatly
> increases the time it takes for them to wrap-around (with current CPU
> speeds, it won't happen within the author's lifetime). This is important in
> avoiding the ABA problem -- in which a thread mistakes reading the same
> tail index in two accesses to mean that the ring was not modified in the
> intervening time -- in the upcoming non-blocking ring implementation. Using
> a 64-bit index makes the possibility of this occurring effectively zero.
> 
> I tested this commit's performance impact with an x86_64 build on a
> dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change made
> no significant difference -- the few differences appear to be system noise.
> (The test ran on isolcpus cores using a tickless scheduler, but some
> variation was still observed.) Each test was run three times and the results
> were averaged:
> 
>                                   | 64b head/tail cycle cost minus
>              Test                 |     32b head/tail cycle cost
> ------------------------------------------------------------------
> SP/SC single enq/dequeue          | 0.33
> MP/MC single enq/dequeue          | 0.00
> SP/SC burst enq/dequeue (size 8)  | 0.00
> MP/MC burst enq/dequeue (size 8)  | 1.00
> SP/SC burst enq/dequeue (size 32) | 0.00
> MP/MC burst enq/dequeue (size 32) | -1.00
> SC empty dequeue                  | 0.01
> MC empty dequeue                  | 0.00
> 
> Single lcore:
> SP/SC bulk enq/dequeue (size 8)   | -0.36
> MP/MC bulk enq/dequeue (size 8)   | 0.99
> SP/SC bulk enq/dequeue (size 32)  | -0.40
> MP/MC bulk enq/dequeue (size 32)  | -0.57
> 
> Two physical cores:
> SP/SC bulk enq/dequeue (size 8)   | -0.49
> MP/MC bulk enq/dequeue (size 8)   | 0.19
> SP/SC bulk enq/dequeue (size 32)  | -0.28
> MP/MC bulk enq/dequeue (size 32)  | -0.62
> 
> Two NUMA nodes:
> SP/SC bulk enq/dequeue (size 8)   | 3.25
> MP/MC bulk enq/dequeue (size 8)   | 1.87
> SP/SC bulk enq/dequeue (size 32)  | -0.44
> MP/MC bulk enq/dequeue (size 32)  | -1.10
> 
> An earlier version of this patch changed the head and tail indexes to
> uint64_t, but that caused a performance drop on 32-bit builds. With
> uintptr_t, no performance difference is observed on an i686 build.
> 
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  lib/librte_eventdev/rte_event_ring.h |  6 +++---
>  lib/librte_ring/rte_ring.c           | 10 +++++-----
>  lib/librte_ring/rte_ring.h           | 20 ++++++++++----------
>  lib/librte_ring/rte_ring_generic.h   | 16 +++++++++-------
>  4 files changed, 27 insertions(+), 25 deletions(-)
> 
> diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h
> index 827a3209e..eae70f904 100644
> --- a/lib/librte_eventdev/rte_event_ring.h
> +++ b/lib/librte_eventdev/rte_event_ring.h
> @@ -1,5 +1,5 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
> - * Copyright(c) 2016-2017 Intel Corporation
> + * Copyright(c) 2016-2019 Intel Corporation
>   */
>  
>  /**
> @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
>  		const struct rte_event *events,
>  		unsigned int n, uint16_t *free_space)
>  {
> -	uint32_t prod_head, prod_next;
> +	uintptr_t prod_head, prod_next;
>  	uint32_t free_entries;
>  
>  	n = __rte_ring_move_prod_head(&r->r, r->r.prod.single, n,
> @@ -129,7 +129,7 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
>  		struct rte_event *events,
>  		unsigned int n, uint16_t *available)
>  {
> -	uint32_t cons_head, cons_next;
> +	uintptr_t cons_head, cons_next;
>  	uint32_t entries;
>  
>  	n = __rte_ring_move_cons_head(&r->r, r->r.cons.single, n,
> diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> index d215acecc..b15ee0eb3 100644
> --- a/lib/librte_ring/rte_ring.c
> +++ b/lib/librte_ring/rte_ring.c
> @@ -1,6 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   *
> - * Copyright (c) 2010-2015 Intel Corporation
> + * Copyright (c) 2010-2019 Intel Corporation
>   * Copyright (c) 2007,2008 Kip Macy kmacy@freebsd.org
>   * All rights reserved.
>   * Derived from FreeBSD's bufring.h
> @@ -227,10 +227,10 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
>  	fprintf(f, "  flags=%x\n", r->flags);
>  	fprintf(f, "  size=%"PRIu32"\n", r->size);
>  	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
> -	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
> -	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
> -	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
> -	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
> +	fprintf(f, "  ct=%"PRIuPTR"\n", r->cons.tail);
> +	fprintf(f, "  ch=%"PRIuPTR"\n", r->cons.head);
> +	fprintf(f, "  pt=%"PRIuPTR"\n", r->prod.tail);
> +	fprintf(f, "  ph=%"PRIuPTR"\n", r->prod.head);
>  	fprintf(f, "  used=%u\n", rte_ring_count(r));
>  	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
>  }
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index af5444a9f..12af64e13 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -1,6 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   *
> - * Copyright (c) 2010-2017 Intel Corporation
> + * Copyright (c) 2010-2019 Intel Corporation
>   * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
>   * All rights reserved.
>   * Derived from FreeBSD's bufring.h
> @@ -65,8 +65,8 @@ struct rte_memzone; /* forward declaration, so as not to require memzone.h */
>  
>  /* structure to hold a pair of head/tail values and other metadata */
>  struct rte_ring_headtail {
> -	volatile uint32_t head;  /**< Prod/consumer head. */
> -	volatile uint32_t tail;  /**< Prod/consumer tail. */
> +	volatile uintptr_t head;  /**< Prod/consumer head. */
> +	volatile uintptr_t tail;  /**< Prod/consumer tail. */
>  	uint32_t single;         /**< True if single prod/cons */
>  };

Isn't this a major ABI change which will break existing applications?
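The layout change behind this question can be checked with a small sketch. The two structs below mirror the removed and added lines of the rte_ring_headtail hunk above; the struct names and the helper function are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the patch (fields as in the removed lines of the diff). */
struct headtail_u32 {
	volatile uint32_t head;
	volatile uint32_t tail;
	uint32_t single;
};

/* Layout after the patch (fields as in the added lines of the diff). */
struct headtail_uptr {
	volatile uintptr_t head;
	volatile uintptr_t tail;
	uint32_t single;
};

/*
 * True when a binary built against one layout would read the wrong
 * bytes from a structure built with the other layout.
 */
bool headtail_layout_changed(void)
{
	return sizeof(struct headtail_u32) != sizeof(struct headtail_uptr) ||
	       offsetof(struct headtail_u32, tail) !=
	       offsetof(struct headtail_uptr, tail) ||
	       offsetof(struct headtail_u32, single) !=
	       offsetof(struct headtail_uptr, single);
}
```

On a typical LP64 target such as x86_64, tail moves from offset 4 to 8, single from offset 8 to 16, and the struct grows from 12 to 24 bytes, so headtail_layout_changed() returns true. On a 32-bit build, where uintptr_t is 4 bytes wide, the two layouts are identical.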
Anatoly Burakov Jan. 11, 2019, 10:25 a.m. | #2
On 10-Jan-19 9:01 PM, Gage Eads wrote:
> For 64-bit architectures, doubling the head and tail index widths greatly
> increases the time it takes for them to wrap-around (with current CPU
> speeds, it won't happen within the author's lifetime). This is important in
> avoiding the ABA problem -- in which a thread mistakes reading the same
> tail index in two accesses to mean that the ring was not modified in the
> intervening time -- in the upcoming non-blocking ring implementation. Using
> a 64-bit index makes the possibility of this occurring effectively zero.
> 
> I tested this commit's performance impact with an x86_64 build on a
> dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change made
> no significant difference -- the few differences appear to be system noise.
> (The test ran on isolcpus cores using a tickless scheduler, but some
> variation was still observed.) Each test was run three times and the results
> were averaged:
> 
>                                    | 64b head/tail cycle cost minus
>               Test                 |     32b head/tail cycle cost
> ------------------------------------------------------------------
> SP/SC single enq/dequeue          | 0.33
> MP/MC single enq/dequeue          | 0.00
> SP/SC burst enq/dequeue (size 8)  | 0.00
> MP/MC burst enq/dequeue (size 8)  | 1.00
> SP/SC burst enq/dequeue (size 32) | 0.00
> MP/MC burst enq/dequeue (size 32) | -1.00
> SC empty dequeue                  | 0.01
> MC empty dequeue                  | 0.00
> 
> Single lcore:
> SP/SC bulk enq/dequeue (size 8)   | -0.36
> MP/MC bulk enq/dequeue (size 8)   | 0.99
> SP/SC bulk enq/dequeue (size 32)  | -0.40
> MP/MC bulk enq/dequeue (size 32)  | -0.57
> 
> Two physical cores:
> SP/SC bulk enq/dequeue (size 8)   | -0.49
> MP/MC bulk enq/dequeue (size 8)   | 0.19
> SP/SC bulk enq/dequeue (size 32)  | -0.28
> MP/MC bulk enq/dequeue (size 32)  | -0.62
> 
> Two NUMA nodes:
> SP/SC bulk enq/dequeue (size 8)   | 3.25
> MP/MC bulk enq/dequeue (size 8)   | 1.87
> SP/SC bulk enq/dequeue (size 32)  | -0.44
> MP/MC bulk enq/dequeue (size 32)  | -1.10
> 
> An earlier version of this patch changed the head and tail indexes to
> uint64_t, but that caused a performance drop on 32-bit builds. With
> uintptr_t, no performance difference is observed on an i686 build.
> 
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---

You're breaking the ABI - version bump for affected libraries is needed.
Anatoly Burakov Jan. 11, 2019, 10:40 a.m. | #3
<...>

> + * Copyright(c) 2016-2019 Intel Corporation
>    */
>   
>   /**
> @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
>   		const struct rte_event *events,
>   		unsigned int n, uint16_t *free_space)
>   {
> -	uint32_t prod_head, prod_next;
> +	uintptr_t prod_head, prod_next;

I would also question the use of uintptr_t. I think semantically, size_t 
is more appropriate.
Bruce Richardson Jan. 11, 2019, 10:58 a.m. | #4
On Fri, Jan 11, 2019 at 10:40:19AM +0000, Burakov, Anatoly wrote:
> <...>
> 
> > + * Copyright(c) 2016-2019 Intel Corporation
> >    */
> >   /**
> > @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
> >   		const struct rte_event *events,
> >   		unsigned int n, uint16_t *free_space)
> >   {
> > -	uint32_t prod_head, prod_next;
> > +	uintptr_t prod_head, prod_next;
> 
> I would also question the use of uintptr_t. I think semantically, size_t is
> more appropriate.
> 
Yes, it would, but I believe in this case they want to use the largest size
of (unsigned)int where there exists an atomic for manipulating 2 of them
simultaneously. [The largest size is to minimize any chance of an ABA issue
occurring]. Therefore we need 32-bit values on 32-bit and 64-bit on 64, and
I suspect the best way to guarantee this is to use pointer-sized values. If
size_t is guaranteed across all OS's to have the same size as uintptr_t it
could also be used, though.

/Bruce
Anatoly Burakov Jan. 11, 2019, 11:30 a.m. | #5
On 11-Jan-19 10:58 AM, Bruce Richardson wrote:
> On Fri, Jan 11, 2019 at 10:40:19AM +0000, Burakov, Anatoly wrote:
>> <...>
>>
>>> + * Copyright(c) 2016-2019 Intel Corporation
>>>     */
>>>    /**
>>> @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
>>>    		const struct rte_event *events,
>>>    		unsigned int n, uint16_t *free_space)
>>>    {
>>> -	uint32_t prod_head, prod_next;
>>> +	uintptr_t prod_head, prod_next;
>>
>> I would also question the use of uintptr_t. I think semantically, size_t is
>> more appropriate.
>>
> Yes, it would, but I believe in this case they want to use the largest size
> of (unsigned)int where there exists an atomic for manipulating 2 of them
> simultaneously. [The largest size is to minimize any chance of an ABA issue
> occurring]. Therefore we need 32-bit values on 32-bit and 64-bit on 64, and
> I suspect the best way to guarantee this is to use pointer-sized values. If
> size_t is guaranteed across all OS's to have the same size as uintptr_t it
> could also be used, though.
> 
> /Bruce
> 

Technically, size_t and uintptr_t are not guaranteed to match. In 
practice, they won't match only on architectures that DPDK doesn't 
intend to run on (such as 16-bit segmented archs, where size_t would be 
16-bit but uintptr_t would be 32-bit).

In all the rest of DPDK code, we use size_t for this kind of thing.
Eads, Gage Jan. 11, 2019, 7:07 p.m. | #6
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Thursday, January 10, 2019 10:39 PM
> To: Eads, Gage <gage.eads@intel.com>
> Cc: dev@dpdk.org; olivier.matz@6wind.com; arybchenko@solarflare.com;
> Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
> size
> 
> On Thu, 10 Jan 2019 15:01:17 -0600
> Gage Eads <gage.eads@intel.com> wrote:
> 
> > For 64-bit architectures, doubling the head and tail index widths
> > greatly increases the time it takes for them to wrap-around (with
> > current CPU speeds, it won't happen within the author's lifetime).
> > This is important in avoiding the ABA problem -- in which a thread
> > mistakes reading the same tail index in two accesses to mean that the
> > ring was not modified in the intervening time -- in the upcoming
> > non-blocking ring implementation. Using a 64-bit index makes the possibility
> > of this occurring effectively zero.
> >
> > I tested this commit's performance impact with an x86_64 build on a
> > dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change
> > made no significant difference -- the few differences appear to be system
> > noise.
> > (The test ran on isolcpus cores using a tickless scheduler, but some
> > variation was still observed.) Each test was run three times and the
> > results were averaged:
> >
> >                                   | 64b head/tail cycle cost minus
> >              Test                 |     32b head/tail cycle cost
> > ------------------------------------------------------------------
> > SP/SC single enq/dequeue          | 0.33
> > MP/MC single enq/dequeue          | 0.00
> > SP/SC burst enq/dequeue (size 8)  | 0.00
> > MP/MC burst enq/dequeue (size 8)  | 1.00
> > SP/SC burst enq/dequeue (size 32) | 0.00
> > MP/MC burst enq/dequeue (size 32) | -1.00
> > SC empty dequeue                  | 0.01
> > MC empty dequeue                  | 0.00
> >
> > Single lcore:
> > SP/SC bulk enq/dequeue (size 8)   | -0.36
> > MP/MC bulk enq/dequeue (size 8)   | 0.99
> > SP/SC bulk enq/dequeue (size 32)  | -0.40
> > MP/MC bulk enq/dequeue (size 32)  | -0.57
> >
> > Two physical cores:
> > SP/SC bulk enq/dequeue (size 8)   | -0.49
> > MP/MC bulk enq/dequeue (size 8)   | 0.19
> > SP/SC bulk enq/dequeue (size 32)  | -0.28
> > MP/MC bulk enq/dequeue (size 32)  | -0.62
> >
> > Two NUMA nodes:
> > SP/SC bulk enq/dequeue (size 8)   | 3.25
> > MP/MC bulk enq/dequeue (size 8)   | 1.87
> > SP/SC bulk enq/dequeue (size 32)  | -0.44
> > MP/MC bulk enq/dequeue (size 32)  | -1.10
> >
> > An earlier version of this patch changed the head and tail indexes to
> > uint64_t, but that caused a performance drop on 32-bit builds. With
> > uintptr_t, no performance difference is observed on an i686 build.
> >
> > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > ---
> >  lib/librte_eventdev/rte_event_ring.h |  6 +++---
> >  lib/librte_ring/rte_ring.c           | 10 +++++-----
> >  lib/librte_ring/rte_ring.h           | 20 ++++++++++----------
> >  lib/librte_ring/rte_ring_generic.h   | 16 +++++++++-------
> >  4 files changed, 27 insertions(+), 25 deletions(-)
> >
> > diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h
> > index 827a3209e..eae70f904 100644
> > --- a/lib/librte_eventdev/rte_event_ring.h
> > +++ b/lib/librte_eventdev/rte_event_ring.h
> > @@ -1,5 +1,5 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> > - * Copyright(c) 2016-2017 Intel Corporation
> > + * Copyright(c) 2016-2019 Intel Corporation
> >   */
> >
> >  /**
> > @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
> >  		const struct rte_event *events,
> >  		unsigned int n, uint16_t *free_space)
> >  {
> > -	uint32_t prod_head, prod_next;
> > +	uintptr_t prod_head, prod_next;
> >  	uint32_t free_entries;
> >
> >  	n = __rte_ring_move_prod_head(&r->r, r->r.prod.single, n,
> > @@ -129,7 +129,7 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
> >  		struct rte_event *events,
> >  		unsigned int n, uint16_t *available)
> >  {
> > -	uint32_t cons_head, cons_next;
> > +	uintptr_t cons_head, cons_next;
> >  	uint32_t entries;
> >
> >  	n = __rte_ring_move_cons_head(&r->r, r->r.cons.single, n,
> > diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> > index d215acecc..b15ee0eb3 100644
> > --- a/lib/librte_ring/rte_ring.c
> > +++ b/lib/librte_ring/rte_ring.c
> > @@ -1,6 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   *
> > - * Copyright (c) 2010-2015 Intel Corporation
> > + * Copyright (c) 2010-2019 Intel Corporation
> >   * Copyright (c) 2007,2008 Kip Macy kmacy@freebsd.org
> >   * All rights reserved.
> >   * Derived from FreeBSD's bufring.h
> > @@ -227,10 +227,10 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
> >  	fprintf(f, "  flags=%x\n", r->flags);
> >  	fprintf(f, "  size=%"PRIu32"\n", r->size);
> >  	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
> > -	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
> > -	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
> > -	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
> > -	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
> > +	fprintf(f, "  ct=%"PRIuPTR"\n", r->cons.tail);
> > +	fprintf(f, "  ch=%"PRIuPTR"\n", r->cons.head);
> > +	fprintf(f, "  pt=%"PRIuPTR"\n", r->prod.tail);
> > +	fprintf(f, "  ph=%"PRIuPTR"\n", r->prod.head);
> >  	fprintf(f, "  used=%u\n", rte_ring_count(r));
> >  	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
> >  }
> > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > index af5444a9f..12af64e13 100644
> > --- a/lib/librte_ring/rte_ring.h
> > +++ b/lib/librte_ring/rte_ring.h
> > @@ -1,6 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   *
> > - * Copyright (c) 2010-2017 Intel Corporation
> > + * Copyright (c) 2010-2019 Intel Corporation
> >   * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
> >   * All rights reserved.
> >   * Derived from FreeBSD's bufring.h
> > @@ -65,8 +65,8 @@ struct rte_memzone; /* forward declaration, so as not to require memzone.h */
> >
> >  /* structure to hold a pair of head/tail values and other metadata */
> >  struct rte_ring_headtail {
> > -	volatile uint32_t head;  /**< Prod/consumer head. */
> > -	volatile uint32_t tail;  /**< Prod/consumer tail. */
> > +	volatile uintptr_t head;  /**< Prod/consumer head. */
> > +	volatile uintptr_t tail;  /**< Prod/consumer tail. */
> >  	uint32_t single;         /**< True if single prod/cons */
> >  };
> 
> Isn't this a major ABI change which will break existing applications?

Correct, and this patch needs to be reworked with the RTE_NEXT_ABI ifdef, as described in the versioning guidelines. I had misunderstood the ABI change procedure, but I'll fix this in v2.
Eads, Gage Jan. 11, 2019, 7:12 p.m. | #7
> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Friday, January 11, 2019 4:25 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
> size
> 
> On 10-Jan-19 9:01 PM, Gage Eads wrote:
> > For 64-bit architectures, doubling the head and tail index widths
> > greatly increases the time it takes for them to wrap-around (with
> > current CPU speeds, it won't happen within the author's lifetime).
> > This is important in avoiding the ABA problem -- in which a thread
> > mistakes reading the same tail index in two accesses to mean that the
> > ring was not modified in the intervening time -- in the upcoming
> > non-blocking ring implementation. Using a 64-bit index makes the possibility
> > of this occurring effectively zero.
> >
> > I tested this commit's performance impact with an x86_64 build on a
> > dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change
> > made no significant difference -- the few differences appear to be system
> > noise.
> > (The test ran on isolcpus cores using a tickless scheduler, but some
> > variation was still observed.) Each test was run three times and the
> > results were averaged:
> >
> >                                    | 64b head/tail cycle cost minus
> >               Test                 |     32b head/tail cycle cost
> > ------------------------------------------------------------------
> > SP/SC single enq/dequeue          | 0.33
> > MP/MC single enq/dequeue          | 0.00
> > SP/SC burst enq/dequeue (size 8)  | 0.00
> > MP/MC burst enq/dequeue (size 8)  | 1.00
> > SP/SC burst enq/dequeue (size 32) | 0.00
> > MP/MC burst enq/dequeue (size 32) | -1.00
> > SC empty dequeue                  | 0.01
> > MC empty dequeue                  | 0.00
> >
> > Single lcore:
> > SP/SC bulk enq/dequeue (size 8)   | -0.36
> > MP/MC bulk enq/dequeue (size 8)   | 0.99
> > SP/SC bulk enq/dequeue (size 32)  | -0.40
> > MP/MC bulk enq/dequeue (size 32)  | -0.57
> >
> > Two physical cores:
> > SP/SC bulk enq/dequeue (size 8)   | -0.49
> > MP/MC bulk enq/dequeue (size 8)   | 0.19
> > SP/SC bulk enq/dequeue (size 32)  | -0.28
> > MP/MC bulk enq/dequeue (size 32)  | -0.62
> >
> > Two NUMA nodes:
> > SP/SC bulk enq/dequeue (size 8)   | 3.25
> > MP/MC bulk enq/dequeue (size 8)   | 1.87
> > SP/SC bulk enq/dequeue (size 32)  | -0.44
> > MP/MC bulk enq/dequeue (size 32)  | -1.10
> >
> > An earlier version of this patch changed the head and tail indexes to
> > uint64_t, but that caused a performance drop on 32-bit builds. With
> > uintptr_t, no performance difference is observed on an i686 build.
> >
> > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > ---
> 
> You're breaking the ABI - version bump for affected libraries is needed.
> 
> --
> Thanks,
> Anatoly

If I'm reading the versioning guidelines correctly, I'll need to gate the changes with the RTE_NEXT_ABI macro and provide a deprecation notice, then after a full deprecation cycle we can revert that and bump the library version. Not to mention the 3 ML ACKs.

I'll address this in v2.

Thanks,
Gage
Eads, Gage Jan. 11, 2019, 7:27 p.m. | #8
> -----Original Message-----
> From: Richardson, Bruce
> Sent: Friday, January 11, 2019 5:59 AM
> To: Burakov, Anatoly <anatoly.burakov@intel.com>
> Cc: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org;
> olivier.matz@6wind.com; arybchenko@solarflare.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
> size
> 
> On Fri, Jan 11, 2019 at 11:30:24AM +0000, Burakov, Anatoly wrote:
> > On 11-Jan-19 10:58 AM, Bruce Richardson wrote:
> > > On Fri, Jan 11, 2019 at 10:40:19AM +0000, Burakov, Anatoly wrote:
> > > > <...>
> > > >
> > > > > + * Copyright(c) 2016-2019 Intel Corporation
> > > > >     */
> > > > >    /**
> > > > > @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
> > > > >    		const struct rte_event *events,
> > > > >    		unsigned int n, uint16_t *free_space)
> > > > >    {
> > > > > -	uint32_t prod_head, prod_next;
> > > > > +	uintptr_t prod_head, prod_next;
> > > >
> > > > I would also question the use of uintptr_t. I think semantically,
> > > > size_t is more appropriate.
> > > >
> > > Yes, it would, but I believe in this case they want to use the
> > > largest size of (unsigned)int where there exists an atomic for
> > > manipulating 2 of them simultaneously. [The largest size is to
> > > minimize any chance of an ABA issue occurring]. Therefore we need
> > > 32-bit values on 32-bit and 64-bit on 64, and I suspect the best way
> > > to guarantee this is to use pointer-sized values. If size_t is
> > > guaranteed across all OS's to have the same size as uintptr_t it could
> > > also be used, though.
> > >
> > > /Bruce
> > >
> >
> > Technically, size_t and uintptr_t are not guaranteed to match. In
> > practice, they won't match only on architectures that DPDK doesn't
> > intend to run on (such as 16-bit segmented archs, where size_t would
> > be 16-bit but uintptr_t would be 32-bit).
> >
> > In all the rest of DPDK code, we use size_t for this kind of thing.
> >
> 
> Ok.
> If we do use size_t, I think we also need to add a compile-time check into the
> build too, to error out if sizeof(size_t) != sizeof(uintptr_t).

Ok, I wasn't aware of the precedent of using size_t for this purpose. I'll change it and look into adding a static assert.

Thanks,
Gage
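The compile-time check discussed above could look roughly like the sketch below. It uses the standard C11 _Static_assert and is only one possible form: the thread does not say which mechanism v2 will use (DPDK also provides its own RTE_BUILD_BUG_ON macro), and the helper function name is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * C11 compile-time check: the build fails on any platform where size_t
 * and uintptr_t differ in width, so size_t can safely stand in for a
 * pointer-width index.
 */
_Static_assert(sizeof(size_t) == sizeof(uintptr_t),
	       "size_t and uintptr_t must have the same width");

/* Hypothetical helper: width in bits of a size_t ring index. */
unsigned int ring_index_bits(void)
{
	return (unsigned int)(sizeof(size_t) * CHAR_BIT);
}
```

Because the assertion is evaluated by the compiler, it adds no run-time cost to the ring fast path.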
Stephen Hemminger Jan. 11, 2019, 7:55 p.m. | #9
On Fri, 11 Jan 2019 19:12:40 +0000
"Eads, Gage" <gage.eads@intel.com> wrote:

> > -----Original Message-----
> > From: Burakov, Anatoly
> > Sent: Friday, January 11, 2019 4:25 AM
> > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
> > size
> > 
> > On 10-Jan-19 9:01 PM, Gage Eads wrote:  
> > > For 64-bit architectures, doubling the head and tail index widths
> > > greatly increases the time it takes for them to wrap-around (with
> > > current CPU speeds, it won't happen within the author's lifetime).
> > > This is important in avoiding the ABA problem -- in which a thread
> > > mistakes reading the same tail index in two accesses to mean that the
> > > ring was not modified in the intervening time -- in the upcoming
> > > > non-blocking ring implementation. Using a 64-bit index makes the
> > > > possibility of this occurring effectively zero.
> > >
> > > I tested this commit's performance impact with an x86_64 build on a
> > > dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change
> > > > made no significant difference -- the few differences appear to be system
> > > > noise.
> > > > (The test ran on isolcpus cores using a tickless scheduler, but some
> > > > variation was still observed.) Each test was run three times and the
> > > > results were averaged:
> > >
> > >                                    | 64b head/tail cycle cost minus
> > >               Test                 |     32b head/tail cycle cost
> > > ------------------------------------------------------------------
> > > SP/SC single enq/dequeue          | 0.33
> > > MP/MC single enq/dequeue          | 0.00
> > > > SP/SC burst enq/dequeue (size 8)  | 0.00
> > > > MP/MC burst enq/dequeue (size 8)  | 1.00
> > > > SP/SC burst enq/dequeue (size 32) | 0.00
> > > > MP/MC burst enq/dequeue (size 32) | -1.00
> > > > SC empty dequeue                  | 0.01
> > > > MC empty dequeue                  | 0.00
> > > >
> > > > Single lcore:
> > > > SP/SC bulk enq/dequeue (size 8)   | -0.36
> > > > MP/MC bulk enq/dequeue (size 8)   | 0.99
> > > > SP/SC bulk enq/dequeue (size 32)  | -0.40
> > > > MP/MC bulk enq/dequeue (size 32)  | -0.57
> > > >
> > > > Two physical cores:
> > > > SP/SC bulk enq/dequeue (size 8)   | -0.49
> > > > MP/MC bulk enq/dequeue (size 8)   | 0.19
> > > > SP/SC bulk enq/dequeue (size 32)  | -0.28
> > > > MP/MC bulk enq/dequeue (size 32)  | -0.62
> > > >
> > > > Two NUMA nodes:
> > > > SP/SC bulk enq/dequeue (size 8)   | 3.25
> > > > MP/MC bulk enq/dequeue (size 8)   | 1.87
> > > > SP/SC bulk enq/dequeue (size 32)  | -0.44
> > > > MP/MC bulk enq/dequeue (size 32)  | -1.10
> > >
> > > An earlier version of this patch changed the head and tail indexes to
> > > uint64_t, but that caused a performance drop on 32-bit builds. With
> > > uintptr_t, no performance difference is observed on an i686 build.
> > >
> > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > ---  
> > 
> > You're breaking the ABI - version bump for affected libraries is needed.
> > 
> > --
> > Thanks,
> > Anatoly  
> 
> If I'm reading the versioning guidelines correctly, I'll need to gate the changes with the RTE_NEXT_ABI macro and provide a deprecation notice, then after a full deprecation cycle we can revert that and bump the library version. Not to mention the 3 ML ACKs.
> 
> I'll address this in v2.

My understanding is that the RTE_NEXT_ABI method is not used any more; it has been
replaced by rte_experimental. But this kind of change is more of a flag-day event,
which means it needs to be pushed off to a release that is planned as an ABI break
(usually once a year), which would mean 19.11.
Eads, Gage Jan. 15, 2019, 3:48 p.m. | #10
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, January 11, 2019 1:55 PM
> To: Eads, Gage <gage.eads@intel.com>
> Cc: Burakov, Anatoly <anatoly.burakov@intel.com>; dev@dpdk.org;
> olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
> size
> 
> On Fri, 11 Jan 2019 19:12:40 +0000
> "Eads, Gage" <gage.eads@intel.com> wrote:
> 
> > > -----Original Message-----
> > > From: Burakov, Anatoly
> > > Sent: Friday, January 11, 2019 4:25 AM
> > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> > > Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to
> > > pointer-width size
> > >
> > > On 10-Jan-19 9:01 PM, Gage Eads wrote:
> > > > For 64-bit architectures, doubling the head and tail index widths
> > > > greatly increases the time it takes for them to wrap-around (with
> > > > current CPU speeds, it won't happen within the author's lifetime).
> > > > This is important in avoiding the ABA problem -- in which a thread
> > > > mistakes reading the same tail index in two accesses to mean that
> > > > the ring was not modified in the intervening time -- in the
> > > > > upcoming non-blocking ring implementation. Using a 64-bit index
> > > > > makes the possibility of this occurring effectively zero.
> > > >
> > > > I tested this commit's performance impact with an x86_64 build on
> > > > a dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the
> > > > > change made no significant difference -- the few differences
> > > > > appear to be system noise.
> > > > > (The test ran on isolcpus cores using a tickless scheduler, but
> > > > > some variation was still observed.) Each test was run three times
> > > > > and the results were averaged:
> > > >
> > > >                                    | 64b head/tail cycle cost minus
> > > >               Test                 |     32b head/tail cycle cost
> > > > ------------------------------------------------------------------
> > > > SP/SC single enq/dequeue          | 0.33
> > > > MP/MC single enq/dequeue          | 0.00
> > > > > SP/SC burst enq/dequeue (size 8)  | 0.00
> > > > > MP/MC burst enq/dequeue (size 8)  | 1.00
> > > > > SP/SC burst enq/dequeue (size 32) | 0.00
> > > > > MP/MC burst enq/dequeue (size 32) | -1.00
> > > > SC empty dequeue                  | 0.01
> > > > MC empty dequeue                  | 0.00
> > > >
> > > > Single lcore:
> > > > SP/SC bulk enq/dequeue (size 8)   | -0.36
> > > > MP/MC bulk enq/dequeue (size 8)   | 0.99
> > > > > SP/SC bulk enq/dequeue (size 32)  | -0.40
> > > > > MP/MC bulk enq/dequeue (size 32)  | -0.57
> > > >
> > > > Two physical cores:
> > > > SP/SC bulk enq/dequeue (size 8)   | -0.49
> > > > MP/MC bulk enq/dequeue (size 8)   | 0.19
> > > > > SP/SC bulk enq/dequeue (size 32)  | -0.28
> > > > > MP/MC bulk enq/dequeue (size 32)  | -0.62
> > > >
> > > > Two NUMA nodes:
> > > > SP/SC bulk enq/dequeue (size 8)   | 3.25
> > > > MP/MC bulk enq/dequeue (size 8)   | 1.87
> > > > > SP/SC bulk enq/dequeue (size 32)  | -0.44
> > > > > MP/MC bulk enq/dequeue (size 32)  | -1.10
> > > >
> > > > An earlier version of this patch changed the head and tail indexes
> > > > to uint64_t, but that caused a performance drop on 32-bit builds.
> > > > With uintptr_t, no performance difference is observed on an i686 build.
> > > >
> > > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > > ---
> > >
> > > You're breaking the ABI - version bump for affected libraries is needed.
> > >
> > > --
> > > Thanks,
> > > Anatoly
> >
> > If I'm reading the versioning guidelines correctly, I'll need to gate the
> > changes with the RTE_NEXT_ABI macro and provide a deprecation notice, then
> > after a full deprecation cycle we can revert that and bump the library
> > version. Not to mention the 3 ML ACKs.
> >
> > I'll address this in v2.
> 
> My understanding is that RTE_NEXT_ABI method is not used any more. Replaced
> by rte_experimental.
> But this kind of change is more of a flag day event. Which means it needs to be
> pushed off to a release that is planned as an ABI break (usually once a year)
> which would mean 19.11.

In recent release notes, I see ABI changes can happen more frequently than once per year; 18.11, 18.05, 17.11, and 17.08 have ABI changes -- and soon 19.02 will too.

At any rate, I'll create a separate deprecation notice patch and update this patchset accordingly.

Thanks,
Gage
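
For readers following the ABI discussion above, here is a minimal sketch (not part of the patch) of why the type change alters the layout of the head/tail structure. `headtail_old` and `headtail_new` are hypothetical stand-ins that copy the three fields of struct rte_ring_headtail before and after the change, and `layout_changed()` is an illustrative helper. It says nothing about padding or alignment applied in the enclosing struct rte_ring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the fields before the patch (32-bit indexes). */
struct headtail_old {
	volatile uint32_t head;
	volatile uint32_t tail;
	uint32_t single;
};

/* Hypothetical copy of the fields after the patch (pointer-width indexes). */
struct headtail_new {
	volatile uintptr_t head;
	volatile uintptr_t tail;
	uint32_t single;
};

/* Returns 1 if the size or any member offset differs between the two layouts. */
static int layout_changed(void)
{
	/* Sanity check: three 32-bit fields pack without padding on common ABIs. */
	assert(sizeof(struct headtail_old) == 3 * sizeof(uint32_t));

	return sizeof(struct headtail_old) != sizeof(struct headtail_new) ||
	       offsetof(struct headtail_old, tail) !=
	       offsetof(struct headtail_new, tail) ||
	       offsetof(struct headtail_old, single) !=
	       offsetof(struct headtail_new, single);
}
```

On a build where uintptr_t is 64 bits wide, `tail` and `single` move to new offsets, so code compiled against the old header would read the wrong fields. On a build where uintptr_t is 32 bits wide, the two layouts are identical.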

Patch

diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h
index 827a3209e..eae70f904 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -1,5 +1,5 @@ 
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2016-2017 Intel Corporation
+ * Copyright(c) 2016-2019 Intel Corporation
  */
 
 /**
@@ -88,7 +88,7 @@  rte_event_ring_enqueue_burst(struct rte_event_ring *r,
 		const struct rte_event *events,
 		unsigned int n, uint16_t *free_space)
 {
-	uint32_t prod_head, prod_next;
+	uintptr_t prod_head, prod_next;
 	uint32_t free_entries;
 
 	n = __rte_ring_move_prod_head(&r->r, r->r.prod.single, n,
@@ -129,7 +129,7 @@  rte_event_ring_dequeue_burst(struct rte_event_ring *r,
 		struct rte_event *events,
 		unsigned int n, uint16_t *available)
 {
-	uint32_t cons_head, cons_next;
+	uintptr_t cons_head, cons_next;
 	uint32_t entries;
 
 	n = __rte_ring_move_cons_head(&r->r, r->r.cons.single, n,
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d215acecc..b15ee0eb3 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -1,6 +1,6 @@ 
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2015 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007,2008 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -227,10 +227,10 @@  rte_ring_dump(FILE *f, const struct rte_ring *r)
 	fprintf(f, "  flags=%x\n", r->flags);
 	fprintf(f, "  size=%"PRIu32"\n", r->size);
 	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
-	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
-	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
-	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
-	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	fprintf(f, "  ct=%"PRIuPTR"\n", r->cons.tail);
+	fprintf(f, "  ch=%"PRIuPTR"\n", r->cons.head);
+	fprintf(f, "  pt=%"PRIuPTR"\n", r->prod.tail);
+	fprintf(f, "  ph=%"PRIuPTR"\n", r->prod.head);
 	fprintf(f, "  used=%u\n", rte_ring_count(r));
 	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
 }
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index af5444a9f..12af64e13 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -1,6 +1,6 @@ 
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -65,8 +65,8 @@  struct rte_memzone; /* forward declaration, so as not to require memzone.h */
 
 /* structure to hold a pair of head/tail values and other metadata */
 struct rte_ring_headtail {
-	volatile uint32_t head;  /**< Prod/consumer head. */
-	volatile uint32_t tail;  /**< Prod/consumer tail. */
+	volatile uintptr_t head;  /**< Prod/consumer head. */
+	volatile uintptr_t tail;  /**< Prod/consumer tail. */
 	uint32_t single;         /**< True if single prod/cons */
 };
 
@@ -242,7 +242,7 @@  void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #define ENQUEUE_PTRS(r, ring_start, prod_head, obj_table, n, obj_type) do { \
 	unsigned int i; \
 	const uint32_t size = (r)->size; \
-	uint32_t idx = prod_head & (r)->mask; \
+	uintptr_t idx = prod_head & (r)->mask; \
 	obj_type *ring = (obj_type *)ring_start; \
 	if (likely(idx + n < size)) { \
 		for (i = 0; i < (n & ((~(unsigned)0x3))); i+=4, idx+=4) { \
@@ -272,7 +272,7 @@  void rte_ring_dump(FILE *f, const struct rte_ring *r);
  * single and multi consumer dequeue functions */
 #define DEQUEUE_PTRS(r, ring_start, cons_head, obj_table, n, obj_type) do { \
 	unsigned int i; \
-	uint32_t idx = cons_head & (r)->mask; \
+	uintptr_t idx = cons_head & (r)->mask; \
 	const uint32_t size = (r)->size; \
 	obj_type *ring = (obj_type *)ring_start; \
 	if (likely(idx + n < size)) { \
@@ -338,7 +338,7 @@  __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
 		 unsigned int n, enum rte_ring_queue_behavior behavior,
 		 unsigned int is_sp, unsigned int *free_space)
 {
-	uint32_t prod_head, prod_next;
+	uintptr_t prod_head, prod_next;
 	uint32_t free_entries;
 
 	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
@@ -380,7 +380,7 @@  __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table,
 		 unsigned int n, enum rte_ring_queue_behavior behavior,
 		 unsigned int is_sc, unsigned int *available)
 {
-	uint32_t cons_head, cons_next;
+	uintptr_t cons_head, cons_next;
 	uint32_t entries;
 
 	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
@@ -681,9 +681,9 @@  rte_ring_dequeue(struct rte_ring *r, void **obj_p)
 static inline unsigned
 rte_ring_count(const struct rte_ring *r)
 {
-	uint32_t prod_tail = r->prod.tail;
-	uint32_t cons_tail = r->cons.tail;
-	uint32_t count = (prod_tail - cons_tail) & r->mask;
+	uintptr_t prod_tail = r->prod.tail;
+	uintptr_t cons_tail = r->cons.tail;
+	uintptr_t count = (prod_tail - cons_tail) & r->mask;
 	return (count > r->capacity) ? r->capacity : count;
 }
 
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..3fd1150f6 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -1,6 +1,6 @@ 
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -11,7 +11,7 @@ 
 #define _RTE_RING_GENERIC_H_
 
 static __rte_always_inline void
-update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
+update_tail(struct rte_ring_headtail *ht, uintptr_t old_val, uintptr_t new_val,
 		uint32_t single, uint32_t enqueue)
 {
 	if (enqueue)
@@ -55,7 +55,7 @@  update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 static __rte_always_inline unsigned int
 __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		unsigned int n, enum rte_ring_queue_behavior behavior,
-		uint32_t *old_head, uint32_t *new_head,
+		uintptr_t *old_head, uintptr_t *new_head,
 		uint32_t *free_entries)
 {
 	const uint32_t capacity = r->capacity;
@@ -93,7 +93,8 @@  __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
-			success = rte_atomic32_cmpset(&r->prod.head,
+			/* Built-in used to handle variable-sized head index. */
+			success = __sync_bool_compare_and_swap(&r->prod.head,
 					*old_head, *new_head);
 	} while (unlikely(success == 0));
 	return n;
@@ -125,7 +126,7 @@  __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 static __rte_always_inline unsigned int
 __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 		unsigned int n, enum rte_ring_queue_behavior behavior,
-		uint32_t *old_head, uint32_t *new_head,
+		uintptr_t *old_head, uintptr_t *new_head,
 		uint32_t *entries)
 {
 	unsigned int max = n;
@@ -161,8 +162,9 @@  __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
-			success = rte_atomic32_cmpset(&r->cons.head, *old_head,
-					*new_head);
+			/* Built-in used to handle variable-sized head index. */
+			success = __sync_bool_compare_and_swap(&r->cons.head,
+					*old_head, *new_head);
 	} while (unlikely(success == 0));
 	return n;
 }
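
The two ideas the diff relies on can be shown in isolation. The following is a minimal sketch under stated assumptions, not the DPDK implementation: `claim_slots()` and `used_entries()` are hypothetical helpers, the capacity checks and memory barriers of the real code are omitted, and `mask` is assumed to be a power of two minus one. It uses the same GCC/Clang built-in as the patch, `__sync_bool_compare_and_swap`, which works on whatever width uintptr_t has on the target.

```c
#include <stdint.h>

/*
 * Hypothetical helper: advance a shared, free-running index by n using a
 * compare-and-swap retry loop. Returns the index value that was claimed.
 */
static uintptr_t claim_slots(volatile uintptr_t *head, uintptr_t n)
{
	uintptr_t old_head, new_head;

	do {
		old_head = *head;
		new_head = old_head + n; /* unsigned add wraps modulo 2^N */
	} while (!__sync_bool_compare_and_swap(head, old_head, new_head));

	return old_head;
}

/*
 * Hypothetical helper: number of entries between two free-running indexes.
 * Unsigned subtraction is modular, so the result stays correct even when
 * prod_tail has wrapped past zero and cons_tail has not.
 */
static uintptr_t used_entries(uintptr_t prod_tail, uintptr_t cons_tail,
			      uintptr_t mask)
{
	return (prod_tail - cons_tail) & mask;
}
```

Because the indexes are free-running and only masked when used, widening them does not change the arithmetic. It only lengthens the period before a given index value can recur.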