lib/distributor: fix deadlock issue for aarch64

Message ID 20191008095524.1585-1-ruifeng.wang@arm.com (mailing list archive)
State Superseded, archived
Series lib/distributor: fix deadlock issue for aarch64

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-compilation success Compile Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/Intel-compilation fail Compilation issues
ci/iol-mellanox-Performance success Performance Testing PASS

Commit Message

Ruifeng Wang Oct. 8, 2019, 9:55 a.m. UTC
  Distributor and worker threads rely on data structs in cache line
for synchronization. The shared data structs were not protected.
This caused a deadlock issue on weaker memory ordering platforms
such as aarch64.
Fix this issue by adding memory barriers to ensure synchronization
among cores.

Bugzilla ID: 342
Fixes: 775003ad2f96 ("distributor: add new burst-capable library")
Cc: stable@dpdk.org

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_distributor/rte_distributor.c     | 28 ++++++++++------
 lib/librte_distributor/rte_distributor_v20.c | 34 +++++++++++++-------
 2 files changed, 41 insertions(+), 21 deletions(-)
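
For background, the synchronization pattern the patch introduces is the
classic message-passing handshake: the writer fills in the shared cache
line and then sets a flag word with a store-release, while the reader
spins on the flag with a load-acquire so it cannot observe the flag
before the data it guards. A minimal sketch of that idiom in C
(illustrative names only, not the distributor code itself):

#include <stdint.h>

#define FLAG_READY (1ULL << 0)  /* stand-in for RTE_DISTRIB_GET_BUF */

/* Writer: publish the payload, then set the flag with release
 * semantics so the payload store cannot be reordered after the flag
 * becomes visible to the reader.
 */
static void
publish(uint64_t *word, uint64_t payload)
{
        __atomic_store_n(word, (payload << 1) | FLAG_READY,
                        __ATOMIC_RELEASE);
}

/* Reader: spin with acquire semantics so subsequent reads of the
 * cache line cannot be hoisted above the flag check.
 */
static uint64_t
consume(uint64_t *word)
{
        uint64_t v;

        while (!((v = __atomic_load_n(word, __ATOMIC_ACQUIRE))
                        & FLAG_READY))
                ;       /* the real code calls rte_pause() here */
        return v >> 1;
}

On x86 these compile to ordinary loads and stores (plus a compiler
barrier), which is why the unprotected code happened to work there; on
aarch64 they emit load-acquire/store-release instructions, which is
what the patch relies on.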
  

Comments

Hunt, David Oct. 8, 2019, 12:53 p.m. UTC | #1
On 08/10/2019 10:55, Ruifeng Wang wrote:
> Distributor and worker threads rely on data structs in cache line
> for synchronization. The shared data structs were not protected.
> This caused a deadlock issue on weaker memory ordering platforms
> such as aarch64.
> Fix this issue by adding memory barriers to ensure synchronization
> among cores.
>
> Bugzilla ID: 342
> Fixes: 775003ad2f96 ("distributor: add new burst-capable library")
> Cc: stable@dpdk.org
>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> ---
>   lib/librte_distributor/rte_distributor.c     | 28 ++++++++++------
>   lib/librte_distributor/rte_distributor_v20.c | 34 +++++++++++++-------
>   2 files changed, 41 insertions(+), 21 deletions(-)
>
--snip--

I tested this on my system, and saw no performance degradation. Looks 
good. Thanks.

Acked-by: David Hunt <david.hunt@intel.com>
  
Aaron Conole Oct. 8, 2019, 5:05 p.m. UTC | #2
Ruifeng Wang <ruifeng.wang@arm.com> writes:

> Distributor and worker threads rely on data structs in cache line
> for synchronization. The shared data structs were not protected.
> This caused a deadlock issue on weaker memory ordering platforms
> such as aarch64.
> Fix this issue by adding memory barriers to ensure synchronization
> among cores.
>
> Bugzilla ID: 342
> Fixes: 775003ad2f96 ("distributor: add new burst-capable library")
> Cc: stable@dpdk.org
>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> ---

I see a failure in the distributor_autotest (on one of the builds):

64/82 DPDK:fast-tests / distributor_autotest  FAIL     0.37 s (exit status 255 or signal 127 SIGinvalid)

--- command ---

DPDK_TEST='distributor_autotest' /home/travis/build/ovsrobot/dpdk/build/app/test/dpdk-test -l 0-1 --file-prefix=distributor_autotest

--- stdout ---

EAL: Probing VFIO support...

APP: HPET is not enabled, using TSC as default timer

RTE>>distributor_autotest

=== Basic distributor sanity tests ===

Worker 0 handled 32 packets

Sanity test with all zero hashes done.

Worker 0 handled 32 packets

Sanity test with non-zero hashes done

=== testing big burst (single) ===

Sanity test of returned packets done

=== Sanity test with mbuf alloc/free (single) ===

Sanity test with mbuf alloc/free passed

Too few cores to run worker shutdown test

=== Basic distributor sanity tests ===

Worker 0 handled 32 packets

Sanity test with all zero hashes done.

Worker 0 handled 32 packets

Sanity test with non-zero hashes done

=== testing big burst (burst) ===

Sanity test of returned packets done

=== Sanity test with mbuf alloc/free (burst) ===

Line 326: Packet count is incorrect, 1048568, expected 1048576

Test Failed

RTE>>

--- stderr ---

EAL: Detected 2 lcore(s)

EAL: Detected 1 NUMA nodes

EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket

EAL: Selected IOVA mode 'PA'

EAL: No available hugepages reported in hugepages-1048576kB

-------

Not sure how to help debug further.  I'll re-start the job to see if
it 'clears' up - but I guess there may be a delicate synchronization
somewhere that needs to be accounted for.
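
One mechanism that would fit a short count like this is a lost update:
even with the patch, the handshake word is still updated with a plain
read-modify-write (a load, an OR, then a release store), so a
concurrent store from the other core can fall between the load and the
store and be overwritten. A deliberately racy toy program showing a
plain |= losing updates (illustrative only, not the distributor code;
volatile forces real loads and stores but does not make them atomic):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static volatile uint64_t flags; /* shared word, deliberately not atomic */

/* Each thread ORs in its own bit, checks that it stuck, then clears
 * it.  Both the |= and the &= are non-atomic load/modify/store
 * sequences, so one thread's store can wipe out the other's update.
 */
static void *
setter(void *arg)
{
        uint64_t bit = 1ULL << (uintptr_t)arg;
        unsigned long lost = 0;
        long i;

        for (i = 0; i < 10000000; i++) {
                flags |= bit;           /* load, OR, store */
                if (!(flags & bit))
                        lost++;         /* our bit was overwritten */
                flags &= ~bit;          /* load, AND, store */
        }
        printf("bit %u: %lu lost updates\n",
                        (unsigned int)(uintptr_t)arg, lost);
        return NULL;
}

int
main(void)
{
        pthread_t t0, t1;

        pthread_create(&t0, NULL, setter, (void *)0);
        pthread_create(&t1, NULL, setter, (void *)1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
}

Rewriting the two plain RMWs as __atomic_fetch_or()/__atomic_fetch_and()
makes the lost counts go to zero.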

> [snip patch]
  
David Marchand Oct. 8, 2019, 7:46 p.m. UTC | #3
On Tue, Oct 8, 2019 at 7:06 PM Aaron Conole <aconole@redhat.com> wrote:
>
> Ruifeng Wang <ruifeng.wang@arm.com> writes:
>
> > [snip]
>
> I see a failure in the distributor_autotest (on one of the builds):
>
> [snip]
>
> Not sure how to help debug further.  I'll re-start the job to see if
> it 'clears' up - but I guess there may be a delicate synchronization
> somewhere that needs to be accounted for.

Idem, and with the same loop I used before, it can be caught quickly.

# time (log=/tmp/$$.log; while true; do echo distributor_autotest
|taskset -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8
-l 0-1 >$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm
-f $log)

[snip]

RTE>>distributor_autotest
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: Restoring previous memory policy: 0
EAL: request: mp_malloc_sync
EAL: Heap on socket 0 was expanded by 2MB
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: Restoring previous memory policy: 0
EAL: alloc_pages_on_heap(): couldn't allocate physically contiguous space
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: Restoring previous memory policy: 0
EAL: request: mp_malloc_sync
EAL: Heap on socket 0 was expanded by 8MB
=== Basic distributor sanity tests ===
Worker 0 handled 32 packets
Sanity test with all zero hashes done.
Worker 0 handled 32 packets
Sanity test with non-zero hashes done
=== testing big burst (single) ===
Sanity test of returned packets done

=== Sanity test with mbuf alloc/free (single) ===
Sanity test with mbuf alloc/free passed

Too few cores to run worker shutdown test
=== Basic distributor sanity tests ===
Worker 0 handled 32 packets
Sanity test with all zero hashes done.
Worker 0 handled 32 packets
Sanity test with non-zero hashes done
=== testing big burst (burst) ===
Sanity test of returned packets done

=== Sanity test with mbuf alloc/free (burst) ===
Line 326: Packet count is incorrect, 1048568, expected 1048576
Test Failed
RTE>>
real    0m36.668s
user    1m7.293s
sys    0m1.560s

Could be worth running this loop on all tests? (not talking about the
CI, it would be a manual effort to catch lurking issues).
  
Aaron Conole Oct. 8, 2019, 8:08 p.m. UTC | #4
David Marchand <david.marchand@redhat.com> writes:

> On Tue, Oct 8, 2019 at 7:06 PM Aaron Conole <aconole@redhat.com> wrote:
>>
>> Ruifeng Wang <ruifeng.wang@arm.com> writes:
>>
>> > [snip]
>>
>> I see a failure in the distributor_autotest (on one of the builds):
>>
>> [snip]
>>
>> Not sure how to help debug further.  I'll re-start the job to see if
>> it 'clears' up - but I guess there may be a delicate synchronization
>> somewhere that needs to be accounted for.
>
> Idem, and with the same loop I used before, it can be caught quickly.
>
> # time (log=/tmp/$$.log; while true; do echo distributor_autotest
> |taskset -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8
> -l 0-1 >$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm
> -f $log)

Probably good to document it, yes.  It seems to be a good technique for
reproducing failures.

> [snip]
>
> Could be worth running this loop on all tests? (not talking about the
> CI, it would be a manual effort to catch lurking issues).
  
Ruifeng Wang Oct. 9, 2019, 5:52 a.m. UTC | #5
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Wednesday, October 9, 2019 03:47
> To: Aaron Conole <aconole@redhat.com>
> Cc: Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com>; David
> Hunt <david.hunt@intel.com>; dev <dev@dpdk.org>; hkalra@marvell.com;
> Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; dpdk
> stable <stable@dpdk.org>
> Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] lib/distributor: fix deadlock
> issue for aarch64
> 
> On Tue, Oct 8, 2019 at 7:06 PM Aaron Conole <aconole@redhat.com> wrote:
> >
> > Ruifeng Wang <ruifeng.wang@arm.com> writes:
> >
> > > Distributor and worker threads rely on data structs in cache line
> > > for synchronization. The shared data structs were not protected.
> > > This caused deadlock issue on weaker memory ordering platforms as
> > > aarch64.
> > > Fix this issue by adding memory barriers to ensure synchronization
> > > among cores.
> > >
> > > Bugzilla ID: 342
> > > Fixes: 775003ad2f96 ("distributor: add new burst-capable library")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > ---
> >
> > I see a failure in the distributor_autotest (on one of the builds):
> >
> > 64/82 DPDK:fast-tests / distributor_autotest  FAIL     0.37 s (exit status 255
> or signal 127 SIGinvalid)
> >
> > --- command ---
> >
> > DPDK_TEST='distributor_autotest'
> > /home/travis/build/ovsrobot/dpdk/build/app/test/dpdk-test -l 0-1
> > --file-prefix=distributor_autotest
> >
> > --- stdout ---
> >
> > EAL: Probing VFIO support...
> >
> > APP: HPET is not enabled, using TSC as default timer
> >
> > RTE>>distributor_autotest
> >
> > === Basic distributor sanity tests ===
> >
> > Worker 0 handled 32 packets
> >
> > Sanity test with all zero hashes done.
> >
> > Worker 0 handled 32 packets
> >
> > Sanity test with non-zero hashes done
> >
> > === testing big burst (single) ===
> >
> > Sanity test of returned packets done
> >
> > === Sanity test with mbuf alloc/free (single) ===
> >
> > Sanity test with mbuf alloc/free passed
> >
> > Too few cores to run worker shutdown test
> >
> > === Basic distributor sanity tests ===
> >
> > Worker 0 handled 32 packets
> >
> > Sanity test with all zero hashes done.
> >
> > Worker 0 handled 32 packets
> >
> > Sanity test with non-zero hashes done
> >
> > === testing big burst (burst) ===
> >
> > Sanity test of returned packets done
> >
> > === Sanity test with mbuf alloc/free (burst) ===
> >
> > Line 326: Packet count is incorrect, 1048568, expected 1048576
> >
> > Test Failed
> >
> > RTE>>
> >
> > --- stderr ---
> >
> > EAL: Detected 2 lcore(s)
> >
> > EAL: Detected 1 NUMA nodes
> >
> > EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket
> >
> > EAL: Selected IOVA mode 'PA'
> >
> > EAL: No available hugepages reported in hugepages-1048576kB
> >
> > -------
> >
> > Not sure how to help debug further.  I'll re-start the job to see if
> > it 'clears' up - but I guess there may be a delicate synchronization
> > somewhere that needs to be accounted.
> 
> Idem, and with the same loop I used before, it can be caught quickly.
> 
> # time (log=/tmp/$$.log; while true; do echo distributor_autotest
> |taskset -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8
> -l 0-1 >$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f $log)
> 
Thanks Aaron and David for your report. I can reproduce this issue with the script.
Will fix it in next version.

> [snip]
  
Harman Kalra Oct. 17, 2019, 11:42 a.m. UTC | #6
Hi

I tested this patch; following are my observations:
1. With this patch, distributor_autotest getting suspended on the arm64
platform is resolved. But continuous execution of this test results in test
failure, as reported by Aaron.
2. While testing on an x86 platform, I can still observe distributor_autotest
getting suspended (stuck) on continuous execution of the test (it took almost
7-8 iterations to reproduce the suspension).

Thanks

On Wed, Oct 09, 2019 at 05:52:03AM +0000, Ruifeng Wang (Arm Technology China) wrote:
> [snip]
  
Ruifeng Wang Oct. 17, 2019, 1:48 p.m. UTC | #7
Hi Harman,

Thank you for testing this.

> -----Original Message-----
> From: Harman Kalra <hkalra@marvell.com>
> Sent: Thursday, October 17, 2019 19:42
> To: Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com>
> Cc: David Marchand <david.marchand@redhat.com>; Aaron Conole
> <aconole@redhat.com>; David Hunt <david.hunt@intel.com>; dev
> <dev@dpdk.org>; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd
> <nd@arm.com>; dpdk stable <stable@dpdk.org>
> Subject: Re: [EXT] RE: [dpdk-stable] [dpdk-dev] [PATCH] lib/distributor: fix
> deadlock issue for aarch64
> 
> Hi
> 
> I tested this patch; following are my observations:
> 1. With this patch, distributor_autotest getting suspended on the arm64
> platform is resolved. But continuous execution of this test results in test
> failure, as reported by Aaron.
> 2. While testing on an x86 platform, I can still observe distributor_autotest
> getting suspended (stuck) on continuous execution of the test (it took almost
> 7-8 iterations to reproduce the suspension).

Yes, this v1 patch does not completely solve the issue.
I have posted v3:
http://patches.dpdk.org/project/dpdk/list/?series=6856
With the new patch set, I didn't observe test failure in my test.
Will you try that?
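
For reference, one way to close that kind of remaining window is to
make the flag update a single atomic read-modify-write instead of a
separate load and release store. A sketch only (the constant value is
illustrative, and this is not necessarily how the v3 series does it):

#include <stdint.h>

#define RTE_DISTRIB_GET_BUF (1ULL << 1) /* illustrative value */

/* v1 style: the plain read and the release store are two steps, so a
 * concurrent clear of the word by the other core can be lost in
 * between.
 */
static void
set_flag_v1(uint64_t *retptr64)
{
        __atomic_store_n(retptr64, *retptr64 | RTE_DISTRIB_GET_BUF,
                        __ATOMIC_RELEASE);
}

/* Single-step alternative: an atomic OR cannot lose a concurrent
 * update to the same word.
 */
static void
set_flag_rmw(uint64_t *retptr64)
{
        __atomic_fetch_or(retptr64, RTE_DISTRIB_GET_BUF, __ATOMIC_RELEASE);
}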

Thanks.
/Ruifeng
> [snip]
  

Patch

diff --git a/lib/librte_distributor/rte_distributor.c b/lib/librte_distributor/rte_distributor.c
index 21eb1fb0a..7bf96e224 100644
--- a/lib/librte_distributor/rte_distributor.c
+++ b/lib/librte_distributor/rte_distributor.c
@@ -50,7 +50,8 @@  rte_distributor_request_pkt_v1705(struct rte_distributor *d,
 
 	retptr64 = &(buf->retptr64[0]);
 	/* Spin while handshake bits are set (scheduler clears it) */
-	while (unlikely(*retptr64 & RTE_DISTRIB_GET_BUF)) {
+	while (unlikely(__atomic_load_n(retptr64, __ATOMIC_ACQUIRE)
+			& RTE_DISTRIB_GET_BUF)) {
 		rte_pause();
 		uint64_t t = rte_rdtsc()+100;
 
@@ -76,7 +77,8 @@  rte_distributor_request_pkt_v1705(struct rte_distributor *d,
 	 * Finally, set the GET_BUF  to signal to distributor that cache
 	 * line is ready for processing
 	 */
-	*retptr64 |= RTE_DISTRIB_GET_BUF;
+	__atomic_store_n(retptr64, *retptr64 | RTE_DISTRIB_GET_BUF,
+			__ATOMIC_RELEASE);
 }
 BIND_DEFAULT_SYMBOL(rte_distributor_request_pkt, _v1705, 17.05);
 MAP_STATIC_SYMBOL(void rte_distributor_request_pkt(struct rte_distributor *d,
@@ -99,7 +101,8 @@  rte_distributor_poll_pkt_v1705(struct rte_distributor *d,
 	}
 
 	/* If bit is set, return */
-	if (buf->bufptr64[0] & RTE_DISTRIB_GET_BUF)
+	if (__atomic_load_n(&(buf->bufptr64[0]), __ATOMIC_ACQUIRE)
+		& RTE_DISTRIB_GET_BUF)
 		return -1;
 
 	/* since bufptr64 is signed, this should be an arithmetic shift */
@@ -116,6 +119,8 @@  rte_distributor_poll_pkt_v1705(struct rte_distributor *d,
 	 * on the next cacheline while we're working.
 	 */
 	buf->bufptr64[0] |= RTE_DISTRIB_GET_BUF;
+	__atomic_store_n(&(buf->bufptr64[0]),
+		buf->bufptr64[0] | RTE_DISTRIB_GET_BUF, __ATOMIC_RELEASE);
 
 	return count;
 }
@@ -183,7 +188,8 @@  rte_distributor_return_pkt_v1705(struct rte_distributor *d,
 			RTE_DISTRIB_FLAG_BITS) | RTE_DISTRIB_RETURN_BUF;
 
 	/* set the GET_BUF but even if we got no returns */
-	buf->retptr64[0] |= RTE_DISTRIB_GET_BUF;
+	__atomic_store_n(&(buf->retptr64[0]),
+		buf->retptr64[0] | RTE_DISTRIB_GET_BUF, __ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -273,7 +279,8 @@  handle_returns(struct rte_distributor *d, unsigned int wkr)
 	unsigned int count = 0;
 	unsigned int i;
 
-	if (buf->retptr64[0] & RTE_DISTRIB_GET_BUF) {
+	if (__atomic_load_n(&(buf->retptr64[0]), __ATOMIC_ACQUIRE)
+		& RTE_DISTRIB_GET_BUF) {
 		for (i = 0; i < RTE_DIST_BURST_SIZE; i++) {
 			if (buf->retptr64[i] & RTE_DISTRIB_RETURN_BUF) {
 				oldbuf = ((uintptr_t)(buf->retptr64[i] >>
@@ -287,7 +294,7 @@  handle_returns(struct rte_distributor *d, unsigned int wkr)
 		d->returns.start = ret_start;
 		d->returns.count = ret_count;
 		/* Clear for the worker to populate with more returns */
-		buf->retptr64[0] = 0;
+		__atomic_store_n(&(buf->retptr64[0]), 0, __ATOMIC_RELEASE);
 	}
 	return count;
 }
@@ -307,7 +314,8 @@  release(struct rte_distributor *d, unsigned int wkr)
 	struct rte_distributor_buffer *buf = &(d->bufs[wkr]);
 	unsigned int i;
 
-	while (!(d->bufs[wkr].bufptr64[0] & RTE_DISTRIB_GET_BUF))
+	while (!(__atomic_load_n(&(d->bufs[wkr].bufptr64[0]), __ATOMIC_ACQUIRE)
+		& RTE_DISTRIB_GET_BUF))
 		rte_pause();
 
 	handle_returns(d, wkr);
@@ -328,7 +336,8 @@  release(struct rte_distributor *d, unsigned int wkr)
 	d->backlog[wkr].count = 0;
 
 	/* Clear the GET bit */
-	buf->bufptr64[0] &= ~RTE_DISTRIB_GET_BUF;
+	__atomic_store_n(&(buf->bufptr64[0]),
+		buf->bufptr64[0] & ~RTE_DISTRIB_GET_BUF, __ATOMIC_RELEASE);
 	return  buf->count;
 
 }
@@ -574,7 +583,8 @@  rte_distributor_clear_returns_v1705(struct rte_distributor *d)
 
 	/* throw away returns, so workers can exit */
 	for (wkr = 0; wkr < d->num_workers; wkr++)
-		d->bufs[wkr].retptr64[0] = 0;
+		__atomic_store_n(&(d->bufs[wkr].retptr64[0]), 0,
+				__ATOMIC_RELEASE);
 }
 BIND_DEFAULT_SYMBOL(rte_distributor_clear_returns, _v1705, 17.05);
 MAP_STATIC_SYMBOL(void rte_distributor_clear_returns(struct rte_distributor *d),
diff --git a/lib/librte_distributor/rte_distributor_v20.c b/lib/librte_distributor/rte_distributor_v20.c
index cdc0969a8..3a5810c6d 100644
--- a/lib/librte_distributor/rte_distributor_v20.c
+++ b/lib/librte_distributor/rte_distributor_v20.c
@@ -34,9 +34,10 @@  rte_distributor_request_pkt_v20(struct rte_distributor_v20 *d,
 	union rte_distributor_buffer_v20 *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	while (unlikely(buf->bufptr64 & RTE_DISTRIB_FLAGS_MASK))
+	while (unlikely(__atomic_load_n(&(buf->bufptr64), __ATOMIC_ACQUIRE)
+		& RTE_DISTRIB_FLAGS_MASK))
 		rte_pause();
-	buf->bufptr64 = req;
+	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
 }
 VERSION_SYMBOL(rte_distributor_request_pkt, _v20, 2.0);
 
@@ -45,7 +46,8 @@  rte_distributor_poll_pkt_v20(struct rte_distributor_v20 *d,
 		unsigned worker_id)
 {
 	union rte_distributor_buffer_v20 *buf = &d->bufs[worker_id];
-	if (buf->bufptr64 & RTE_DISTRIB_GET_BUF)
+	if (__atomic_load_n(&(buf->bufptr64), __ATOMIC_ACQUIRE)
+		& RTE_DISTRIB_GET_BUF)
 		return NULL;
 
 	/* since bufptr64 is signed, this should be an arithmetic shift */
@@ -73,7 +75,7 @@  rte_distributor_return_pkt_v20(struct rte_distributor_v20 *d,
 	union rte_distributor_buffer_v20 *buf = &d->bufs[worker_id];
 	uint64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_RETURN_BUF;
-	buf->bufptr64 = req;
+	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
 	return 0;
 }
 VERSION_SYMBOL(rte_distributor_return_pkt, _v20, 2.0);
@@ -117,7 +119,7 @@  handle_worker_shutdown(struct rte_distributor_v20 *d, unsigned int wkr)
 {
 	d->in_flight_tags[wkr] = 0;
 	d->in_flight_bitmask &= ~(1UL << wkr);
-	d->bufs[wkr].bufptr64 = 0;
+	__atomic_store_n(&(d->bufs[wkr].bufptr64), 0, __ATOMIC_RELEASE);
 	if (unlikely(d->backlog[wkr].count != 0)) {
 		/* On return of a packet, we need to move the
 		 * queued packets for this core elsewhere.
@@ -165,13 +167,17 @@  process_returns(struct rte_distributor_v20 *d)
 		const int64_t data = d->bufs[wkr].bufptr64;
 		uintptr_t oldbuf = 0;
 
-		if (data & RTE_DISTRIB_GET_BUF) {
+		if (__atomic_load_n(&data, __ATOMIC_ACQUIRE)
+			& RTE_DISTRIB_GET_BUF) {
 			flushed++;
 			if (d->backlog[wkr].count)
-				d->bufs[wkr].bufptr64 =
-						backlog_pop(&d->backlog[wkr]);
+				__atomic_store_n(&(d->bufs[wkr].bufptr64),
+					backlog_pop(&d->backlog[wkr]),
+					__ATOMIC_RELEASE);
 			else {
-				d->bufs[wkr].bufptr64 = RTE_DISTRIB_GET_BUF;
+				__atomic_store_n(&(d->bufs[wkr].bufptr64),
+					RTE_DISTRIB_GET_BUF,
+					__ATOMIC_RELEASE);
 				d->in_flight_tags[wkr] = 0;
 				d->in_flight_bitmask &= ~(1UL << wkr);
 			}
@@ -251,7 +257,8 @@  rte_distributor_process_v20(struct rte_distributor_v20 *d,
 			}
 		}
 
-		if ((data & RTE_DISTRIB_GET_BUF) &&
+		if ((__atomic_load_n(&data, __ATOMIC_ACQUIRE)
+			& RTE_DISTRIB_GET_BUF) &&
 				(d->backlog[wkr].count || next_mb)) {
 
 			if (d->backlog[wkr].count)
@@ -280,13 +287,16 @@  rte_distributor_process_v20(struct rte_distributor_v20 *d,
 	 * if they are ready */
 	for (wkr = 0; wkr < d->num_workers; wkr++)
 		if (d->backlog[wkr].count &&
-				(d->bufs[wkr].bufptr64 & RTE_DISTRIB_GET_BUF)) {
+				(__atomic_load_n(&(d->bufs[wkr].bufptr64),
+				__ATOMIC_ACQUIRE) & RTE_DISTRIB_GET_BUF)) {
 
 			int64_t oldbuf = d->bufs[wkr].bufptr64 >>
 					RTE_DISTRIB_FLAG_BITS;
 			store_return(oldbuf, d, &ret_start, &ret_count);
 
-			d->bufs[wkr].bufptr64 = backlog_pop(&d->backlog[wkr]);
+			__atomic_store_n(&(d->bufs[wkr].bufptr64),
+				backlog_pop(&d->backlog[wkr]),
+				__ATOMIC_RELEASE);
 		}
 
 	d->returns.start = ret_start;