[v3,11/12] service: optimize with c11 one-way barrier

Message ID 1584407863-774-12-git-send-email-phil.yang@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series generic rte atomic APIs deprecate proposal |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation fail Compilation issues

Commit Message

Phil Yang March 17, 2020, 1:17 a.m. UTC
  The num_mapped_cores and execute_lock variables are synchronized with the
rte_atomic_XX APIs, which map to a full barrier (DMB) on aarch64. This patch
optimizes them with C11 atomic one-way barriers.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 15 deletions(-)
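
As background, a minimal standalone sketch (not taken from the patch) of the two idioms being swapped: the legacy rte_atomic32_* helpers imply a full barrier on aarch64, while the GCC __atomic builtins let each access choose acquire, release, or relaxed ordering.

#include <stdint.h>
#include <rte_atomic.h>

static rte_atomic32_t cnt_legacy;
static uint32_t cnt_c11;

static void
counter_demo(void)
{
	/* Legacy helpers: compile to a full barrier (DMB) on aarch64. */
	rte_atomic32_inc(&cnt_legacy);
	rte_atomic32_dec(&cnt_legacy);

	/* C11-style builtins: one-way barriers. The ACQUIRE increment is
	 * meant to pair with a RELEASE decrement/store elsewhere, and a
	 * RELAXED load is enough for a plain read of the counter.
	 */
	__atomic_add_fetch(&cnt_c11, 1, __ATOMIC_ACQUIRE);
	__atomic_sub_fetch(&cnt_c11, 1, __ATOMIC_RELEASE);
	(void)__atomic_load_n(&cnt_c11, __ATOMIC_RELAXED);
}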
  

Comments

Van Haaren, Harry April 3, 2020, 11:58 a.m. UTC | #1
> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com
> Subject: [PATCH v3 11/12] service: optimize with c11 one-way barrier
> 
> The num_mapped_cores and execute_lock are synchronized with rte_atomic_XX
> APIs which is a full barrier, DMB, on aarch64. This patch optimized it with
> c11 atomic one-way barrier.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Based on discussion on-list, it seems the consensus is to not use
GCC builtins, but instead use C11 APIs "proper"? If my conclusion is
correct, the v+1 of this patchset would require updates to that style API.

Inline comments for context below, -Harry 


> ---
>  lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++----------
> -
>  1 file changed, 35 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 0843c3c..c033224 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -42,7 +42,7 @@ struct rte_service_spec_impl {
>  	 * running this service callback. When not set, a core may take the
>  	 * lock and then run the service callback.
>  	 */
> -	rte_atomic32_t execute_lock;
> +	uint32_t execute_lock;
> 
>  	/* API set/get-able variables */
>  	int8_t app_runstate;
> @@ -54,7 +54,7 @@ struct rte_service_spec_impl {
>  	 * It does not indicate the number of cores the service is running
>  	 * on currently.
>  	 */
> -	rte_atomic32_t num_mapped_cores;
> +	int32_t num_mapped_cores;

Any reason why "int32_t" or "uint32_t" is used over the other?
execute_lock is a uint32_t above, num_mapped_cores is an int32_t?


>  	uint64_t calls;
>  	uint64_t cycles_spent;
>  } __rte_cache_aligned;
> @@ -332,7 +332,8 @@ rte_service_runstate_get(uint32_t id)
>  	rte_smp_rmb();
> 
>  	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
> -	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
> +	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
> +					    __ATOMIC_RELAXED) > 0);
> 
>  	return (s->app_runstate == RUNSTATE_RUNNING) &&
>  		(s->comp_runstate == RUNSTATE_RUNNING) &&
> @@ -375,11 +376,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t
> service_mask,
>  	cs->service_active_on_lcore[i] = 1;
> 
>  	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
> -		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
> +		uint32_t expected = 0;
> +		/* ACQUIRE ordering here is to prevent the callback
> +		 * function from hoisting up before the execute_lock
> +		 * setting.
> +		 */
> +		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
> +			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
>  			return -EBUSY;

Let's try to improve the magic "1" and "0" constants. I believe the "1" here
is the desired "new value on success", and the 0 is "bool weak", where our 0/false constant selects the strong compare-exchange variant?

"Weak is true for weak compare_exchange, which may fail spuriously, and false for the strong variation, which never fails spuriously.", from https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

const uint32_t on_success_value = 1;
const bool weak = 0;
__atomic_compare_exchange_n(&s->execute_lock, &expected, on_success_value, weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);


Although a bit more verbose, I feel this documents usage a lot better,
particularly for those who aren't as familiar with the C11 function
argument order.

Admittedly with the API change to not use __builtins, perhaps this
comment is moot.


> 
>  		service_runner_do_callback(s, cs, i);
> -		rte_atomic32_clear(&s->execute_lock);
> +		/* RELEASE ordering here is used to pair with ACQUIRE
> +		 * above to achieve lock semantic.
> +		 */
> +		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
>  	} else
>  		service_runner_do_callback(s, cs, i);
> 
> @@ -415,11 +425,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t
> serialize_mt_unsafe)
>  	/* Increment num_mapped_cores to indicate that the service
>  	 * is running on a core.
>  	 */
> -	rte_atomic32_inc(&s->num_mapped_cores);
> +	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
> 
>  	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
> 
> -	rte_atomic32_dec(&s->num_mapped_cores);
> +	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
> 
>  	return ret;
>  }
> @@ -552,24 +562,32 @@ service_update(uint32_t sid, uint32_t lcore,
> 
>  	uint64_t sid_mask = UINT64_C(1) << sid;
>  	if (set) {
> -		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
> -			sid_mask;
> +		/* When multiple threads try to update the same lcore
> +		 * service concurrently, e.g. set lcore map followed
> +		 * by clear lcore map, the unsynchronized service_mask
> +		 * values have issues on the num_mapped_cores value
> +		 * consistency. So we use ACQUIRE ordering to pair with
> +		 * the RELEASE ordering to synchronize the service_mask.
> +		 */
> +		uint64_t lcore_mapped = __atomic_load_n(
> +					&lcore_states[lcore].service_mask,
> +					__ATOMIC_ACQUIRE) & sid_mask;

Thanks for the comment - it helps me understand things a bit better.
Some questions/theories to validate;
1) The service_mask ACQUIRE avoids other loads being hoisted above it, correct?

2) There are non-atomic stores to service_mask. Is it correct that the stores themselves aren't the issue, but relative visibility of service_mask stores vs num_mapped_cores? (Detail in (3) below)


>  		if (*set && !lcore_mapped) {
>  			lcore_states[lcore].service_mask |= sid_mask;
> -			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
> +			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
> +					    1, __ATOMIC_RELEASE);
>  		}
>  		if (!*set && lcore_mapped) {
>  			lcore_states[lcore].service_mask &= ~(sid_mask);
> -			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
> +			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
> +					    1, __ATOMIC_RELEASE);
>  		}

3) Here we update the core-local service_mask, and then update the
num_mapped_cores with an ATOMIC_RELEASE. The RELEASE here ensures
that the previous store to service_mask is guaranteed to be visible
on all cores if this store is visible. Why do we care about this property?
The service_mask is core-local anyway.

4) Even with the load-ACQ of service_mask and the REL store of num_mapped_cores, is there not still a possible race condition where 2 lcores simultaneously load-ACQ the service_mask, and then both do an atomic add/sub_fetch with REL?

5) Assuming the race in 4 above is real, it raises the real question - the service-cores control APIs are not designed to be multi-thread-safe. Orchestration of service/lcore mappings is not meant to be done by multiple threads at the same time. Documenting this loudly may help; I'm happy to send a patch to do so if we agree on the above?
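
To make (4) and (5) concrete, a hypothetical sketch of the orchestration pattern in question - two control-plane threads driving the same mapping at once (the pthread wrappers and the service_id/lcore_id globals are illustrative only, not from the patch):

#include <pthread.h>
#include <rte_service.h>

static uint32_t service_id;	/* assumed: a registered service id */
static uint32_t lcore_id;	/* assumed: an added service lcore */

/* Both threads end up in service_update() concurrently. Each may read the
 * same service_mask value before the other's store is visible, so the
 * num_mapped_cores increment/decrement can disagree with the final mask.
 */
static void *map_thread(void *arg)
{
	(void)arg;
	rte_service_map_lcore_set(service_id, lcore_id, 1);	/* set */
	return NULL;
}

static void *unmap_thread(void *arg)
{
	(void)arg;
	rte_service_map_lcore_set(service_id, lcore_id, 0);	/* clear */
	return NULL;
}

Running map_thread and unmap_thread at the same time is exactly the kind of orchestration that would need to be documented as unsupported.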




>  	}
> 
>  	if (enabled)
>  		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
> 
> -	rte_smp_wmb();
> -
>  	return 0;
>  }
> 
> @@ -625,7 +643,8 @@ rte_service_lcore_reset_all(void)
>  		}
>  	}
>  	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
> -		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
> +		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
> +				    __ATOMIC_RELAXED);
> 
>  	rte_smp_wmb();
> 
> @@ -708,7 +727,8 @@ rte_service_lcore_stop(uint32_t lcore)
>  		int32_t enabled = service_mask & (UINT64_C(1) << i);
>  		int32_t service_running = rte_service_runstate_get(i);
>  		int32_t only_core = (1 ==
> -			rte_atomic32_read(&rte_services[i].num_mapped_cores));
> +			__atomic_load_n(&rte_services[i].num_mapped_cores,
> +					__ATOMIC_RELAXED));
> 
>  		/* if the core is mapped, and the service is running, and this
>  		 * is the only core that is mapped, the service would cease to
> --
> 2.7.4
  
Honnappa Nagarahalli April 6, 2020, 4:22 a.m. UTC | #2
<snip>

> > Subject: [PATCH v3 11/12] service: optimize with c11 one-way barrier
> >
> > The num_mapped_cores and execute_lock are synchronized with
> > rte_atomic_XX APIs which is a full barrier, DMB, on aarch64. This
> > patch optimized it with
> > c11 atomic one-way barrier.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Based on discussion on-list, it seems the consensus is to not use GCC builtins,
> but instead use C11 APIs "proper"? If my conclusion is correct, the v+1 of this
> patchset would require updates to that style API.
> 
> Inline comments for context below, -Harry
> 
> 
> > ---
> >  lib/librte_eal/common/rte_service.c | 50
> > ++++++++++++++++++++++++++----------
> > -
> >  1 file changed, 35 insertions(+), 15 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index 0843c3c..c033224 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -42,7 +42,7 @@ struct rte_service_spec_impl {
> >  	 * running this service callback. When not set, a core may take the
> >  	 * lock and then run the service callback.
> >  	 */
> > -	rte_atomic32_t execute_lock;
> > +	uint32_t execute_lock;
> >
> >  	/* API set/get-able variables */
> >  	int8_t app_runstate;
> > @@ -54,7 +54,7 @@ struct rte_service_spec_impl {
> >  	 * It does not indicate the number of cores the service is running
> >  	 * on currently.
> >  	 */
> > -	rte_atomic32_t num_mapped_cores;
> > +	int32_t num_mapped_cores;
> 
> Any reason why "int32_t" or "uint32_t" is used over another?
> execute_lock is a uint32_t above, num_mapped_cores is an int32_t?
> 
> 
> >  	uint64_t calls;
> >  	uint64_t cycles_spent;
> >  } __rte_cache_aligned;
> > @@ -332,7 +332,8 @@ rte_service_runstate_get(uint32_t id)
> >  	rte_smp_rmb();
> >
> >  	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
> > -	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
> > +	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
> > +					    __ATOMIC_RELAXED) > 0);
> >
> >  	return (s->app_runstate == RUNSTATE_RUNNING) &&
> >  		(s->comp_runstate == RUNSTATE_RUNNING) &&
> > @@ -375,11 +376,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
> >  	cs->service_active_on_lcore[i] = 1;
> >
> >  	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
> > -		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
> > +		uint32_t expected = 0;
> > +		/* ACQUIRE ordering here is to prevent the callback
> > +		 * function from hoisting up before the execute_lock
> > +		 * setting.
> > +		 */
> > +		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
> > +			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> >  			return -EBUSY;
> 
> Let's try improve the magic "1" and "0" constants, I believe the "1" here is the
> desired "new value on success", and the 0 is "bool weak", where our 0/false
> constant implies a strongly ordered compare exchange?
> 
> "Weak is true for weak compare_exchange, which may fail spuriously, and
> false for the strong variation, which never fails spuriously.", from
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> 
> const uint32_t on_success_value = 1;
> const bool weak = 0;
> __atomic_compare_exchange_n(&s->execute_lock, &expected,
> on_success_value, weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> 
> 
> Although a bit more verbose, I feel this documents usage a lot better,
> particularly for those who aren't as familiar with the C11 function arguments
> order.
> 
> Admittedly with the API change to not use __builtins, perhaps this comment is
> moot.
Suggest changing execute_lock to rte_spinlock_t and using the rte_spinlock_trylock API.
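
For illustration, a rough sketch of that suggestion, assuming a reduced impl struct and callback signature (this is not the code in this patch):

#include <errno.h>
#include <rte_spinlock.h>

/* Hypothetical reduced impl struct; execute_lock becomes a spinlock and
 * would be initialized with rte_spinlock_init() at registration time.
 */
struct svc_impl {
	rte_spinlock_t execute_lock;
};

static int
run_mt_unsafe(struct svc_impl *s, void (*cb)(void *), void *arg)
{
	/* rte_spinlock_trylock() returns 1 on success and 0 if the lock is
	 * already held; acquire/release ordering is provided by the lock.
	 */
	if (!rte_spinlock_trylock(&s->execute_lock))
		return -EBUSY;

	cb(arg);
	rte_spinlock_unlock(&s->execute_lock);
	return 0;
}

The spinlock also removes the need to reason about the explicit ACQUIRE/RELEASE pairing at each call site.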

> 
> 
> >
> >  		service_runner_do_callback(s, cs, i);
> > -		rte_atomic32_clear(&s->execute_lock);
> > +		/* RELEASE ordering here is used to pair with ACQUIRE
> > +		 * above to achieve lock semantic.
> > +		 */
> > +		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
> >  	} else
> >  		service_runner_do_callback(s, cs, i);
> >
> > @@ -415,11 +425,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id,
> > uint32_t
> > serialize_mt_unsafe)
> >  	/* Increment num_mapped_cores to indicate that the service
> >  	 * is running on a core.
> >  	 */
> > -	rte_atomic32_inc(&s->num_mapped_cores);
> > +	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
> >
> >  	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
> >
> > -	rte_atomic32_dec(&s->num_mapped_cores);
> > +	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
> >
> >  	return ret;
> >  }
> > @@ -552,24 +562,32 @@ service_update(uint32_t sid, uint32_t lcore,
> >
> >  	uint64_t sid_mask = UINT64_C(1) << sid;
> >  	if (set) {
> > -		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
> > -			sid_mask;
> > +		/* When multiple threads try to update the same lcore
> > +		 * service concurrently, e.g. set lcore map followed
> > +		 * by clear lcore map, the unsynchronized service_mask
> > +		 * values have issues on the num_mapped_cores value
> > +		 * consistency. So we use ACQUIRE ordering to pair with
> > +		 * the RELEASE ordering to synchronize the service_mask.
> > +		 */
> > +		uint64_t lcore_mapped = __atomic_load_n(
> > +					&lcore_states[lcore].service_mask,
> > +					__ATOMIC_ACQUIRE) & sid_mask;
> 
> Thanks for the comment - it helps me understand things a bit better.
> Some questions/theories to validate;
> 1) The service_mask ACQUIRE avoids other loads being hoisted above it,
> correct?
> 
> 2) There are non-atomic stores to service_mask. Is it correct that the stores
> themselves aren't the issue, but relative visibility of service_mask stores vs
> num_mapped_cores? (Detail in (3) below)
> 
> 
> >  		if (*set && !lcore_mapped) {
> >  			lcore_states[lcore].service_mask |= sid_mask;
> > -
> 	rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
> > +
> 	__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
> > +					    1, __ATOMIC_RELEASE);
> >  		}
> >  		if (!*set && lcore_mapped) {
> >  			lcore_states[lcore].service_mask &= ~(sid_mask);
> > -			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
> > +			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
> > +					    1, __ATOMIC_RELEASE);
> >  		}
> 
> 3) Here we update the core-local service_mask, and then update the
> num_mapped_cores with an ATOMIC_RELEASE. The RELEASE here ensures
> that the previous store to service_mask is guaranteed to be visible on all cores
> if this store is visible. Why do we care about this property?
> The service_mask is core local anway.
We are working on concurrency between the reader and the writer. The service_mask is local to the core, but it is accessed by both a reader and a writer.
I think we should wait to conclude on the meaning of 'num_mapped_cores'; that will dictate what the ordering should be. For example, if it is just for statistics purposes, then we could use RELAXED memory order, and then the ordering for service_mask would also change.
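
As a side note, if num_mapped_cores turns out to be statistics-only, a minimal sketch of what the updates could relax to (assuming the counter then orders nothing else):

#include <stdint.h>

static uint32_t num_mapped_cores;	/* hypothetical stats-only counter */

/* RELAXED is sufficient when the value is only an eventually consistent
 * count and is not used to order other memory accesses.
 */
static inline void stats_map(void)
{
	__atomic_add_fetch(&num_mapped_cores, 1, __ATOMIC_RELAXED);
}

static inline void stats_unmap(void)
{
	__atomic_sub_fetch(&num_mapped_cores, 1, __ATOMIC_RELAXED);
}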

> 
> 4) Even with the load ACQ service_mask, and REL num_mapped_cores store,
> is there not still a race-condition possible where 2 lcores simultaneously load-
> ACQ the service_mask, and then both do atomic add/sub_fetch with REL?
> 
> 5) Assuming 4 above race is true, it raises the real question - the service-cores
> control APIs are not designed to be multi-thread-safe. Orchestration of
> service/lcore mappings is not meant to be done by multiple threads at the
> same time. Documenting this loudly may help, I'm happy to send a patch to do
> so if we're agreed on the above?
I completely agree here. Writer-writer concurrency is another topic, and we should (for now at least) say that the control plane APIs are not thread-safe.

> 
> 
> 
> 
> >  	}
> >
> >  	if (enabled)
> >  		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
> >
> > -	rte_smp_wmb();
> > -
> >  	return 0;
> >  }
> >
> > @@ -625,7 +643,8 @@ rte_service_lcore_reset_all(void)
> >  		}
> >  	}
> >  	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
> > -		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
> > +		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
> > +				    __ATOMIC_RELAXED);
> >
> >  	rte_smp_wmb();
> >
> > @@ -708,7 +727,8 @@ rte_service_lcore_stop(uint32_t lcore)
> >  		int32_t enabled = service_mask & (UINT64_C(1) << i);
> >  		int32_t service_running = rte_service_runstate_get(i);
> >  		int32_t only_core = (1 ==
> > -			rte_atomic32_read(&rte_services[i].num_mapped_cores));
> > +			__atomic_load_n(&rte_services[i].num_mapped_cores,
> > +					__ATOMIC_RELAXED));
> >
> >  		/* if the core is mapped, and the service is running, and this
> >  		 * is the only core that is mapped, the service would cease to
> > --
> > 2.7.4
  
Phil Yang April 8, 2020, 10:15 a.m. UTC | #3
> -----Original Message-----
> From: Van Haaren, Harry <harry.van.haaren@intel.com>
> Sent: Friday, April 3, 2020 7:58 PM
> To: Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev,
> Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com;
> dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu <Gavin.Hu@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 11/12] service: optimize with c11 one-way barrier
> 
> > -----Original Message-----
> > From: Phil Yang <phil.yang@arm.com>
> > Sent: Tuesday, March 17, 2020 1:18 AM
> > To: thomas@monjalon.net; Van Haaren, Harry
> <harry.van.haaren@intel.com>;
> > Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > stephen@networkplumber.org; maxime.coquelin@redhat.com;
> dev@dpdk.org
> > Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com;
> > Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com;
> ruifeng.wang@arm.com;
> > joyce.kong@arm.com; nd@arm.com
> > Subject: [PATCH v3 11/12] service: optimize with c11 one-way barrier
> >
> > The num_mapped_cores and execute_lock are synchronized with
> rte_atomic_XX
> > APIs which is a full barrier, DMB, on aarch64. This patch optimized it with
> > c11 atomic one-way barrier.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Based on discussion on-list, it seems the consensus is to not use
> GCC builtins, but instead use C11 APIs "proper"? If my conclusion is
> correct, the v+1 of this patchset would require updates to that style API.
> 
> Inline comments for context below, -Harry
> 
> 
> > ---
> >  lib/librte_eal/common/rte_service.c | 50
> ++++++++++++++++++++++++++----------
> > -
> >  1 file changed, 35 insertions(+), 15 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index 0843c3c..c033224 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -42,7 +42,7 @@ struct rte_service_spec_impl {
> >  	 * running this service callback. When not set, a core may take the
> >  	 * lock and then run the service callback.
> >  	 */
> > -	rte_atomic32_t execute_lock;
> > +	uint32_t execute_lock;
> >
> >  	/* API set/get-able variables */
> >  	int8_t app_runstate;
> > @@ -54,7 +54,7 @@ struct rte_service_spec_impl {
> >  	 * It does not indicate the number of cores the service is running
> >  	 * on currently.
> >  	 */
> > -	rte_atomic32_t num_mapped_cores;
> > +	int32_t num_mapped_cores;
> 
> Any reason why "int32_t" or "uint32_t" is used over another?
> execute_lock is a uint32_t above, num_mapped_cores is an int32_t?

It should be uint32_t for num_mapped_cores. 
This value will not go negative after the __atomic_sub_fetch operation, because the sequence of writer and reader accesses is guaranteed by the memory ordering.
I will update it in the next version.

Thanks,
Phil

<snip>
  

Patch

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 0843c3c..c033224 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -42,7 +42,7 @@  struct rte_service_spec_impl {
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	uint32_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +54,7 @@  struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	int32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -332,7 +332,8 @@  rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -375,11 +376,20 @@  service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		uint32_t expected = 0;
+		/* ACQUIRE ordering here is to prevent the callback
+		 * function from hoisting up before the execute_lock
+		 * setting.
+		 */
+		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
+			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		/* RELEASE ordering here is used to pair with ACQUIRE
+		 * above to achieve lock semantic.
+		 */
+		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -415,11 +425,11 @@  rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to indicate that the service
 	 * is running on a core.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
 
 	return ret;
 }
@@ -552,24 +562,32 @@  service_update(uint32_t sid, uint32_t lcore,
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		/* When multiple threads try to update the same lcore
+		 * service concurrently, e.g. set lcore map followed
+		 * by clear lcore map, the unsynchronized service_mask
+		 * values have issues on the num_mapped_cores value
+		 * consistency. So we use ACQUIRE ordering to pair with
+		 * the RELEASE ordering to synchronize the service_mask.
+		 */
+		uint64_t lcore_mapped = __atomic_load_n(
+					&lcore_states[lcore].service_mask,
+					__ATOMIC_ACQUIRE) & sid_mask;
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -625,7 +643,8 @@  rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -708,7 +727,8 @@  rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to