[v3,09/12] service: avoid race condition for MT unsafe service

Message ID 1584407863-774-10-git-send-email-phil.yang@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series: generic rte atomic APIs deprecate proposal

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation fail Compilation issues

Commit Message

Phil Yang March 17, 2020, 1:17 a.m. UTC
  From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

It is possible that a MT unsafe service gets configured to run on
another core while the service is currently running. This might
result in the MT unsafe service running on multiple cores
simultaneously. Always use the 'execute_lock' when the service is
MT unsafe.

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
  

Comments

Van Haaren, Harry April 3, 2020, 11:58 a.m. UTC | #1
> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com; Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>; stable@dpdk.org
> Subject: [PATCH v3 09/12] service: avoid race condition for MT unsafe service
> 
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> It is possible that a MT unsafe service gets configured to run on
> another core while the service is currently running. This might
> result in the MT unsafe service running on multiple cores
> simultaneously. Always use the 'execute_lock' when the service is
> MT unsafe.
> 
> Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>

We should put "fix" in the title, once converged on an implementation.

Regarding Fixes and stable backport, we should consider whether
fixing this in stable with a performance degradation, fixing it with
a more complex solution, or documenting it as a known issue is the
better solution.


This fix (always taking the atomic lock) will have a negative performance
impact on existing code using services. We should investigate a way
to fix it without causing datapath performance degradation.

I think there is a way to achieve this by moving more checks/time
to the control path (lcore updating the map), and not forcing the
datapath lcore to always take an atomic.

In this particular case, we have a counter for the number of iterations
that a service has done. If this increments we know that the lcore
running the service has re-entered the critical section, so it would
see an updated "needs atomic" flag.

This approach may introduce a predictable branch on the datapath,
however the cost of a predictable branch vs always taking an atomic
is order(s?) of magnitude, so a branch is much preferred.

It must be possible to avoid the datapath overhead using a scheme
like this. It will likely be more complex than your proposed change
below, however if it avoids datapath performance drops I feel that
a more complex solution is worth investigating at least.

A unit test is required to validate a fix like this - although the race
was perhaps found by inspection/review, a real-world test to validate
the fix would give confidence.


Thoughts on such an approach?



> ---
>  lib/librte_eal/common/rte_service.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 557b5a9..32a2f8a 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -50,6 +50,10 @@ struct rte_service_spec_impl {
>  	uint8_t internal_flags;
> 
>  	/* per service statistics */
> +	/* Indicates how many cores the service is mapped to run on.
> +	 * It does not indicate the number of cores the service is running
> +	 * on currently.
> +	 */
>  	rte_atomic32_t num_mapped_cores;
>  	uint64_t calls;
>  	uint64_t cycles_spent;
> @@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t
> service_mask,
> 
>  	cs->service_active_on_lcore[i] = 1;
> 
> -	/* check do we need cmpset, if MT safe or <= 1 core
> -	 * mapped, atomic ops are not required.
> -	 */
> -	const int use_atomics = (service_mt_safe(s) == 0) &&
> -				(rte_atomic32_read(&s->num_mapped_cores) > 1);
> -	if (use_atomics) {
> +	if (service_mt_safe(s) == 0) {
>  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
>  			return -EBUSY;
> 
> --
> 2.7.4
  
Honnappa Nagarahalli April 4, 2020, 6:03 p.m. UTC | #2
<snip>

> > Subject: [PATCH v3 09/12] service: avoid race condition for MT unsafe
> > service
> >
> > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> > It is possible that a MT unsafe service gets configured to run on
> > another core while the service is currently running. This might
> > result in the MT unsafe service running on multiple cores
> > simultaneously. Always use the 'execute_lock' when the service is MT
> > unsafe.
> >
> > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> 
> We should put "fix" in the title, once converged on an implementation.
Ok, will replace 'avoid' with 'fix' (once we agree on the solution)

> 
> Regarding Fixes and stable backport, we should consider whether fixing this
> in stable with a performance degradation, fixing it with a more complex
> solution, or documenting it as a known issue is the better solution.
> 
> 
> This fix (always taking the atomic lock) will have a negative performance
> impact on existing code using services. We should investigate a way to fix it
> without causing datapath performance degradation.
Trying to gauge the impact on existing applications...
The documentation does not explicitly disallow run-time mapping of cores to a service.
1) If the applications are mapping the cores to services at run time, they are running with a bug. IMO, a bug fix resulting in a performance drop should be acceptable.
2) If the service is configured to run on a single core (num_mapped_cores == 1), but the service is set to MT unsafe - this will have a (possible) performance impact.
	a) This can be solved by setting the service to MT safe, which can be documented. This might be a reasonable solution for applications which are compiled with future DPDK releases.
	b) We can also solve this using symbol versioning (see the sketch after this list) - the old version of this function will use the old code, the new version of this function will use the code in this patch. So, if the application is run with future DPDK releases without recompiling, it will continue to use the old version. If the application is compiled with future releases, it can use the solution in 2a. We also should consider whether this is an appropriate solution, as it would force 1) to recompile to get the fix.
3) If the service is configured to run on multiple cores (num_mapped_cores > 1), then for those applications the lock is being taken already. These applications might see some improvement, as this patch removes a few instructions.
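
A minimal sketch of the symbol-versioning idea in 2b, assuming the macros in rte_function_versioning.h; the _v20/_v21 suffixes and version nodes are illustrative, and a matching entry would be needed in the library's .map file:

#include <rte_function_versioning.h>

/* old behaviour, kept for already-built binaries: no lock taken
 * when a single core is mapped */
int32_t
rte_service_run_iter_on_app_lcore_v20(uint32_t id,
				      uint32_t serialize_mt_unsafe);
VERSION_SYMBOL(rte_service_run_iter_on_app_lcore, _v20, 20.0);

/* new behaviour from this patch: always take execute_lock for a MT
 * unsafe service; becomes the default for newly built applications */
int32_t
rte_service_run_iter_on_app_lcore_v21(uint32_t id,
				      uint32_t serialize_mt_unsafe);
BIND_DEFAULT_SYMBOL(rte_service_run_iter_on_app_lcore, _v21, 21);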

> 
> I think there is a way to achieve this by moving more checks/time to the
> control path (lcore updating the map), and not forcing the datapath lcore to
> always take an atomic.
I think 2a above is the solution.

> 
> In this particular case, we have a counter for number of iterations that a
Which counter are you thinking about?
None of the counters I checked are updated atomically today. If we are going to use counters, they have to be atomic, which means additional cycles in the data path.

> service has done. If this increments we know that the lcore running the
> service has re-entered the critical section, so would see an updated "needs
> atomic" flag.
> 
> This approach may introduce a predictable branch on the datapath, however
> the cost of a predictable branch vs always taking an atomic is order(s?) of
> magnitude, so a branch is much preferred.
> 
> It must be possible to avoid the datapath overhead using a scheme like this. It
> will likely be more complex than your proposed change below, however if it
> avoids datapath performance drops I feel that a more complex solution is
> worth investigating at least.
I do not completely understand the approach you are proposing, maybe you can elaborate more. But, it seems to be based on a counter approach.

Following is my assessment of what happens if we use a counter. Let us say we kept track of how many cores are running the service currently. We need an atomic counter other than 'num_mapped_cores'. Let us call that counter 'num_current_cores'. The code to call the service would look like below.

1) rte_atomic32_inc(&num_current_cores); /* this results in a full memory barrier */
2) if (__atomic_load_n(&num_current_cores, __ATOMIC_ACQUIRE) == 1) { /* rte_atomic32_read is not enough here as it does not provide the required memory barrier for any architecture */
3) 	run_service(); /* Call the service */
4) }
5) rte_atomic32_dec(&num_current_cores); /* Calling rte_atomic32_clear is not enough as it is not an atomic operation and does not provide the required memory barrier */

But the above code has race conditions in lines 1 and 2. It is possible that none of the cores will ever get to run the service as they all could simultaneously increment the counter. Hence lines 1 and 2 together need to be atomic, which is nothing but 'compare-exchange' operation.

BTW, the current code has a bug where it calls 'rte_atomic32_clear(&s->execute_lock)'; it is missing memory barriers, which can result in the execute_lock being observed as cleared before the service has completed running. I suggest changing the 'execute_lock' to rte_spinlock_t and using the rte_spinlock_trylock and rte_spinlock_unlock APIs.
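
A minimal sketch of that suggestion, assuming only the type of 'execute_lock' in struct rte_service_spec_impl changes to rte_spinlock_t (stats handling omitted); rte_spinlock_trylock() is exactly the compare-exchange needed, and rte_spinlock_unlock() provides the release ordering that rte_atomic32_clear() lacks:

#include <rte_spinlock.h>

	if (service_mt_safe(s) == 0) {
		if (!rte_spinlock_trylock(&s->execute_lock))
			return -EBUSY; /* skip; do not wait for the lock */
		s->spec.callback(s->spec.callback_userdata);
		rte_spinlock_unlock(&s->execute_lock);
	} else
		s->spec.callback(s->spec.callback_userdata);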

> 
> A unit test is required to validate a fix like this - although perhaps found by
> inspection/review, a real-world test to validate would give confidence.
Agree, need to have a test case.

> 
> 
> Thoughts on such an approach?
> 
> 
> 
> > ---
> >  lib/librte_eal/common/rte_service.c | 11 +++++------
> >  1 file changed, 5 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index 557b5a9..32a2f8a 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -50,6 +50,10 @@ struct rte_service_spec_impl {
> >  	uint8_t internal_flags;
> >
> >  	/* per service statistics */
> > +	/* Indicates how many cores the service is mapped to run on.
> > +	 * It does not indicate the number of cores the service is running
> > +	 * on currently.
> > +	 */
> >  	rte_atomic32_t num_mapped_cores;
> >  	uint64_t calls;
> >  	uint64_t cycles_spent;
> > @@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs,
> > uint64_t service_mask,
> >
> >  	cs->service_active_on_lcore[i] = 1;
> >
> > -	/* check do we need cmpset, if MT safe or <= 1 core
> > -	 * mapped, atomic ops are not required.
> > -	 */
> > -	const int use_atomics = (service_mt_safe(s) == 0) &&
> > -				(rte_atomic32_read(&s-
> >num_mapped_cores) > 1);
> > -	if (use_atomics) {
> > +	if (service_mt_safe(s) == 0) {
> >  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
> >  			return -EBUSY;
> >
> > --
> > 2.7.4
  
Van Haaren, Harry April 8, 2020, 6:05 p.m. UTC | #3
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Saturday, April 4, 2020 7:03 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>; stable@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 09/12] service: avoid race condition for MT unsafe
> service
> 
> <snip>
> 
> > > Subject: [PATCH v3 09/12] service: avoid race condition for MT unsafe
> > > service
> > >
> > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >
> > > It is possible that a MT unsafe service gets configured to run on
> > > another core while the service is currently running. This might
> > > result in the MT unsafe service running on multiple cores
> > > simultaneously. Always use the 'execute_lock' when the service is MT
> > > unsafe.
> > >
> > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> >
> > We should put "fix" in the title, once converged on an implementation.
> Ok, will replace 'avoid' with 'fix' (once we agree on the solution)
> 
> >
> > Regarding Fixes and stable backport, we should consider whether fixing
> > this in stable with a performance degradation, fixing it with a more
> > complex solution, or documenting it as a known issue is the better
> > solution.
> >
> > This fix (always taking the atomic lock) will have a negative performance
> > impact on existing code using services. We should investigate a way to fix
> it
> > without causing datapath performance degradation.
> Trying to gauge the impact on the existing applications...
> The documentation does not explicitly disallow run time mapping of cores to
> service.
> 1) If the applications are mapping the cores to services at run time, they are
> running with a bug. IMO, bug fix resulting in a performance drop should be
> acceptable.
> 2) If the service is configured to run on single core (num_mapped_cores == 1),
> but service is set to MT unsafe - this will have a (possible) performance
> impact.
> 	a) This can be solved by setting the service to MT safe and can be
> documented. This might be a reasonable solution for applications which are
> compiling with
>                    future DPDK releases.
> 	b) We can also solve this using symbol versioning - the old version of
> this function will use the old code, the new version of this function will use
> the code in
>                    this patch. So, if the application is run with future DPDK
> releases without recompiling, it will continue to use the old version. If the
> application is compiled
>                    with future releases, they can use solution in 2a. We also
> should think if this is an appropriate solution as this would force 1) to
> recompile to get the fix.
> 3) If the service is configured to run on multiple cores (num_mapped_cores >
> 1), then for those applications, the lock is being taken already. These
> applications might see some improvements as this patch removes few
> instructions.
>
> >
> > I think there is a way to achieve this by moving more checks/time to the
> > control path (lcore updating the map), and not forcing the datapath lcore to
> > always take an atomic.
> I think 2a above is the solution.

2a above is e.g. the Eventdev SW routines like Rx/Tx scheduler services. 
We should strive to not reduce datapath performance at all here.


> > In this particular case, we have a counter for number of iterations that a
> Which counter are you thinking about?
> All the counters I checked are not atomic operations currently. If we are
> going to use counters they have to be atomic, which means additional cycles in
> the data path.

I'll try to explain the concept better, take this example:
 - One service core is mapped to a MT_UNSAFE service, like event/sw pmd
 - Application wants to map a 2nd lcore to the same service
 - You point out that today there is a race over the lock
    -- the lock is not taken if (num_mapped_lcores == 1)
    -- this avoids an atomic lock/unlock on the datapath

To achieve our desired goal;
 - control thread doing mapping performs the following operations
    -- write service->num_mapped_lcores++ (atomic not required, only single-writer allowed by APIs)
    -- MFENCE (full st-ld barrier) to flush write, and force later loads to issue after
    -- read the "calls_per_service" counter for each lcores, add them up.
    ---- Wait :)
    -- re-read the "calls_per_service", and ensure the count has changed.
    ---- The fact that the calls_per_service has changed ensures the service-lcore
         has seen the new "num_mapped_cores" value, and has now taken the lock!
    -- *now* it is safe to map the 2nd lcore to the service

There is a caveat here that the increments to the "calls_per_service" variable
must become globally-observable. To force this immediately would require a
write memory barrier, which could impact datapath performance. Given the service
is now taking a lock, the unlock() thereof would ensure the "calls_per_service"
is flushed to memory.

Note: we could use calls_per_service, or add a new variable to the service struct.
Writes to this do not need to be atomic, as it is either mapped to a single core,
or else there's a lock around it.
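
A rough sketch of the control-path handshake described above; the names
are illustrative (calls_per_service is the per-lcore array in struct
core_state), and lcore discovery plus error handling are omitted:

	/* control thread, about to map a 2nd lcore to a MT_UNSAFE service */
	s->num_mapped_cores++;  /* control plane is the only writer */
	rte_mb();               /* full st-ld barrier (MFENCE on x86) */

	uint64_t seen = cs->calls_per_service[sid];
	while (cs->calls_per_service[sid] == seen)
		rte_pause();    /* once the count moves, the service lcore
				 * has re-entered service_run() and so has
				 * observed the new num_mapped_cores */

	/* *now* it is safe to map the 2nd lcore to the service */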


> > service has done. If this increments we know that the lcore running the
> > service has re-entered the critical section, so would see an updated "needs
> > atomic" flag.
> >
> > This approach may introduce a predictable branch on the datapath, however
> > the cost of a predictable branch vs always taking an atomic is order(s?) of
> > magnitude, so a branch is much preferred.
> >
> > It must be possible to avoid the datapath overhead using a scheme like this.
> It
> > will likely be more complex than your proposed change below, however if it
> > avoids datapath performance drops I feel that a more complex solution is
> > worth investigating at least.

> I do not completely understand the approach you are proposing, may be you can
> elaborate more.

Expanded above, showing a possible solution that does not require additional
atomics on the datapath.


> But, it seems to be based on a counter approach. Following is
> my assessment on what happens if we use a counter. Let us say we kept track of
> how many cores are running the service currently. We need an atomic counter
> other than 'num_mapped_cores'. Let us call that counter 'num_current_cores'.
> The code to call the service would look like below.
> 
> 1) rte_atomic32_inc(&num_current_cores); /* this results in a full memory
> barrier */
> 2) if (__atomic_load_n(&num_current_cores, __ATOMIC_ACQUIRE) == 1) { /*
> rte_atomic_read is not enough here as it does not provide the required memory
> barrier for any architecture */
> 3) 	run_service(); /* Call the service */
> 4) }
> 5) rte_atomic32_sub(&num_current_cores); /* Calling rte_atomic32_clear is not
> enough as it is not an atomic operation and does not provide the required
> memory barrier */
> 
> But the above code has race conditions in lines 1 and 2. It is possible that
> none of the cores will ever get to run the service as they all could
> simultaneously increment the counter. Hence lines 1 and 2 together need to be
> atomic, which is nothing but 'compare-exchange' operation.
> 
> BTW, the current code has a bug where it calls 'rte_atomic_clear(&s-
> >execute_lock)', it is missing memory barriers which results in clearing the
> execute_lock before the service has completed running. I suggest changing the
> 'execute_lock' to rte_spinlock_t and using rte_spinlock_try_lock and
> rte_spinlock_unlock APIs.

I don't think a spinlock is what we want here:

The idea is that a service-lcore can be mapped to multiple services.
If one service is already being run (by another thread), we do not want to
spin here waiting for it to become "free" to run by this thread, it should
continue to the next service that it is mapped to.
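
For context, a simplified excerpt of the runner loop being described
(parameters abbreviated from rte_service.c):

	/* a service lcore walks every service mapped to it; a busy
	 * service is skipped, not waited on */
	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
		if (!(service_mask & (UINT64_C(1) << i)))
			continue;
		if (service_run(i, cs, service_mask) == -EBUSY)
			continue; /* held by another lcore; try next */
	}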


> >
> > A unit test is required to validate a fix like this - although perhaps found
> by
> > inspection/review, a real-world test to validate would give confidence.
> Agree, need to have a test case.
> 
> >
> >
> > Thoughts on such an approach?
> >
<snip patch contents>
  
Honnappa Nagarahalli April 9, 2020, 1:31 a.m. UTC | #4
<snip>

> >
> > > > Subject: [PATCH v3 09/12] service: avoid race condition for MT
> > > > unsafe service
> > > >
> > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > >
> > > > It is possible that a MT unsafe service gets configured
> > > > to run on another core while the service is currently running.
> > > > This might result in the MT unsafe service running on multiple
> > > > cores simultaneously. Always use the 'execute_lock' when the
> > > > service is MT unsafe.
> > > >
> > > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > > > Cc: stable@dpdk.org
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > >
> > > We should put "fix" in the title, once converged on an implementation.
> > Ok, will replace 'avoid' with 'fix' (once we agree on the solution)
> >
> > >
> > > Regarding Fixes and stable backport, we should consider whether
> > > fixing this in stable with a performance degradation, fixing it
> > > with a more complex solution, or documenting it as a known issue
> > > is the better solution.
> > >
> > >
> > > This fix (always taking the atomic lock) will have a negative
> > > performance impact on existing code using services. We should
> > > investigate a way to fix
> > it
> > > without causing datapath performance degradation.
> > Trying to gauge the impact on the existing applications...
> > The documentation does not explicitly disallow run time mapping of
> > cores to service.
> > 1) If the applications are mapping the cores to services at run time,
> > they are running with a bug. IMO, bug fix resulting in a performance
> > drop should be acceptable.
> > 2) If the service is configured to run on single core
> > (num_mapped_cores == 1), but service is set to MT unsafe - this will
> > have a (possible) performance impact.
> > 	a) This can be solved by setting the service to MT safe and can be
> > documented. This might be a reasonable solution for applications which
> > are compiling with
> >                    future DPDK releases.
> > 	b) We can also solve this using symbol versioning - the old version
> > of this function will use the old code, the new version of this
> > function will use the code in
> >                    this patch. So, if the application is run with
> > future DPDK releases without recompiling, it will continue to use the
> > old version. If the application is compiled
> >                    with future releases, they can use solution in 2a.
> > We also should think if this is an appropriate solution as this would
> > force 1) to recompile to get the fix.
> > 3) If the service is configured to run on multiple cores
> > (num_mapped_cores > 1), then for those applications, the lock is being
> > taken already. These applications might see some improvements as this
> > patch removes few instructions.
> >
> > >
> > > I think there is a way to achieve this by moving more checks/time to
> > > the control path (lcore updating the map), and not forcing the
> > > datapath lcore to always take an atomic.
> > I think 2a above is the solution.
> 
> 2a above is e.g. the Eventdev SW routines like Rx/Tx scheduler services.
I scanned through the code briefly.
I see that the Eth RX/TX and Crypto adapters are setting the MT_SAFE capability, so they can be ignored.
The Timer adapter and some others do not set MT_SAFE. It seems like the cores to run on are mapped at run time. But it is not clear to me if a service can get mapped to run on multiple cores. If it can, those are running with the bug.
But, these are all internal to DPDK and can be fixed.
Are there no performance tests in these components that we can run?
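
For reference, a minimal sketch of how a component advertises the MT_SAFE capability mentioned above at registration time (the callback is hypothetical):

static int32_t
example_service_cb(void *userdata)
{
	RTE_SET_USED(userdata);
	/* one iteration of work */
	return 0;
}

	struct rte_service_spec spec = {
		.name = "example",
		.callback = example_service_cb,
		.capabilities = RTE_SERVICE_CAP_MT_SAFE,
	};
	uint32_t sid;
	rte_service_component_register(&spec, &sid);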

> We should strive to not reduce datapath performance at all here.
> 
> 
> > > In this particular case, we have a counter for number of iterations
> > > that a
> > Which counter are you thinking about?
> > All the counters I checked are not atomic operations currently. If we
> > are going to use counters they have to be atomic, which means
> > additional cycles in the data path.
> 
> I'll try to explain the concept better, take this example:
>  - One service core is mapped to a MT_UNSAFE service, like event/sw pmd
>  - Application wants to map a 2nd lcore to the same service
>  - You point out that today there is a race over the lock
>     -- the lock is not taken if (num_mapped_lcores == 1)
>     -- this avoids an atomic lock/unlock on the datapath
> 
> To achieve our desired goal;
>  - control thread doing mapping performs the following operations
>     -- write service->num_mapped_lcores++ (atomic not required, only single-
> writer allowed by APIs)
This has to be atomic because of the rte_service_run_iter_on_app_lcore API. Performance should be fine, as this API is not called frequently. But we need to consider the implications of more than one thread updating num_mapped_cores.

>     -- MFENCE (full st-ld barrier) to flush write, and force later loads to issue
> after
I am not exactly sure what MFENCE on x86 does. On Arm platforms, the full barrier (DMB ISH) just makes sure that memory operations are not re-ordered around it. It does not say anything about when that store is visible to other cores; it will become visible to other cores at some point in time.
But, I do not think we need to be worried about flushing to memory.

>     -- read the "calls_per_service" counter for each lcores, add them up.
This can be trimmed down to the single core the service is mapped to currently, no need to add all the counters.

>     ---- Wait :)
>     -- re-read the "calls_per_service", and ensure the count has changed.
Basically, polling. This causes more traffic on the interconnect between the cores. But might be ok since this API might not be called frequently.

>     ---- The fact that the calls_per_service has changed ensures the service-
> lcore
>          has seen the new "num_mapped_cores" value, and has now taken the
> lock!
>     -- *now* it is safe to map the 2nd lcore to the service
> 
> There is a caveat here that the increments to the "calls_per_service" variable
> must become globally-observable. To force this immediately would require a
> write memory barrier, which could impact datapath performance. Given the
> service is now taking a lock, the unlock() thereof would ensure the
> "calls_per_service"
> is flushed to memory.
If we increment this variable only when the lock is held, we should be fine. We could have a separate variable.

> 
> Note: we could use calls_per_service, or add a new variable to the service
> struct.
> Writes to this do not need to be atomic, as it is either mapped to a single core,
> or else there's a lock around it.
I think it is better to have a separate variable that is updated only when the lock is held.
I do not see any change in API sequence. We do this hand-shake only if the service is running (which is all controlled in the writer thread), correct?

This does not solve the problem with rte_service_run_iter_on_app_lcore getting called on multiple cores concurrently for the same service. 

> 
> 
> > > service has done. If this increments we know that the lcore running
> > > the service has re-entered the critical section, so would see an
> > > updated "needs atomic" flag.
> > >
> > > This approach may introduce a predictable branch on the datapath,
> > > however the cost of a predictable branch vs always taking an atomic
> > > is order(s?) of magnitude, so a branch is much preferred.
> > >
> > > It must be possible to avoid the datapath overhead using a scheme like
> this.
> > It
> > > will likely be more complex than your proposed change below, however
> > > if it avoids datapath performance drops I feel that a more complex
> > > solution is worth investigating at least.
> 
> > I do not completely understand the approach you are proposing, may be
> > you can elaborate more.
> 
> Expanded above, showing a possible solution that does not require additional
> atomics on the datapath.
> 
> 
> > But, it seems to be based on a counter approach. Following is my
> > assessment on what happens if we use a counter. Let us say we kept
> > track of how many cores are running the service currently. We need an
> > atomic counter other than 'num_mapped_cores'. Let us call that counter
> 'num_current_cores'.
> > The code to call the service would look like below.
> >
> > 1) rte_atomic32_inc(&num_current_cores); /* this results in a full
> > memory barrier */
> > 2) if (__atomic_load_n(&num_current_cores, __ATOMIC_ACQUIRE) == 1) {
> > /* rte_atomic_read is not enough here as it does not provide the
> > required memory barrier for any architecture */
> > 3) 	run_service(); /* Call the service */
> > 4) }
> > 5) rte_atomic32_sub(&num_current_cores); /* Calling rte_atomic32_clear
> > is not enough as it is not an atomic operation and does not provide
> > the required memory barrier */
> >
> > But the above code has race conditions in lines 1 and 2. It is
> > possible that none of the cores will ever get to run the service as
> > they all could simultaneously increment the counter. Hence lines 1 and
> > 2 together need to be atomic, which is nothing but 'compare-exchange'
> operation.
> >
> > BTW, the current code has a bug where it calls 'rte_atomic_clear(&s-
> > >execute_lock)', it is missing memory barriers which results in
> > >clearing the
> > execute_lock before the service has completed running. I suggest
> > changing the 'execute_lock' to rte_spinlock_t and using
> > rte_spinlock_try_lock and rte_spinlock_unlock APIs.
> 
> I don't think a spinlock is what we want here:
> 
> The idea is that a service-lcore can be mapped to multiple services.
> If one service is already being run (by another thread), we do not want to spin
> here waiting for it to become "free" to run by this thread, it should continue
> to the next service that it is mapped to.
Agree. I am suggesting using 'rte_spinlock_trylock' (which does not spin), which is nothing but a 'compare-exchange'. Since the API is available, we should make use of it instead of repeating the code.

> 
> 
> > >
> > > A unit test is required to validate a fix like this - although
> > > perhaps found
> > by
> > > inspection/review, a real-world test to validate would give confidence.
> > Agree, need to have a test case.
> >
> > >
> > >
> > > Thoughts on such an approach?
> > >
> <snip patch contents>
  
Van Haaren, Harry April 9, 2020, 4:46 p.m. UTC | #5
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Thursday, April 9, 2020 2:32 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>; stable@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 09/12] service: avoid race condition for MT unsafe
> service
> 
> <snip>
> 
> > >
> > > > > Subject: [PATCH v3 09/12] service: avoid race condition for MT
> > > > > unsafe service
> > > > >
> > > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > >
> > > > > It is possible that a MT unsafe service gets configured
> > > > > to run on another core while the service is currently running.
> > > > > This might result in the MT unsafe service running on multiple
> > > > > cores simultaneously. Always use the 'execute_lock' when the
> > > > > service is MT unsafe.
> > > > >
> > > > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > > > > Cc: stable@dpdk.org
> > > > >
> > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > >
> > > > We should put "fix" in the title, once converged on an implementation.
> > > Ok, will replace 'avoid' with 'fix' (once we agree on the solution)
> > >
> > > >
> > > > Regarding Fixes and stable backport, we should consider whether
> > > > fixing this in stable with a performance degradation, fixing it
> > > > with a more complex solution, or documenting it as a known issue
> > > > is the better solution.
> > > >
> > > >
> > > > This fix (always taking the atomic lock) will have a negative
> > > > performance impact on existing code using services. We should
> > > > investigate a way to fix
> > > it
> > > > without causing datapath performance degradation.
> > > Trying to gauge the impact on the existing applications...
> > > The documentation does not explicitly disallow run time mapping of
> > > cores to service.
> > > 1) If the applications are mapping the cores to services at run time,
> > > they are running with a bug. IMO, bug fix resulting in a performance
> > > drop should be acceptable.
> > > 2) If the service is configured to run on single core
> > > (num_mapped_cores == 1), but service is set to MT unsafe - this will
> > > have a (possible) performance impact.
> > > 	a) This can be solved by setting the service to MT safe and can be
> > > documented. This might be a reasonable solution for applications which
> > > are compiling with
> > >                    future DPDK releases.
> > > 	b) We can also solve this using symbol versioning - the old version
> > > of this function will use the old code, the new version of this
> > > function will use the code in
> > >                    this patch. So, if the application is run with
> > > future DPDK releases without recompiling, it will continue to use the
> > > old version. If the application is compiled
> > >                    with future releases, they can use solution in 2a.
> > > We also should think if this is an appropriate solution as this would
> > > force 1) to recompile to get the fix.
> > > 3) If the service is configured to run on multiple cores
> > > (num_mapped_cores > 1), then for those applications, the lock is being
> > > taken already. These applications might see some improvements as this
> > > patch removes few instructions.
> > >
> > > >
> > > > I think there is a way to achieve this by moving more checks/time to
> > > > the control path (lcore updating the map), and not forcing the
> > > > datapath lcore to always take an atomic.
> > > I think 2a above is the solution.
> >
> > 2a above is e.g. the Eventdev SW routines like Rx/Tx scheduler services.
> I scanned through the code briefly
> I see that Eth RX/TX, Crypto adapters are setting the MT_SAFE capabilities,
> can be ignored.
> Timer adaptor and some others do not set MT_SAFE. Seems like the cores to run
> on are mapped during run time. But it is not clear to me if it can get mapped
> to run on multiple cores. If they are, they are running with the bug.

EAL will map each service to a single lcore. It will "round-robin" if
there are more services than service-lcores to run them on. So agree
that DPDK's default mappings will not suffer this issue.


> But, these are all internal to DPDK and can be fixed.
> Are there no performance tests in these components that we can run?
>
> > We should strive to not reduce datapath performance at all here.
> >
> >
> > > > In this particular case, we have a counter for number of iterations
> > > > that a
> > > Which counter are you thinking about?
> > > All the counters I checked are not atomic operations currently. If we
> > > are going to use counters they have to be atomic, which means
> > > additional cycles in the data path.
> >
> > I'll try to explain the concept better, take this example:
> >  - One service core is mapped to a MT_UNSAFE service, like event/sw pmd
> >  - Application wants to map a 2nd lcore to the same service
> >  - You point out that today there is a race over the lock
> >     -- the lock is not taken if (num_mapped_lcores == 1)
> >     -- this avoids an atomic lock/unlock on the datapath
> >
> > To achieve our desired goal;
> >  - control thread doing mapping performs the following operations
> >     -- write service->num_mapped_lcores++ (atomic not required, only single-
> > writer allowed by APIs)
> This has to be atomic because of rte_service_run_iter_on_app_lcore API.
> Performance should be fine as this API is not called frequently. But need to
> consider the implications of more than one thread updating num_mapped_cores.
> 
> >     -- MFENCE (full st-ld barrier) to flush write, and force later loads to
> issue
> > after
> I am not exactly sure what MFENCE on x86 does. On Arm platforms, the full
> barrier (DMB ISH) just makes sure that memory operations are not re-ordered
> around it. It does not say anything about when that store is visible to other
> cores. It will be visible at some point in time to cores.
> But, I do not think we need to be worried about flushing to memory.
> 
> >     -- read the "calls_per_service" counter for each lcores, add them up.
> This can be trimmed down to the single core the service is mapped to
> currently, no need to add all the counters.

Correct - however that requires figuring out which lcore is running the
service. Anyway, agree - it's an implementation detail as to exactly how
we detect it.

> 
> >     ---- Wait :)
> >     -- re-read the "calls_per_service", and ensure the count has changed.
> Basically, polling. This causes more traffic on the interconnect between the
> cores. But might be ok since this API might not be called frequently.

Agree this will not be called frequently, and that some polling here will
not be a problem.


> >     ---- The fact that the calls_per_service has changed ensures the
> service-
> > lcore
> >          has seen the new "num_mapped_cores" value, and has now taken the
> > lock!
> >     -- *now* it is safe to map the 2nd lcore to the service
> >
> > There is a caveat here that the increments to the "calls_per_service"
> variable
> > must become globally-observable. To force this immediately would require a
> > write memory barrier, which could impact datapath performance. Given the
> > service is now taking a lock, the unlock() thereof would ensure the
> > "calls_per_service"
> > is flushed to memory.
> If we increment this variable only when the lock is held, we should be fine.
> We could have a separate variable.

Sure, if a separate variable is preferred that's fine with me.


> > Note: we could use calls_per_service, or add a new variable to the service
> > struct.
> > Writes to this do not need to be atomic, as it is either mapped to a single
> core,
> > or else there's a lock around it.
> I think it is better to have a separate variable that is updated only when the
> lock is held.
> I do not see any change in API sequence. We do this hand-shake only if the
> service is running (which is all controlled in the writer thread), correct?

Yes this increment can be localized to just the branch when the unlock() occurs,
as that is the only time it could make a difference.
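
A sketch of that localization; 'lock_count' is a hypothetical new field
in rte_service_spec_impl, written only while the lock is held so the
unlock's release ordering publishes it:

	if (!rte_spinlock_trylock(&s->execute_lock))
		return -EBUSY;
	s->spec.callback(s->spec.callback_userdata);
	s->lock_count++; /* plain store: single writer while held */
	rte_spinlock_unlock(&s->execute_lock);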

> This does not solve the problem with rte_service_run_iter_on_app_lcore getting
> called on multiple cores concurrently for the same service.

Agreed. This "on_app_lcore" API was an addition required to enable unit-testing
in a sane way, to run iterations of eg Eventdev PMD.

I am in favor of documenting that the application is responsible to ensure
the service being run on a specific application lcore is not concurrently
running on another application lcore.
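
For reference, the API in question (added by the commit under Fixes:);
serialize_mt_unsafe=1 asks the library to serialize, while 0 would make
the application responsible for avoiding concurrent runs:

	int32_t ret = rte_service_run_iter_on_app_lcore(sid, 1);
	if (ret == -EBUSY) {
		/* the service is running on another lcore right now */
	}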


> > > > service has done. If this increments we know that the lcore running
> > > > the service has re-entered the critical section, so would see an
> > > > updated "needs atomic" flag.
> > > >
> > > > This approach may introduce a predictable branch on the datapath,
> > > > however the cost of a predictable branch vs always taking an atomic
> > > > is order(s?) of magnitude, so a branch is much preferred.
> > > >
> > > > It must be possible to avoid the datapath overhead using a scheme like
> > this.
> > > It
> > > > will likely be more complex than your proposed change below, however
> > > > if it avoids datapath performance drops I feel that a more complex
> > > > solution is worth investigating at least.
> >
> > > I do not completely understand the approach you are proposing, may be
> > > you can elaborate more.
> >
> > Expanded above, showing a possible solution that does not require additional
> > atomics on the datapath.
> >
> >
> > > But, it seems to be based on a counter approach. Following is my
> > > assessment on what happens if we use a counter. Let us say we kept
> > > track of how many cores are running the service currently. We need an
> > > atomic counter other than 'num_mapped_cores'. Let us call that counter
> > 'num_current_cores'.
> > > The code to call the service would look like below.
> > >
> > > 1) rte_atomic32_inc(&num_current_cores); /* this results in a full
> > > memory barrier */
> > > 2) if (__atomic_load_n(&num_current_cores, __ATOMIC_ACQUIRE) == 1) {
> > > /* rte_atomic_read is not enough here as it does not provide the
> > > required memory barrier for any architecture */
> > > 3) 	run_service(); /* Call the service */
> > > 4) }
> > > 5) rte_atomic32_sub(&num_current_cores); /* Calling rte_atomic32_clear
> > > is not enough as it is not an atomic operation and does not provide
> > > the required memory barrier */
> > >
> > > But the above code has race conditions in lines 1 and 2. It is
> > > possible that none of the cores will ever get to run the service as
> > > they all could simultaneously increment the counter. Hence lines 1 and
> > > 2 together need to be atomic, which is nothing but 'compare-exchange'
> > operation.
> > >
> > > BTW, the current code has a bug where it calls 'rte_atomic_clear(&s-
> > > >execute_lock)', it is missing memory barriers which results in
> > > >clearing the
> > > execute_lock before the service has completed running. I suggest
> > > changing the 'execute_lock' to rte_spinlock_t and using
> > > rte_spinlock_try_lock and rte_spinlock_unlock APIs.
> >
> > I don't think a spinlock is what we want here:
> >
> > The idea is that a service-lcore can be mapped to multiple services.
> > If one service is already being run (by another thread), we do not want to
> spin
> > here waiting for it to become "free" to run by this thread, it should
> continue
> > to the next service that it is mapped to.
> Agree. I am suggesting to use 'rte_spinlock_try_lock' (does not spin) which is
> nothing but 'compare-exchange'. Since the API is available, we should make use
> of it instead of repeating the code.

Ah apologies, I misread the spinlock usage. Sure if the spinlock_t code
is preferred I'm ok with a change. It would be clean to have a separate
patch in the patchset to make this change, and have it later in the set
than the changes for backporting to ease integration with stable branch.


> > > > A unit test is required to validate a fix like this - although
> > > > perhaps found
> > > by
> > > > inspection/review, a real-world test to validate would give confidence.
> > > Agree, need to have a test case.
> > >
> > > >
> > > >
> > > > Thoughts on such an approach?
> > > >
> > <snip patch contents>
  
Honnappa Nagarahalli April 18, 2020, 6:21 a.m. UTC | #6
<snip>

> > > >
> > > > > > Subject: [PATCH v3 09/12] service: avoid race condition for MT
> > > > > > unsafe service
> > > > > >
> > > > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > >
> > > > > > It is possible that a MT unsafe service gets configured to
> > > > > > run on another core while the service is currently running.
> > > > > > This might result in the MT unsafe service running on multiple
> > > > > > cores simultaneously. Always use the 'execute_lock' when the
> > > > > > service is MT unsafe.
> > > > > >
> > > > > > Fixes: e9139a32f6e8 ("service: add function to run on app
> > > > > > lcore")
> > > > > > Cc: stable@dpdk.org
> > > > > >
> > > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > > <honnappa.nagarahalli@arm.com>
> > > > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > >
> > > > > We should put "fix" in the title, once converged on an implementation.
> > > > Ok, will replace 'avoid' with 'fix' (once we agree on the
> > > > solution)
> > > >
> > > > >
> > > > > Regarding Fixes and stable backport, we should consider whether
> > > > > fixing this in stable with a performance degradation, fixing it
> > > > > with a more complex solution, or documenting it as a known issue
> > > > > is the better solution.
> > > > >
> > > > >
> > > > > This fix (always taking the atomic lock) will have a negative
> > > > > performance impact on existing code using services. We should
> > > > > investigate a way to fix
> > > > it
> > > > > without causing datapath performance degradation.
> > > > Trying to gauge the impact on the existing applications...
> > > > The documentation does not explicitly disallow run time mapping of
> > > > cores to service.
> > > > 1) If the applications are mapping the cores to services at run
> > > > time, they are running with a bug. IMO, bug fix resulting in a
> > > > performance drop should be acceptable.
> > > > 2) If the service is configured to run on single core
> > > > (num_mapped_cores == 1), but service is set to MT unsafe - this
> > > > will have a (possible) performance impact.
> > > > 	a) This can be solved by setting the service to MT safe and can
> > > > be documented. This might be a reasonable solution for
> > > > applications which are compiling with
> > > >                    future DPDK releases.
> > > > 	b) We can also solve this using symbol versioning - the old
> > > > version of this function will use the old code, the new version of
> > > > this function will use the code in
> > > >                    this patch. So, if the application is run with
> > > > future DPDK releases without recompiling, it will continue to use
> > > > the old version. If the application is compiled
> > > >                    with future releases, they can use solution in 2a.
> > > > We also should think if this is an appropriate solution as this
> > > > would force 1) to recompile to get the fix.
> > > > 3) If the service is configured to run on multiple cores
> > > > (num_mapped_cores > 1), then for those applications, the lock is
> > > > being taken already. These applications might see some
> > > > improvements as this patch removes few instructions.
> > > >
> > > > >
> > > > > I think there is a way to achieve this by moving more
> > > > > checks/time to the control path (lcore updating the map), and
> > > > > not forcing the datapath lcore to always take an atomic.
> > > > I think 2a above is the solution.
> > >
> > > 2a above is e.g. the Eventdev SW routines like Rx/Tx scheduler services.
> > I scanned through the code briefly
> > I see that Eth RX/TX, Crypto adapters are setting the MT_SAFE
> > capabilities, can be ignored.
> > Timer adaptor and some others do not set MT_SAFE. Seems like the cores
> > to run on are mapped during run time. But it is not clear to me if it
> > can get mapped to run on multiple cores. If they are, they are running with
> the bug.
> 
> EAL will map each service to a single lcore. It will "round-robin" if there are
> more services than service-lcores to run them on. So agree that DPDK's
> default mappings will not suffer this issue.
> 
> 
> > But, these are all internal to DPDK and can be fixed.
> > Are there no performance tests in these components that we can run?
> >
> > > We should strive to not reduce datapath performance at all here.
> > >
> > >
> > > > > In this particular case, we have a counter for number of
> > > > > iterations that a
> > > > Which counter are you thinking about?
> > > > All the counters I checked are not atomic operations currently. If
> > > > we are going to use counters they have to be atomic, which means
> > > > additional cycles in the data path.
> > >
> > > I'll try to explain the concept better, take this example:
I tried to implement this algorithm, but there are a few issues; please see below.

> > >  - One service core is mapped to a MT_UNSAFE service, like event/sw
> > > pmd
> > >  - Application wants to map a 2nd lcore to the same service
> > >  - You point out that today there is a race over the lock
> > >     -- the lock is not taken if (num_mapped_lcores == 1)
> > >     -- this avoids an atomic lock/unlock on the datapath
> > >
> > > To achieve our desired goal;
> > >  - control thread doing mapping performs the following operations
> > >     -- write service->num_mapped_lcores++ (atomic not required, only
> > > single- writer allowed by APIs)
> > This has to be atomic because of rte_service_run_iter_on_app_lcore API.
> > Performance should be fine as this API is not called frequently. But
> > need to consider the implications of more than one thread updating
> num_mapped_cores.
> >
> > >     -- MFENCE (full st-ld barrier) to flush write, and force later
> > > loads to
> > issue
> > > after
> > I am not exactly sure what MFENCE on x86 does. On Arm platforms, the
> > full barrier (DMB ISH) just makes sure that memory operations are not
> > re-ordered around it. It does not say anything about when that store
> > is visible to other cores. It will be visible at some point in time to cores.
> > But, I do not think we need to be worried about flushing to memory.
> >
> > >     -- read the "calls_per_service" counter for each lcores, add them up.
> > This can be trimmed down to the single core the service is mapped to
> > currently, no need to add all the counters.
> 
> Correct - however that requires figuring out which lcore is running the service.
> Anyway, agree - it's an implementation detail as to exactly how we detect it.
> 
> >
> > >     ---- Wait :)
> > >     -- re-read the "calls_per_service", and ensure the count has changed.
Here, there is an assumption that the service runner function is actually executing on the service core. If the service core is not running, the code will be stuck in this polling loop.

I could not come up with a good way to check if the service core is running. Checking the app_runstate and comp_runstate is not enough, as they just indicate that the service is ready to run. Using the counter 'calls_per_service' introduces race conditions.

The only way I can think of is asking the user to follow a specific sequence of APIs to ensure the service core is running before calling rte_service_map_lcore_set.
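
Something like the following ordering (all existing APIs) would guarantee the runner loop is alive before any remapping handshake; whether to document or enforce it is the open question:

	rte_service_lcore_add(lcore);
	rte_service_component_runstate_set(sid, 1);
	rte_service_runstate_set(sid, 1);
	rte_service_lcore_start(lcore);           /* runner is now iterating */
	rte_service_map_lcore_set(sid, lcore, 1); /* only after it runs */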


> > Basically, polling. This causes more traffic on the interconnect
> > between the cores. But might be ok since this API might not be called
> frequently.
> 
> Agree this will not be called frequently, and that some polling here will not be
> a problem.
> 
> 
> > >     ---- The fact that the calls_per_service has changed ensures the
> > service-
> > > lcore
> > >          has seen the new "num_mapped_cores" value, and has now
> > > taken the lock!
> > >     -- *now* it is safe to map the 2nd lcore to the service
> > >
> > > There is a caveat here that the increments to the "calls_per_service"
> > variable
> > > must become globally-observable. To force this immediately would
> > > require a write memory barrier, which could impact datapath
> > > performance. Given the service is now taking a lock, the unlock()
> > > thereof would ensure the "calls_per_service"
> > > is flushed to memory.
> > If we increment this variable only when the lock is held, we should be fine.
> > We could have a separate variable.
> 
> Sure, if a separate variable is preferred that's fine with me.
> 
> 
> > > Note: we could use calls_per_service, or add a new variable to the
> > > service struct.
> > > Writes to this do not need to be atomic, as it is either mapped to a
> > > single
> > core,
> > > or else there's a lock around it.
> > I think it is better to have a separate variable that is updated only
> > when the lock is held.
> > I do not see any change in API sequence. We do this hand-shake only if
> > the service is running (which is all controlled in the writer thread), correct?
> 
> Yes this increment can be localized to just the branch when the unlock()
> occurs, as that is the only time it could make a difference.
> 
> > This does not solve the problem with rte_service_run_iter_on_app_lcore
> > getting called on multiple cores concurrently for the same service.
> 
> Agreed. This "on_app_lcore" API was an addition required to enable unit-
> testing in a sane way, to run iterations of eg Eventdev PMD.
> 
> I am in favor of documenting that the application is responsible to ensure the
> service being run on a specific application lcore is not concurrently running on
> another application lcore.
> 
> 
> > > > > service has done. If this increments we know that the lcore
> > > > > running the service has re-entered the critical section, so
> > > > > would see an updated "needs atomic" flag.
> > > > >
> > > > > This approach may introduce a predictable branch on the
> > > > > datapath, however the cost of a predictable branch vs always
> > > > > taking an atomic is order(s?) of magnitude, so a branch is much
> preferred.
> > > > >
> > > > > It must be possible to avoid the datapath overhead using a
> > > > > scheme like
> > > this.
> > > > It
> > > > > will likely be more complex than your proposed change below,
> > > > > however if it avoids datapath performance drops I feel that a
> > > > > more complex solution is worth investigating at least.
> > >
> > > > I do not completely understand the approach you are proposing, may
> > > > be you can elaborate more.
> > >
> > > Expanded above, showing a possible solution that does not require
> > > additional atomics on the datapath.
> > >
> > >
> > > > But, it seems to be based on a counter approach. Following is my
> > > > assessment on what happens if we use a counter. Let us say we kept
> > > > track of how many cores are running the service currently. We need
> > > > an atomic counter other than 'num_mapped_cores'. Let us call that
> > > > counter
> > > 'num_current_cores'.
> > > > The code to call the service would look like below.
> > > >
> > > > 1) rte_atomic32_inc(&num_current_cores); /* this results in a full
> > > > memory barrier */
> > > > 2) if (__atomic_load_n(&num_current_cores, __ATOMIC_ACQUIRE) == 1)
> > > > {
> > > > /* rte_atomic_read is not enough here as it does not provide the
> > > > required memory barrier for any architecture */
> > > > 3) 	run_service(); /* Call the service */
> > > > 4) }
> > > > 5) rte_atomic32_sub(&num_current_cores); /* Calling
> > > > rte_atomic32_clear is not enough as it is not an atomic operation
> > > > and does not provide the required memory barrier */
> > > >
> > > > But the above code has race conditions in lines 1 and 2. It is
> > > > possible that none of the cores will ever get to run the service
> > > > as they all could simultaneously increment the counter. Hence
> > > > lines 1 and
> > > > 2 together need to be atomic, which is nothing but 'compare-exchange'
> > > operation.
> > > >
> > > > BTW, the current code has a bug where it calls
> > > > 'rte_atomic_clear(&s-
> > > > >execute_lock)', it is missing memory barriers which results in
> > > > >clearing the
> > > > execute_lock before the service has completed running. I suggest
> > > > changing the 'execute_lock' to rte_spinlock_t and using
> > > > rte_spinlock_try_lock and rte_spinlock_unlock APIs.
> > >
> > > I don't think a spinlock is what we want here:
> > >
> > > The idea is that a service-lcore can be mapped to multiple services.
> > > If one service is already being run (by another thread), we do not
> > > want to
> > spin
> > > here waiting for it to become "free" to run by this thread, it
> > > should
> > continue
> > > to the next service that it is mapped to.
> > Agree. I am suggesting to use 'rte_spinlock_try_lock' (does not spin)
> > which is nothing but 'compare-exchange'. Since the API is available,
> > we should make use of it instead of repeating the code.
> 
> Ah apologies, I misread the spinlock usage. Sure if the spinlock_t code is
> preferred I'm ok with a change. It would be clean to have a separate patch in
> the patchset to make this change, and have it later in the set than the changes
> for backporting to ease integration with stable branch.
> 
> 
> > > > > A unit test is required to validate a fix like this - although
> > > > > perhaps found
> > > > by
> > > > > inspection/review, a real-world test to validate would give confidence.
> > > > Agree, need to have a test case.
> > > >
> > > > >
> > > > >
> > > > > Thoughts on such an approach?
> > > > >
> > > <snip patch contents>
  
Van Haaren, Harry April 21, 2020, 5:43 p.m. UTC | #7
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Saturday, April 18, 2020 7:22 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 09/12] service: avoid race condition for MT unsafe service
> 
> <snip>
<snip snip>
> > > > To achieve our desired goal;
> > > >  - control thread doing mapping performs the following operations
> > > >     -- write service->num_mapped_lcores++ (atomic not required, only
> > > > single- writer allowed by APIs)
> > > This has to be atomic because of rte_service_run_iter_on_app_lcore API.
> > > Performance should be fine as this API is not called frequently. But
> > > need to consider the implications of more than one thread updating
> > num_mapped_cores.
> > >
> > > >     -- MFENCE (full st-ld barrier) to flush write, and force later
> > > > loads to
> > > issue
> > > > after
> > > I am not exactly sure what MFENCE on x86 does. On Arm platforms, the
> > > full barrier (DMB ISH) just makes sure that memory operations are not
> > > re-ordered around it. It does not say anything about when that store
> > > is visible to other cores. It will be visible at some point in time to cores.
> > > But, I do not think we need to be worried about flushing to memory.
> > >
> > > >     -- read the "calls_per_service" counter for each lcores, add them up.
> > > This can be trimmed down to the single core the service is mapped to
> > > currently, no need to add all the counters.
> >
> > Correct - however that requires figuring out which lcore is running the service.
> > Anyway, agree - it's an implementation detail as to exactly how we detect it.
> >
> > >
> > > >     ---- Wait :)
> > > >     -- re-read the "calls_per_service", and ensure the count has changed.
> Here, there is an assumption that the service core function is running on the
> service core. If the service core is not running, the code will be stuck in this
> polling loop.

Right - we could add a timeout here, but that just moves the problem somewhere
else (the application) which now needs to handle error returns, and possibly
retries. It could be a possible solution.. I'm not in favour of it at the
moment, but it needs some more thought.

> I could not come up with a good way to check if the service core is running.
> Checking the app_runstate and comp_runstate is not enough as they just
> indicate that the service is ready to run. Using the counter 'calls_per_service'
> introduces race conditions.
> 
> Only way I can think of is asking the user to follow a specific sequence of APIs to
> ensure the service core is running before calling rte_service_map_lcore_set.

Good point - I'm thinking about this - but haven't come to an obvious conclusion yet.
I'm considering other ways to detect whether the core is/isn't running, and also
considering just "hijacking" the service function pointer temporarily with a CAS,
which gives some new options on avoiding threads entering the critical section.

As above, I don't have a good solution yet.
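
Very roughly, the function-pointer idea could look like the sketch below;
this is entirely speculative, 'noop_cb' is invented, and the wait for any
in-flight callback is elided:

static int32_t noop_cb(void *userdata)
{
	RTE_SET_USED(userdata);
	return 0;
}

	/* control thread: atomically park the callback so the runner
	 * cannot enter the critical section while the map is changed */
	rte_service_func orig = __atomic_exchange_n(&s->spec.callback,
			noop_cb, __ATOMIC_ACQ_REL);
	/* ... wait out any in-flight callback, update the map ... */
	__atomic_store_n(&s->spec.callback, orig, __ATOMIC_RELEASE);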

<snip irrelevant to above discussion stuff>
  

Patch

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 557b5a9..32a2f8a 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@  struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -370,12 +374,7 @@  service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;