service: fix stop API to wait for service thread

Message ID 20200720120938.34660-1-harry.van.haaren@intel.com (mailing list archive)
State Superseded, archived
Series service: fix stop API to wait for service thread

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/Intel-compilation fail Compilation issues
ci/iol-testing success Testing PASS

Commit Message

Van Haaren, Harry July 20, 2020, 12:09 p.m. UTC
  This commit improves the service_lcore_stop() implementation
by waiting for the service core in question to return. The
service thread now maintains a variable indicating whether it
is active: when zero, the service thread has completed its
services and returned from the service_runner_func() function.

This fixes a race condition observed in the DPDK CI, where the
statistics of the service did not match expectations because the
service thread was still running, incrementing a stat after stop
was called.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

---

This is one possible solution that avoids a class of race conditions
arising from the stop() API and subsequent behaviour. Without a change
in the implementation of the service core thread, we could not detect
when the thread had actually finished. This is now possible, and the
stop API makes use of it to wait up to 1000 times one millisecond, or
log a warning that a service core didn't return quickly.

Thanks for the discussion/debugging on the list - I'm not sure how to
add the Reported-by/Suggested-by etc. tags, but I'll resend a v2 (or
they can be added on apply).
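
For context, a minimal sketch of the caller pattern that hits this
race (hypothetical test code, not taken from the patch or the CI
test; service_id and lcore_id are assumed to be set up already):

    /* Hypothetical reproduction of the race; not from the patch. */
    rte_service_lcore_start(lcore_id);
    /* ... service runs for a while ... */
    rte_service_runstate_set(service_id, 0);
    rte_service_lcore_stop(lcore_id);

    /* Before this fix, the service thread could still be inside
     * service_runner_func() here, so reading the call count below
     * could race with one final increment by that thread.
     */
    uint64_t calls;
    rte_service_attr_get(service_id, RTE_SERVICE_ATTR_CALL_COUNT, &calls);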

---
 lib/librte_eal/common/rte_service.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
  

Comments

Lukasz Wojciechowski July 20, 2020, 12:51 p.m. UTC | #1
On 20.07.2020 at 14:09, Harry van Haaren wrote:
> This commit improves the service_lcore_stop() implementation
> by waiting for the service core in question to return. The
> service thread now maintains a variable indicating whether it
> is active: when zero, the service thread has completed its
> services and returned from the service_runner_func() function.
>
> This fixes a race condition observed in the DPDK CI, where the
> statistics of the service did not match expectations because the
> service thread was still running, incrementing a stat after stop
> was called.
>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>
> ---
>
> This is one possible solution that avoids a class of race conditions
> arising from the stop() API and subsequent behaviour. Without a change
> in the implementation of the service core thread, we could not detect
> when the thread had actually finished. This is now possible, and the
> stop API makes use of it to wait up to 1000 times one millisecond, or
> log a warning that a service core didn't return quickly.
>
> Thanks for the discussion/debugging on the list - I'm not sure how to
> add the Reported-by/Suggested-by etc. tags, but I'll resend a v2 (or
> they can be added on apply).
>
> ---
>   lib/librte_eal/common/rte_service.c | 24 ++++++++++++++++++++++++
>   1 file changed, 24 insertions(+)
>
> diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
> index 6a0e0ff65..d2255587d 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -65,6 +65,7 @@ struct core_state {
>   	/* map of services IDs are run on this core */
>   	uint64_t service_mask;
>   	uint8_t runstate; /* running or stopped */
> +	uint8_t thread_active; /* indicates when the thread is in service_run() */
>   	uint8_t is_service_core; /* set if core is currently a service core */
>   	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
>   	uint64_t loops;
> @@ -457,6 +458,8 @@ service_runner_func(void *arg)
>   	const int lcore = rte_lcore_id();
>   	struct core_state *cs = &lcore_states[lcore];
>   
> +	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
> +
>   	/* runstate act as the guard variable. Use load-acquire
>   	 * memory order here to synchronize with store-release
>   	 * in runstate update functions.
> @@ -475,6 +478,7 @@ service_runner_func(void *arg)
>   		cs->loops++;
>   	}
>   
> +	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
>   	return 0;
>   }
>   
> @@ -765,6 +769,26 @@ rte_service_lcore_stop(uint32_t lcore)
>   	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
>   		__ATOMIC_RELEASE);
>   
> +	/* wait for service lcore to return */
> +	i = 0;
> +	uint8_t active;
> +	uint64_t start = rte_rdtsc();
> +	do {
> +		active = __atomic_load_n(&lcore_states[lcore].thread_active,
> +					 __ATOMIC_RELAXED);
> +		if (active == 0)
> +			break;
> +		rte_delay_ms(1);
> +		i++;
> +	} while (i < 1000);
> +
> +	if (active != 0) {
> +		uint64_t end = rte_rdtsc();
> +		RTE_LOG(WARNING, EAL,
> +			"service lcore stop() failed, waited for %ld cycles\n",
> +			end - start);
> +	}
> +
>   	return 0;
>   }
>   
I don't like the idea of inserting this polling loop inside the API
call, and I don't like hardcoding a 1000-iteration limit.
How about keeping the thread_active flag, but moving the check of this
flag's state to a separate function? That way the user of the API would
be able to write their own loop.
They might want a custom loop because they are:
* waiting for more cores
* willing to wait longer
* checking less often whether the service has finished...
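
As an illustration of this suggestion, here is a minimal sketch of
such a helper (the function name and exact semantics are assumptions
for illustration, not the actual v2 API):

    /* Hypothetical helper, per the suggestion above: expose the
     * thread_active flag so that callers can poll it themselves.
     */
    int32_t
    rte_service_lcore_may_be_active(uint32_t lcore)
    {
            if (lcore >= RTE_MAX_LCORE ||
                            !lcore_states[lcore].is_service_core)
                    return -EINVAL;

            /* Relaxed load is enough here: callers are expected
             * to poll this value in their own loop.
             */
            return __atomic_load_n(&lcore_states[lcore].thread_active,
                                   __ATOMIC_RELAXED);
    }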
  
Van Haaren, Harry July 20, 2020, 2:20 p.m. UTC | #2
> -----Original Message-----
> From: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
> Sent: Monday, July 20, 2020 1:52 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Cc: david.marchand@redhat.com; igor.romanov@oktetlabs.ru;
> honnappa.nagarahalli@arm.com; Yigit, Ferruh <ferruh.yigit@intel.com>;
> nd@arm.com; aconole@redhat.com
> Subject: Re: [PATCH] service: fix stop API to wait for service thread
> 
> 
> On 20.07.2020 at 14:09, Harry van Haaren wrote:
> > This commit improves the service_lcore_stop() implementation
> > by waiting for the service core in question to return. The
> > service thread now maintains a variable indicating whether it
> > is active: when zero, the service thread has completed its
> > services and returned from the service_runner_func() function.
> >
> > This fixes a race condition observed in the DPDK CI, where the
> > statistics of the service did not match expectations because the
> > service thread was still running, incrementing a stat after stop
> > was called.
> >
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> >
> > ---
> >
> > This is one possible solution that avoids a class of race conditions
> > arising from the stop() API and subsequent behaviour. Without a change
> > in the implementation of the service core thread, we could not detect
> > when the thread had actually finished. This is now possible, and the
> > stop API makes use of it to wait up to 1000 times one millisecond, or
> > log a warning that a service core didn't return quickly.
> >
> > Thanks for the discussion/debugging on the list - I'm not sure how to
> > add the Reported-by/Suggested-by etc. tags, but I'll resend a v2 (or
> > they can be added on apply).
> >
> > ---
> >   lib/librte_eal/common/rte_service.c | 24 ++++++++++++++++++++++++
> >   1 file changed, 24 insertions(+)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> > index 6a0e0ff65..d2255587d 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -65,6 +65,7 @@ struct core_state {
> >   	/* map of services IDs are run on this core */
> >   	uint64_t service_mask;
> >   	uint8_t runstate; /* running or stopped */
> > +	uint8_t thread_active; /* indicates when the thread is in service_run() */
> >   	uint8_t is_service_core; /* set if core is currently a service core */
> >   	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
> >   	uint64_t loops;
> > @@ -457,6 +458,8 @@ service_runner_func(void *arg)
> >   	const int lcore = rte_lcore_id();
> >   	struct core_state *cs = &lcore_states[lcore];
> >
> > +	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
> > +
> >   	/* runstate act as the guard variable. Use load-acquire
> >   	 * memory order here to synchronize with store-release
> >   	 * in runstate update functions.
> > @@ -475,6 +478,7 @@ service_runner_func(void *arg)
> >   		cs->loops++;
> >   	}
> >
> > +	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
> >   	return 0;
> >   }
> >
> > @@ -765,6 +769,26 @@ rte_service_lcore_stop(uint32_t lcore)
> >   	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
> >   		__ATOMIC_RELEASE);
> >
> > +	/* wait for service lcore to return */
> > +	i = 0;
> > +	uint8_t active;
> > +	uint64_t start = rte_rdtsc();
> > +	do {
> > +		active = __atomic_load_n(&lcore_states[lcore].thread_active,
> > +					 __ATOMIC_RELAXED);
> > +		if (active == 0)
> > +			break;
> > +		rte_delay_ms(1);
> > +		i++;
> > +	} while (i < 1000);
> > +
> > +	if (active != 0) {
> > +		uint64_t end = rte_rdtsc();
> > +		RTE_LOG(WARNING, EAL,
> > +			"service lcore stop() failed, waited for %ld cycles\n",
> > +			end - start);
> > +	}
> > +
> >   	return 0;
> >   }
> >
> I don't like the idea of inserting this polling loop inside the API
> call, and I don't like hardcoding a 1000-iteration limit.
> How about keeping the thread_active flag, but moving the check of this
> flag's state to a separate function? That way the user of the API would
> be able to write their own loop.
> They might want a custom loop because they are:
> * waiting for more cores
> * willing to wait longer
> * checking less often whether the service has finished...

Agree - good feedback, thanks. v2 on the way, with this approach.
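
With a helper like the one sketched above, the caller-side wait loop
might look as follows (hypothetical usage; the timeout becomes an
application choice rather than a constant baked into the library):

    /* Hypothetical caller-side wait: the application picks its own
     * timeout and polling interval.
     */
    rte_service_lcore_stop(lcore_id);

    int timeout_ms = 5000;
    while (rte_service_lcore_may_be_active(lcore_id) == 1 &&
                    timeout_ms-- > 0)
            rte_delay_ms(1);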
  

Patch

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6a0e0ff65..d2255587d 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -65,6 +65,7 @@  struct core_state {
 	/* map of services IDs are run on this core */
 	uint64_t service_mask;
 	uint8_t runstate; /* running or stopped */
+	uint8_t thread_active; /* indicates when the thread is in service_run() */
 	uint8_t is_service_core; /* set if core is currently a service core */
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
@@ -457,6 +458,8 @@  service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
+	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
+
 	/* runstate act as the guard variable. Use load-acquire
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
@@ -475,6 +478,7 @@  service_runner_func(void *arg)
 		cs->loops++;
 	}
 
+	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
 	return 0;
 }
 
@@ -765,6 +769,26 @@  rte_service_lcore_stop(uint32_t lcore)
 	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
 		__ATOMIC_RELEASE);
 
+	/* wait for service lcore to return */
+	i = 0;
+	uint8_t active;
+	uint64_t start = rte_rdtsc();
+	do {
+		active = __atomic_load_n(&lcore_states[lcore].thread_active,
+					 __ATOMIC_RELAXED);
+		if (active == 0)
+			break;
+		rte_delay_ms(1);
+		i++;
+	} while (i < 1000);
+
+	if (active != 0) {
+		uint64_t end = rte_rdtsc();
+		RTE_LOG(WARNING, EAL,
+			"service lcore stop() failed, waited for %ld cycles\n",
+			end - start);
+	}
+
 	return 0;
 }