[v2] eal/service: fix exit by resetting service lcores
Commit Message
This commit releases all service cores from their role,
returning them to ROLE_RTE on rte_service_finalize().
This may fix an issue relating to the service cores causing
a race-condition on eal_cleanup(), where the service core
could still be executing while the main thread has already
freed the service memory, leading to a segfault.
Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: stable@dpdk.org
Reported-by: David Marchand <david.marchand@redhat.com>
Reported-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Aaron Conole <aconole@redhat.com>
---
v2:
- Added rte_eal_mp_wait_lcore() after reset (David)
- Added Signed-off and Acked from mailing list (David, Aaron)
---
lib/librte_eal/common/rte_service.c | 3 +++
1 file changed, 3 insertions(+)
Comments
On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren
<harry.van.haaren@intel.com> wrote:
>
> This commit releases all service cores from their role,
> returning them to ROLE_RTE on rte_service_finalize().
>
> This may fix an issue relating to the service cores causing
You don't seem convinced.
> a race-condition on eal_cleanup(), where the service core
> could still be executing while the main thread has already
> free-d the service memory, leading to a segfault.
>
> Fixes: 21698354c832 ("service: introduce service cores concept")
> Cc: stable@dpdk.org
>
> Reported-by: David Marchand <david.marchand@redhat.com>
> Reported-by: Aaron Conole <aconole@redhat.com>
> Signed-off-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> Acked-by: Aaron Conole <aconole@redhat.com>
I am okay with merging this so that we stop getting random failures of the ut.
I will leave this patch on the ml and apply it on Friday at the latest.
Please take the time to reply to my question.
Thanks.
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Wednesday, March 11, 2020 4:16 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: dev <dev@dpdk.org>; Aaron Conole <aconole@redhat.com>; dpdk stable
> <stable@dpdk.org>
> Subject: Re: [PATCH v2] eal/service: fix exit by resetting service lcores
>
> On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren
> <harry.van.haaren@intel.com> wrote:
> >
> > This commit releases all service cores from their role,
> > returning them to ROLE_RTE on rte_service_finalize().
> >
> > This may fix an issue relating to the service cores causing
>
> You don't seem convinced.
Apologies - this was kept from the v1 commit message; I should have removed "may" for v2.
The issue was that service cores could still be running after the main
thread had freed the service-core memory; a service lcore returning
late then caused a use-after-free.
This commit fixes it by
A) resetting all service cores to return
B) waiting for them to return
C) freeing memory
I am confident in the fix.
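The ordering in A) to C) can be sketched outside of DPDK with plain pthreads. This is only an illustration of the pattern, not the rte_service code: struct worker_state, worker_main() and finalize_in_order() are hypothetical names, and a pthread stands in for a service lcore.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-core state that gets freed at exit. */
struct worker_state {
	atomic_int run;      /* cleared to ask the worker to return */
	atomic_ulong loops;  /* touched by the worker while it runs */
};

/* Stand-in for a service core: keeps using the state until told to stop. */
static void *
worker_main(void *arg)
{
	struct worker_state *st = arg;

	while (atomic_load(&st->run))
		atomic_fetch_add(&st->loops, 1);

	return NULL;
}

/* Returns 0 on success, -1 if the state or the thread cannot be set up. */
static int
finalize_in_order(void)
{
	struct worker_state *st = calloc(1, sizeof(*st));
	pthread_t tid;

	if (st == NULL)
		return -1;

	atomic_init(&st->run, 1);
	atomic_init(&st->loops, 0);

	if (pthread_create(&tid, NULL, worker_main, st) != 0) {
		free(st);
		return -1;
	}

	atomic_store(&st->run, 0);  /* A) ask the worker to return */
	pthread_join(tid, NULL);    /* B) wait until it has returned */
	free(st);                   /* C) only now free the memory */

	return 0;
}
```

Swapping steps B) and C) in this sketch would let worker_main() read st->run after free(), which is the same kind of use-after-free described above.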
> > a race-condition on eal_cleanup(), where the service core
> > could still be executing while the main thread has already
> > free-d the service memory, leading to a segfault.
> >
> > Fixes: 21698354c832 ("service: introduce service cores concept")
> > Cc: stable@dpdk.org
> >
> > Reported-by: David Marchand <david.marchand@redhat.com>
> > Reported-by: Aaron Conole <aconole@redhat.com>
> > Signed-off-by: David Marchand <david.marchand@redhat.com>
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > Acked-by: Aaron Conole <aconole@redhat.com>
>
> I am okay with merging this so that we stop getting random failures of the
> ut. I will let this patch on the ml and apply on Friday at worse.
>
> Please take the time to reply to my question.
> Thanks.
Thanks, -Harry
David Marchand <david.marchand@redhat.com> writes:
> On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren
> <harry.van.haaren@intel.com> wrote:
>>
>> This commit releases all service cores from their role,
>> returning them to ROLE_RTE on rte_service_finalize().
>>
>> This may fix an issue relating to the service cores causing
>
> You don't seem convinced.
>
>
>> a race-condition on eal_cleanup(), where the service core
>> could still be executing while the main thread has already
>> free-d the service memory, leading to a segfault.
>>
>> Fixes: 21698354c832 ("service: introduce service cores concept")
>> Cc: stable@dpdk.org
>>
>> Reported-by: David Marchand <david.marchand@redhat.com>
>> Reported-by: Aaron Conole <aconole@redhat.com>
>> Signed-off-by: David Marchand <david.marchand@redhat.com>
>> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>> Acked-by: Aaron Conole <aconole@redhat.com>
>
> I am okay with merging this so that we stop getting random failures of the ut.
I think it could also potentially cause errors in user applications that
regularly exit, and which use the service core architecture. So it's
worth getting in now, anyway.
> I will let this patch on the ml and apply on Friday at worse.
>
> Please take the time to reply to my question.
> Thanks.
Hello,
On Wed, Mar 11, 2020 at 5:21 PM Van Haaren, Harry
<harry.van.haaren@intel.com> wrote:
> Issue was that service cores can remain running while main thread
> has freed service-core memory, later racy return of service lcore
> then causes use-after-free.
>
> This commit fixes it by
> A) resetting all service cores to return
> B) waiting for them to return
> C) freeing memory
>
> I am confident in the fix.
Ok.
> > > a race-condition on eal_cleanup(), where the service core
> > > could still be executing while the main thread has already
> > > free-d the service memory, leading to a segfault.
> > >
> > > Fixes: 21698354c832 ("service: introduce service cores concept")
The race per se was introduced with:
da23f0aa87d8 ("service: fix memory leak with new function")
On Wed, Mar 11, 2020 at 6:08 PM Aaron Conole <aconole@redhat.com> wrote:
>
> David Marchand <david.marchand@redhat.com> writes:
>
> > On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren
> > <harry.van.haaren@intel.com> wrote:
> >>
> >> This commit releases all service cores from their role,
> >> returning them to ROLE_RTE on rte_service_finalize().
> >>
> >> This may fix an issue relating to the service cores causing
> >
> > You don't seem convinced.
> >
> >
> >> a race-condition on eal_cleanup(), where the service core
> >> could still be executing while the main thread has already
> >> free-d the service memory, leading to a segfault.
> >>
> >> Fixes: 21698354c832 ("service: introduce service cores concept")
> >> Cc: stable@dpdk.org
> >>
> >> Reported-by: David Marchand <david.marchand@redhat.com>
> >> Reported-by: Aaron Conole <aconole@redhat.com>
> >> Signed-off-by: David Marchand <david.marchand@redhat.com>
> >> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> >> Acked-by: Aaron Conole <aconole@redhat.com>
> >
> > I am okay with merging this so that we stop getting random failures of the ut.
>
> I think it could also potentially cause errors in user applications that
> regularly exit, and which use the service core architecture. So it's
> worth getting in now, anyway.
Indeed, thanks for the clarification.
In my defense, we did not get any reports of such crashes outside of the CI.
The CI is the main reason why I (selfishly :-)) have been pressing on
this issue.
On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren
<harry.van.haaren@intel.com> wrote:
>
> This commit releases all service cores from their role,
> returning them to ROLE_RTE on rte_service_finalize().
>
> This may fix an issue relating to the service cores causing
s/may fix/fixes/
> a race-condition on eal_cleanup(), where the service core
> could still be executing while the main thread has already
> free-d the service memory, leading to a segfault.
>
> Fixes: 21698354c832 ("service: introduce service cores concept")
Replaced with:
Fixes: da23f0aa87d8 ("service: fix memory leak with new function")
> Cc: stable@dpdk.org
>
> Reported-by: David Marchand <david.marchand@redhat.com>
> Reported-by: Aaron Conole <aconole@redhat.com>
> Signed-off-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> Acked-by: Aaron Conole <aconole@redhat.com>
Applied, thanks.
On 13-Mar-20 10:04 AM, David Marchand wrote:
> On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren
> <harry.van.haaren@intel.com> wrote:
>>
>> This commit releases all service cores from their role,
>> returning them to ROLE_RTE on rte_service_finalize().
>>
>> This may fix an issue relating to the service cores causing
>
> s/may fix/fixes/
>
>> a race-condition on eal_cleanup(), where the service core
>> could still be executing while the main thread has already
>> free-d the service memory, leading to a segfault.
>>
>> Fixes: 21698354c832 ("service: introduce service cores concept")
>
> Replaced with:
> Fixes: da23f0aa87d8 ("service: fix memory leak with new function")
>
>> Cc: stable@dpdk.org
>>
>> Reported-by: David Marchand <david.marchand@redhat.com>
>> Reported-by: Aaron Conole <aconole@redhat.com>
>> Signed-off-by: David Marchand <david.marchand@redhat.com>
>> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>> Acked-by: Aaron Conole <aconole@redhat.com>
>
> Applied, thanks.
>
>
This patch breaks a couple of apps (or rather the apps were broken to
begin with, but the brokenness has been exposed by this patch).
A "good" way to handle a SIGINT is to catch it, set some kind of global
exit flag, and return from the signal handler, so that all of the threads
see the exit flag, stop spinning, exit the main loop and proceed to
shut down gracefully. That's what the majority of our apps do.
A bad way to handle SIGINT is to call rte_exit() inside the signal
handler, without setting any global exit flags. Since rte_exit() now
waits for all of the threads to stop, the exit will never actually
happen because threads can't stop without an exit signal, and no exit
signal was provided by the signal handler.
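As a minimal sketch of the "good" pattern, assuming nothing DPDK-specific: force_quit, signal_handler(), install_handlers() and worker_loop() are hypothetical names used for illustration, and worker_loop() stands in for whatever each lcore's main loop does.

```c
#include <signal.h>

/* Hypothetical global exit flag; sig_atomic_t is safe to set in a handler. */
static volatile sig_atomic_t force_quit;

/* Only record the request; no exit or cleanup calls in signal context. */
static void
signal_handler(int signum)
{
	if (signum == SIGINT || signum == SIGTERM)
		force_quit = 1;
}

/* Returns 0 on success, -1 if a handler could not be installed. */
static int
install_handlers(void)
{
	if (signal(SIGINT, signal_handler) == SIG_ERR)
		return -1;
	if (signal(SIGTERM, signal_handler) == SIG_ERR)
		return -1;
	return 0;
}

/* Stand-in for a per-lcore main loop: spin until the flag is seen. */
static int
worker_loop(void)
{
	int iterations = 0;

	while (!force_quit)
		iterations++; /* packet processing would go here */

	return iterations;
}
```

With this shape, the main thread would leave its own loop once force_quit is set, wait for the workers, and only then run the normal cleanup path, so nothing has to call rte_exit() from inside the handler.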
Affected apps:
* l3fwd-power (I'm preparing a patch)
* ip_reassembly (see main.c:988) - +Konstantin
There are also a bunch of apps that simply call exit(0) and do an unclean
shutdown without DPDK cleanup, and also apps where I have no idea what
they're doing (calling kill() on themselves in the SIGINT handler? l3fwd-cat
does that, and so do a bunch of others), but this is probably a bigger
problem that should be addressed separately.
"Burakov, Anatoly" <anatoly.burakov@intel.com> writes:
> On 13-Mar-20 10:04 AM, David Marchand wrote:
>> On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren
>> <harry.van.haaren@intel.com> wrote:
>>>
>>> This commit releases all service cores from their role,
>>> returning them to ROLE_RTE on rte_service_finalize().
>>>
>>> This may fix an issue relating to the service cores causing
>>
>> s/may fix/fixes/
>>
>>> a race-condition on eal_cleanup(), where the service core
>>> could still be executing while the main thread has already
>>> free-d the service memory, leading to a segfault.
>>>
>>> Fixes: 21698354c832 ("service: introduce service cores concept")
>>
>> Replaced with:
>> Fixes: da23f0aa87d8 ("service: fix memory leak with new function")
>>
>>> Cc: stable@dpdk.org
>>>
>>> Reported-by: David Marchand <david.marchand@redhat.com>
>>> Reported-by: Aaron Conole <aconole@redhat.com>
>>> Signed-off-by: David Marchand <david.marchand@redhat.com>
>>> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>>> Acked-by: Aaron Conole <aconole@redhat.com>
>>
>> Applied, thanks.
>>
>>
>
> This patch breaks a couple of apps (or rather the apps were broken to
> begin with, but the brokenness has been exposed with this patch).
>
> A "good" way to handle a SIGINT is to catch it, set some kind of
> global exit flag, and exit the signal handler, so that all of the
> threads see the exit flag, stop spinning, and exit the main loop and
> proceed to gracefully shutdown. That's what majority of our apps do.
>
> A bad way to handle SIGINT is to call rte_exit() inside the signal
> handler, without setting any global exit flags. Since rte_exit() now
> waits for all of the threads to stop, the exit will never actually
> happen because threads can't stop without an exit signal, and no exit
> signal was provided by the signal handler.
Yes, I don't consider it 'breaking' anything - exit in signal handlers
is always a bad idea. I guess we should correct the examples to show
this.
> Affected apps:
>
> * l3fwd-power (i'm preparing a patch)
> * ip_reassembly (see main.c:988) - +Konstantin
>
> There are also a bunch of apps that simply call exit(0) and do unclean
> shutdown without DPDK cleanup, and also apps i have no idea what
> they're doing (call kill() on themselves in the SIGINT handler?
> l3fwd-cat does that, so do a bunch of others), but this is probably a
> bigger problem that should be addressed separately.
I think one way to mitigate this is to register an atexit() function
that will check if the EAL is currently initialized and do the needed
cleanup call. I don't know if there are any side-effects that we need
to consider for it, though.
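A rough sketch of that idea, using only the standard C atexit(): runtime_initialized, runtime_init() and runtime_cleanup() are hypothetical stand-ins (not EAL symbols), with runtime_cleanup() playing the role of the EAL cleanup call.

```c
#include <stdbool.h>
#include <stdlib.h>

static bool runtime_initialized; /* hypothetical "is initialized" flag */
static int cleanup_calls;        /* counts how often real cleanup ran */

/* Stand-in for the cleanup call; a no-op if already cleaned up. */
static void
runtime_cleanup(void)
{
	if (!runtime_initialized)
		return;

	runtime_initialized = false;
	cleanup_calls++;
}

/* Registers the exit hook once; returns 0 on success, -1 on failure. */
static int
runtime_init(void)
{
	if (runtime_initialized)
		return 0;

	if (atexit(runtime_cleanup) != 0)
		return -1;

	runtime_initialized = true;
	return 0;
}
```

The flag check keeps an explicit cleanup followed by the exit hook from running the teardown twice. Two caveats for a hook like this: atexit() handlers do not run on _exit() or when the process is killed by a signal, and if the cleanup waits for worker threads, an exit() with threads still spinning would presumably block in the same way as the rte_exit() case described above.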
@@ -122,6 +122,9 @@ rte_service_finalize(void)
 	if (!rte_service_library_initialized)
 		return;
+	rte_service_lcore_reset_all();
+	rte_eal_mp_wait_lcore();
+
 	rte_free(rte_services);
 	rte_free(lcore_states);