[RFC] service: stop lcore threads before 'finalize'

Message ID: f7two9rxjst.fsf@dhcp-25.97.bos.redhat.com
State: RFC
Delegated to: David Marchand

Checks

ci/Intel-compilation: success (Compilation OK)

Commit Message

Aaron Conole Jan. 16, 2020, 7:50 p.m. UTC
I've noticed an occasional segfault from the build system in the
service_autotest and after talking with David (CC'd), it seems like it's
due to the rte_service_finalize deleting the lcore_states object while
active lcores are running.

The below patch is an attempt to solve it by first reassigning all the
lcores back to ROLE_RTE before releasing the memory.  There is probably
a larger question for DPDK proper about actually closing the pending
lcore threads, but that's a separate issue.  I've been running with the
patch for a while, and haven't seen the crash anymore on my system.

Thoughts?  Is it acceptable as-is?
---
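
For context while reading the thread: the failure mode under discussion is a
use-after-free. Below is a simplified sketch of the suspected race, with names
taken from rte_service.c but bodies abbreviated; it is illustrative only, not
the verbatim sources.

/* Service lcore thread, launched by rte_service_lcore_start(). */
static int32_t
rte_service_runner_func(void *arg)
{
	const int lcore = rte_lcore_id();
	struct core_state *cs = &lcore_states[lcore];

	while (cs->runstate == RUNSTATE_RUNNING) {
		/* ... run the services mapped to this lcore ... */
		cs->loops++; /* faults here once lcore_states is freed */
	}
	return 0;
}

/* Main thread, reached from an exit handler during application shutdown. */
void
rte_service_finalize(void)
{
	if (!rte_service_library_initialized)
		return;

	rte_free(rte_services);
	rte_free(lcore_states); /* a service lcore may still be in the loop above */
}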

Comments

David Marchand Jan. 17, 2020, 8:17 a.m. UTC | #1
On Thu, Jan 16, 2020 at 8:50 PM Aaron Conole <aconole@redhat.com> wrote:
>
> I've noticed an occasional segfault from the build system in the
> service_autotest and after talking with David (CC'd), it seems like it's
> due to the rte_service_finalize deleting the lcore_states object while
> active lcores are running.
>
> The below patch is an attempt to solve it by first reassigning all the
> lcores back to ROLE_RTE before releasing the memory.  There is probably
> a larger question for DPDK proper about actually closing the pending
> lcore threads, but that's a separate issue.  I've been running with the
> patch for a while, and haven't seen the crash anymore on my system.
>
> Thoughts?  Is it acceptable as-is?

Added this patch to my env, still reproducing the same issue after ~10-20 tries.
I added a breakpoint on service_lcore_uninit, which is indeed hit
when exiting the test application (just wanted to make sure your
change was in my binary).


To reproduce:

I modified app/test/meson.build to use an explicit "-l 0-1" and
compiled with your patch.
Then, I started a dummy busyloop "while true; do true; done" in a
shell that I had pinned to core 1 (taskset -pc 1 $$).
Finally, started another shell (as root), pinned to cores 0-1 on my
laptop (taskset -pc 0,1 $$) and ran meson test --gdb  --repeat=10000
service_autotest

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff4922700 (LWP 8572)]
rte_service_runner_func (arg=<optimized out>) at
../lib/librte_eal/common/rte_service.c:458
458            cs->loops++;
A debugging session is active.

    Inferior 1 [process 8566] will be killed.

Quit anyway? (y or n) n
Not confirmed.
Missing separate debuginfos, use: debuginfo-install
elfutils-libelf-0.172-2.el7.x86_64 glibc-2.17-260.el7_6.6.x86_64
libgcc-4.8.5-36.el7_6.2.x86_64 libibverbs-17.2-3.el7.x86_64
libnl3-3.2.28-4.el7.x86_64 libpcap-1.5.3-11.el7.x86_64
numactl-libs-2.0.9-7.el7.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64
zlib-1.2.7-18.el7.x86_64
(gdb) info threads
  Id   Target Id         Frame
* 4    Thread 0x7ffff4922700 (LWP 8572) "lcore-slave-1"
rte_service_runner_func (arg=<optimized out>) at
../lib/librte_eal/common/rte_service.c:458
  3    Thread 0x7ffff5123700 (LWP 8571) "rte_mp_handle"
0x00007ffff63a4b4d in recvmsg () from /lib64/libpthread.so.0
  2    Thread 0x7ffff5924700 (LWP 8570) "eal-intr-thread"
0x00007ffff60c7603 in epoll_wait () from /lib64/libc.so.6
  1    Thread 0x7ffff7fd2c00 (LWP 8566) "dpdk-test" 0x00007ffff7deb96f
in _dl_name_match_p () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0  rte_service_runner_func (arg=<optimized out>) at
../lib/librte_eal/common/rte_service.c:458
#1  0x0000000000b2c84f in eal_thread_loop (arg=<optimized out>) at
../lib/librte_eal/linux/eal/eal_thread.c:153
#2  0x00007ffff639ddd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff60c702d in clone () from /lib64/libc.so.6
(gdb) f 0
#0  rte_service_runner_func (arg=<optimized out>) at
../lib/librte_eal/common/rte_service.c:458
458            cs->loops++;
(gdb) p *cs
$1 = {service_mask = 0, runstate = 0 '\000', is_service_core = 0
'\000', service_active_on_lcore = '\000' <repeats 63 times>, loops =
0, calls_per_service = {0 <repeats 64 times>}}
(gdb) p lcore_config[1]
$2 = {thread_id = 140737296606976, pipe_master2slave = {14, 20},
pipe_slave2master = {21, 22}, f = 0xb26ec0 <rte_service_runner_func>,
arg = 0x0, ret = 0, state = RUNNING, socket_id = 0, core_id = 1,
  core_index = 1, core_role = 0 '\000', detected = 1 '\001', cpuset =
{__bits = {2, 0 <repeats 15 times>}}}
(gdb) p lcore_config[0]
$3 = {thread_id = 0, pipe_master2slave = {0, 0}, pipe_slave2master =
{0, 0}, f = 0x0, arg = 0x0, ret = 0, state = WAIT, socket_id = 0,
core_id = 0, core_index = 0, core_role = 0 '\000', detected = 1
'\001',
  cpuset = {__bits = {1, 0 <repeats 15 times>}}}

(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fd2c00 (LWP 8566))]
#0  0x00007ffff7deb96f in _dl_name_match_p () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0  0x00007ffff7deb96f in _dl_name_match_p () from /lib64/ld-linux-x86-64.so.2
#1  0x00007ffff7de4756 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
#2  0x00007ffff7de4fcf in _dl_lookup_symbol_x () from
/lib64/ld-linux-x86-64.so.2
#3  0x00007ffff7de9d1e in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#4  0x00007ffff7df19da in _dl_runtime_resolve_xsavec () from
/lib64/ld-linux-x86-64.so.2
#5  0x00007ffff7deafba in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff6002c29 in __run_exit_handlers () from /lib64/libc.so.6
#7  0x00007ffff6002c77 in exit () from /lib64/libc.so.6
#8  0x00007ffff5feb49c in __libc_start_main () from /lib64/libc.so.6
#9  0x00000000004fa126 in _start ()


--
David Marchand
David Marchand Feb. 4, 2020, 1:34 p.m. UTC | #2
On Fri, Jan 17, 2020 at 9:17 AM David Marchand
<david.marchand@redhat.com> wrote:
>
> On Thu, Jan 16, 2020 at 8:50 PM Aaron Conole <aconole@redhat.com> wrote:
> >
> > I've noticed an occasional segfault from the build system in the
> > service_autotest and after talking with David (CC'd), it seems like it's
> > due to the rte_service_finalize deleting the lcore_states object while
> > active lcores are running.
> >
> > The below patch is an attempt to solve it by first reassigning all the
> > lcores back to ROLE_RTE before releasing the memory.  There is probably
> > a larger question for DPDK proper about actually closing the pending
> > lcore threads, but that's a separate issue.  I've been running with the
> > patch for a while, and haven't seen the crash anymore on my system.
> >
> > Thoughts?  Is it acceptable as-is?
>
> Added this patch to my env, still reproducing the same issue after ~10-20 tries.
> I added a breakpoint on service_lcore_uninit, which is indeed hit
> when exiting the test application (just wanted to make sure your
> change was in my binary).

Harry,

We need a fix for this issue.

Interestingly, Stephen's patch that joins all pthreads at
rte_eal_cleanup [1] makes this issue disappear.
So my understanding is that we are missing an API (well, I could not
find a way) to synchronously stop service lcores.


1: https://patchwork.dpdk.org/patch/64201/
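
For illustration, with the API available at the time, the closest an
application (as opposed to the library) can get to a synchronous stop is to
stop the service and the core, then join the worker through the EAL. A
minimal, untested sketch, assuming a single service id and a core previously
set up with rte_service_lcore_add() and rte_service_lcore_start():

#include <rte_service.h>
#include <rte_launch.h>
#include <rte_pause.h>

static void
stop_service_lcore(uint32_t service_id, uint32_t lcore_id)
{
	/* Stop the service first; otherwise rte_service_lcore_stop()
	 * below keeps returning -EBUSY. */
	rte_service_runstate_set(service_id, 0);

	/* -EBUSY while a mapped service is still marked as running. */
	while (rte_service_lcore_stop(lcore_id) == -EBUSY)
		rte_pause();

	/* Join: wait for the worker thread to go back to WAIT state. */
	rte_eal_wait_lcore(lcore_id);

	/* Return the core to ROLE_RTE. */
	rte_service_lcore_del(lcore_id);
}

The catch, as this thread shows, is that rte_service_finalize() cannot assume
the application did any of this, so the library needs its own synchronization
on the exit path.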
Aaron Conole Feb. 4, 2020, 2:50 p.m. UTC | #3
David Marchand <david.marchand@redhat.com> writes:

> On Fri, Jan 17, 2020 at 9:17 AM David Marchand
> <david.marchand@redhat.com> wrote:
>>
>> On Thu, Jan 16, 2020 at 8:50 PM Aaron Conole <aconole@redhat.com> wrote:
>> >
>> > I've noticed an occasional segfault from the build system in the
>> > service_autotest and after talking with David (CC'd), it seems like it's
>> > due to the rte_service_finalize deleting the lcore_states object while
>> > active lcores are running.
>> >
>> > The below patch is an attempt to solve it by first reassigning all the
>> > lcores back to ROLE_RTE before releasing the memory.  There is probably
>> > a larger question for DPDK proper about actually closing the pending
>> > lcore threads, but that's a separate issue.  I've been running with the
>> > patch for a while, and haven't seen the crash anymore on my system.
>> >
>> > Thoughts?  Is it acceptable as-is?
>>
>> Added this patch to my env, still reproducing the same issue after ~10-20 tries.
>> I added a breakpoint on service_lcore_uninit, which is indeed hit
>> when exiting the test application (just wanted to make sure your
>> change was in my binary).
>
> Harry,
>
> We need a fix for this issue.

+1

> Interestingly, Stephen's patch that joins all pthreads at
> rte_eal_cleanup [1] makes this issue disappear.
> So my understanding is that we are missing an API (well, I could not
> find a way) to synchronously stop service lcores.

Maybe we can take that patch as a fix.  I hate to see this segfault
in the field.  I need to figure out what I missed in my cleanup
(probably missed a synchronization point).

>
> 1: https://patchwork.dpdk.org/patch/64201/
Van Haaren, Harry Feb. 10, 2020, 2:16 p.m. UTC | #4
> -----Original Message-----
> From: Aaron Conole <aconole@redhat.com>
> Sent: Tuesday, February 4, 2020 2:51 PM
> To: David Marchand <david.marchand@redhat.com>
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev <dev@dpdk.org>
> Subject: Re: [RFC] service: stop lcore threads before 'finalize'
> 
> David Marchand <david.marchand@redhat.com> writes:
> 
> > On Fri, Jan 17, 2020 at 9:17 AM David Marchand
> > <david.marchand@redhat.com> wrote:
> >>
> >> On Thu, Jan 16, 2020 at 8:50 PM Aaron Conole <aconole@redhat.com> wrote:
> >> >
> >> > I've noticed an occasional segfault from the build system in the
> >> > service_autotest and after talking with David (CC'd), it seems like it's
> >> > due to the rte_service_finalize deleting the lcore_states object while
> >> > active lcores are running.
> >> >
> >> > The below patch is an attempt to solve it by first reassigning all the
> >> > lcores back to ROLE_RTE before releasing the memory.  There is probably
> >> > a larger question for DPDK proper about actually closing the pending
> >> > lcore threads, but that's a separate issue.  I've been running with the
> >> > patch for a while, and haven't seen the crash anymore on my system.
> >> >
> >> > Thoughts?  Is it acceptable as-is?
> >>
> >> Added this patch to my env, still reproducing the same issue after ~10-20 tries.
> >> I added a breakpoint on service_lcore_uninit, which is indeed hit
> >> when exiting the test application (just wanted to make sure your
> >> change was in my binary).
> >
> > Harry,
> >
> > We need a fix for this issue.
> 
> +1

Hi All,

> > Interestingly, Stephen's patch that joins all pthreads at
> > rte_eal_cleanup [1] makes this issue disappear.
> > So my understanding is that we are missing an API (well, I could not
> > find a way) to synchronously stop service lcores.
> 
> Maybe we can take that patch as a fix.  I hate to see this segfault
> in the field.  I need to figure out what I missed in my cleanup
> (probably missed a synchronization point).

I haven't easily reproduced this yet - so I'll investigate a way to 
reproduce with close to 100% rate, then we can identify the root cause
and actually get a clean fix. If you have pointers to reproduce easily,
please let me know.

-H

> > 1: https://patchwork.dpdk.org/patch/64201/
David Marchand Feb. 10, 2020, 2:42 p.m. UTC | #5
On Mon, Feb 10, 2020 at 3:16 PM Van Haaren, Harry
<harry.van.haaren@intel.com> wrote:
> I haven't easily reproduced this yet - so I'll investigate a way to
> reproduce with close to 100% rate, then we can identify the root cause
> and actually get a clean fix. If you have pointers to reproduce easily,
> please let me know.

- In shell #1:

$ git reset --hard v20.02-rc2
HEAD is now at 2636c2a23 version: 20.02-rc2
$ rm -rf build

$ git diff
diff --git a/app/test/meson.build b/app/test/meson.build
index 3675ffb5c..23c00a618 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -400,7 +400,7 @@ timeout_seconds = 600
 timeout_seconds_fast = 10

 get_coremask = find_program('get-coremask.sh')
-num_cores_arg = '-l ' + run_command(get_coremask).stdout().strip()
+num_cores_arg = '-l 0,1'

 test_args = [num_cores_arg]
 foreach arg : fast_test_names

$ meson --werror --buildtype=debugoptimized build
The Meson build system
Version: 0.47.2
Source dir: /home/dmarchan/dpdk
Build dir: /home/dmarchan/dpdk/build
Build type: native build
Program cat found: YES (/usr/bin/cat)
Project name: DPDK
Project version: 20.02.0-rc2
...

$ ninja-build -C build
ninja: Entering directory `build'
[2081/2081] Linking target app/test/dpdk-test.

$ taskset -pc 1 $$
pid 11143's current affinity list: 0-7
pid 11143's new affinity list: 1

$ while true; do true; done


- Now, in shell #2, as root:

# taskset -pc 0,1 $$
pid 22233's current affinity list: 0-7
pid 22233's new affinity list: 0,1

# meson test --gdb  --repeat=10000 service_autotest
...

 + ------------------------------------------------------- +
 + Test Suite Summary
 + Tests Total :       16
 + Tests Skipped :      3
 + Tests Executed :    16
 + Tests Unsupported:   0
 + Tests Passed :      13
 + Tests Failed :       0
 + ------------------------------------------------------- +

Test OK
RTE>>
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff4922700 (LWP 31194)]
rte_service_runner_func (arg=<optimized out>) at
../lib/librte_eal/common/rte_service.c:453
453            cs->loops++;
A debugging session is active.

    Inferior 1 [process 31187] will be killed.

Quit anyway? (y or n)


I get the crash in like 30s, often less.
In my test right now, I got the crash on the 3rd try.
David Marchand Feb. 20, 2020, 1:25 p.m. UTC | #6
On Mon, Feb 10, 2020 at 3:16 PM Van Haaren, Harry
<harry.van.haaren@intel.com> wrote:
> > > We need a fix for this issue.
> >
> > +1
>
> > > Interestingly, Stephen's patch that joins all pthreads at
> > > rte_eal_cleanup [1] makes this issue disappear.
> > > So my understanding is that we are missing an API (well, I could not
> > > find a way) to synchronously stop service lcores.
> >
> > Maybe we can take that patch as a fix.  I hate to see this segfault
> > in the field.  I need to figure out what I missed in my cleanup
> > (probably missed a synchronization point).
>
> I haven't easily reproduced this yet - so I'll investigate a way to
> reproduce with close to 100% rate, then we can identify the root cause
> and actually get a clean fix. If you have pointers to reproduce easily,
> please let me know.
>

ping.
I want a fix in 20.05, or I will start considering how to drop this thing.

Patch
diff mbox series

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 7e537b8cd2..7d13287bee 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -71,6 +71,8 @@  static struct rte_service_spec_impl *rte_services;
 static struct core_state *lcore_states;
 static uint32_t rte_service_library_initialized;
 
+static void service_lcore_uninit(void);
+
 int32_t
 rte_service_init(void)
 {
@@ -122,6 +124,9 @@  rte_service_finalize(void)
 	if (!rte_service_library_initialized)
 		return;
 
+	/* Ensure that all service threads are returned to the ROLE_RTE
+	 */
+	service_lcore_uninit();
 	rte_free(rte_services);
 	rte_free(lcore_states);
 
@@ -897,3 +902,14 @@  rte_service_dump(FILE *f, uint32_t id)
 
 	return 0;
 }
+
+static void service_lcore_uninit(void)
+{
+	unsigned lcore_id;
+	RTE_LCORE_FOREACH(lcore_id) {
+		if (!lcore_states[lcore_id].is_service_core)
+			continue;
+
+		while (rte_service_lcore_del(lcore_id) == -EBUSY);
+	}
+}
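
A note on the helper above: rte_service_lcore_del() only succeeds once the
core is stopped, and it returns the core to ROLE_RTE, but it does not wait for
the worker thread to actually leave rte_service_runner_func(). A thread still
mid-iteration can therefore race with the rte_free(lcore_states) that follows,
which is consistent with the crash still reproducing with this patch applied,
and with Stephen's pthread-join patch making it disappear. A hedged sketch of
a join-based variant (illustrative only, not a tested fix):

static void service_lcore_uninit(void)
{
	unsigned int lcore_id;

	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
		if (!lcore_states[lcore_id].is_service_core)
			continue;

		while (rte_service_lcore_del(lcore_id) == -EBUSY)
			rte_pause();

		/* Join the worker so it can no longer touch lcore_states
		 * once rte_service_finalize() frees it. */
		rte_eal_wait_lcore(lcore_id);
	}
}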