[v3,0/6] support oops handling

Message ID 20210906041732.1019743-1-jerinj@marvell.com (mailing list archive)

Jerin Jacob Kollanukkaran Sept. 6, 2021, 4:17 a.m. UTC
  From: Jerin Jacob <jerinj@marvell.com>

v3:

- Updated the release notes
- Introduce the "--no-oops" EAL option to disable the default EAL handler.
  The default EAL oops handler stores the existing handler and invokes it
  after decoding, so there may be no explicit use case for this option; it
  is added only to give the application control. This takes a similar
  approach to telemetry, which is enabled by default to avoid updating all
  the existing applications.
- Change oops_print to fprintf, as rte_log is not safe from a fault handler (Stephen)
- Remove "sig" from signal_db as it is a duplicate (Stephen)
- Add const to mem32_dump (Stephen)
- Add const to oops_signals[] (Stephen)
	
v2:
- Fix powerpc build (David Christensen)

It is handy to get detailed OOPS information, as the Linux kernel provides,
when a DPDK application crashes, without losing any of the features
provided by the coredump infrastructure of the OS.

This patch series introduces the APIs to handle OOPS in DPDK.

The following section details the implementation and the API exposed to the
application.

On rte_eal_init() invocation, if --no-oops is not provided in the EAL
command line arguments, the EAL library installs the oops handler for
the essential signals.
The rte_oops_signals_enabled() API provides the list
of signals for which the EAL installed the handler.

The default EAL oops handler decodes the oops message using rte_oops_decode()
and then calls the signal handler that the application installed
before invoking rte_eal_init(). This scheme also preserves the
default coredump handling (for gdb etc.) provided by the OS
if the application does not install any specific signal handler.

In the second case, where the application installs its signal handler after
the rte_eal_init() invocation, rte_oops_decode() provides the means of
decoding the oops message in the application's fault handler.
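
The two cases above can be illustrated with a minimal sketch that uses only
POSIX sigaction(). It is not the code from this series: every sketch_* name
is hypothetical, and sketch_decode() is only a stand-in for the decode step,
not the real rte_oops_decode(). The sketch shows a handler that saves the
previously installed action, decodes, and then chains to the saved handler
(or restores the OS default and re-raises so that a core dump is still
produced), plus an application handler that calls the decode step itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_NSIG 65 /* assumption: covers Linux signal numbers 1..64 */

static struct sigaction sketch_prev[SKETCH_NSIG]; /* actions saved at install */
volatile sig_atomic_t sketch_decoded;   /* number of times decode ran */
volatile sig_atomic_t sketch_app_calls; /* number of times an app handler ran */

/* Hypothetical stand-in for the decode step; NOT the real rte_oops_decode(). */
void sketch_decode(int sig, const siginfo_t *info)
{
	/* fprintf is not strictly async-signal-safe; acceptable for a sketch. */
	fprintf(stderr, "Signal number: %d\n", sig);
	if (info != NULL && (sig == SIGSEGV || sig == SIGBUS))
		fprintf(stderr, "Fault address: %p\n", info->si_addr);
	sketch_decoded++;
}

/* Case 1: decode first, then chain to whatever was installed before. */
void sketch_default_handler(int sig, siginfo_t *info, void *uctx)
{
	const struct sigaction *prev = &sketch_prev[sig];

	sketch_decode(sig, info);
	if (prev->sa_flags & SA_SIGINFO) {
		prev->sa_sigaction(sig, info, uctx);
	} else if (prev->sa_handler == SIG_DFL) {
		/* Restore the OS default and re-raise to keep the core dump. */
		sigaction(sig, prev, NULL);
		raise(sig);
	} else if (prev->sa_handler != SIG_IGN) {
		prev->sa_handler(sig);
	}
}

/* Install the chaining handler and remember the previous action. */
int sketch_install(int sig)
{
	struct sigaction sa;

	if (sig <= 0 || sig >= SKETCH_NSIG)
		return -1;
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sketch_default_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(sig, &sa, &sketch_prev[sig]);
}

/* An application handler registered before sketch_install() (case 1). */
void sketch_app_handler(int sig)
{
	(void)sig;
	sketch_app_calls++;
}

/* Case 2: a handler installed afterwards calls the decode step itself. */
void sketch_app_fault_handler(int sig, siginfo_t *info, void *uctx)
{
	(void)uctx;
	sketch_decode(sig, info);
	sketch_app_calls++;
}
```

In this sketch an application that registered sketch_app_handler() first and
then called sketch_install() for the same signal would see the decoded output
followed by its own handler. With no prior handler, the default action of the
OS runs after the decode.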


Patch split:

Patch 1/6: Defines the API and a stub implementation for Unix systems
Patch 2/6: The API implementation
Patch 3/6: Adds an optional libunwind dependency to DPDK for better backtrace in oops
Patch 4/6: x86-specific arch info, such as the x86 register dump on oops
Patch 5/6: arm64-specific arch info, such as the arm64 register dump on oops
Patch 6/6: Unit tests for the new APIs
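
For readers unfamiliar with capturing a backtrace from inside a process, the
idea behind patch 3/6 can be sketched as follows. This sketch deliberately
swaps in glibc's backtrace() from <execinfo.h> instead of libunwind, so it
has no extra dependency; it is an illustration only and not what the series
implements. sketch_dump_backtrace() is a hypothetical name.

```c
#include <assert.h>
#include <execinfo.h>
#include <unistd.h>

#define SKETCH_MAX_FRAMES 64

/* Write the current call stack to fd; returns the number of frames captured. */
int sketch_dump_backtrace(int fd)
{
	void *frames[SKETCH_MAX_FRAMES];
	int depth = backtrace(frames, SKETCH_MAX_FRAMES);

	/* backtrace_symbols_fd() avoids malloc, unlike backtrace_symbols(). */
	backtrace_symbols_fd(frames, depth, fd);
	return depth;
}
```

With glibc, function names generally appear in this output only when the
binary exports its symbols (for example, when linked with -rdynamic).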


Example build and run commands, and the resulting output log, on an x86-64 Linux machine:

meson --buildtype debug build
ninja -C build

echo "oops_autotest" | ./build/app/test/dpdk-test --no-huge  -c 0x2

Signal info:
------------
PID:           2439496
Signal number: 11
Fault address: 0x5

Backtrace:
----------
[  0x55e8b56d5cee]: test_oops_generate()+0x75
[  0x55e8b5459843]: unit_test_suite_runner()+0x1aa
[  0x55e8b56d605c]: test_oops()+0x13
[  0x55e8b544bdfc]: cmd_autotest_parsed()+0x55
[  0x55e8b6063a0d]: cmdline_parse()+0x319
[  0x55e8b6061dea]: cmdline_valid_buffer()+0x35
[  0x55e8b6066bd8]: rdline_char_in()+0xc48
[  0x55e8b606221c]: cmdline_in()+0x62
[  0x55e8b6062495]: cmdline_interact()+0x56
[  0x55e8b5459314]: main()+0x65e
[  0x7f54b25d2b25]: __libc_start_main()+0xd5
[  0x55e8b544bc9e]: _start()+0x2e

Arch info:
----------
R8 : 0x0000000000000000  R9 : 0x0000000000000000
R10: 0x00007f54b25b8b48  R11: 0x00007f54b25e7930
R12: 0x00007fffc695e610  R13: 0x0000000000000000
R14: 0x0000000000000000  R15: 0x0000000000000000
RAX: 0x0000000000000005  RBX: 0x0000000000000001
RCX: 0x00007f54b278a943  RDX: 0x3769043bf13a2594
RBP: 0x00007fffc6958340  RSP: 0x00007fffc6958330
RSI: 0x0000000000000000  RDI: 0x000055e8c4c1e380
RIP: 0x000055e8b56d5cee  EFL: 0x0000000000010246

Stack dump:
----------
0x7fffc6958330: 0x6000000
0x7fffc6958334: 0x0
0x7fffc6958338: 0x30cfeac5
0x7fffc695833c: 0x0
0x7fffc6958340: 0xe08395c6
0x7fffc6958344: 0xff7f0000
0x7fffc6958348: 0x439845b5
0x7fffc695834c: 0xe8550000
0x7fffc6958350: 0x0
0x7fffc6958354: 0xb000000
0x7fffc6958358: 0x20445bb9
0x7fffc695835c: 0xe8550000
0x7fffc6958360: 0x925506b6
0x7fffc6958364: 0x0
0x7fffc6958368: 0x0
0x7fffc695836c: 0x0

Code dump:
----------
0x55e8b56d5cee: 0xc7000000
0x55e8b56d5cf2: 0xeb12
0x55e8b56d5cf6: 0xfb6054b
0x55e8b56d5cfa: 0x87540f84
0x55e8b56d5cfe: 0xc07407b8
0x55e8b56d5d02: 0x0
0x55e8b56d5d06: 0xeb05b8ff
0x55e8b56d5d0a: 0xffffffc9
0x55e8b56d5d0e: 0xc3554889
0x55e8b56d5d12: 0xe54881ec
0x55e8b56d5d16: 0xc0000000
0x55e8b56d5d1a: 0x89bd4cff
0x55e8b56d5d1e: 0xffff4889
0x55e8b56d5d22: 0xb540ffff


Jerin Jacob (6):
  eal: introduce oops handling API
  eal: oops handling API implementation
  eal: support libunwind based backtrace
  eal/x86: support register dump for oops
  eal/arm64: support register dump for oops
  test/oops: support unit test case for oops handling APIs

 .github/workflows/build.yml               |   2 +-
 .travis.yml                               |   2 +-
 app/test/meson.build                      |   2 +
 app/test/test_oops.c                      | 122 +++++++++
 config/meson.build                        |   8 +
 doc/api/doxy-api-index.md                 |   3 +-
 doc/guides/linux_gsg/eal_args.include.rst |   4 +
 doc/guides/rel_notes/release_21_11.rst    |  10 +
 lib/eal/common/eal_common_options.c       |   5 +
 lib/eal/common/eal_internal_cfg.h         |   1 +
 lib/eal/common/eal_options.h              |   2 +
 lib/eal/common/eal_private.h              |   3 +
 lib/eal/freebsd/eal.c                     |   8 +
 lib/eal/include/meson.build               |   1 +
 lib/eal/include/rte_oops.h                | 101 ++++++++
 lib/eal/linux/eal.c                       |   7 +
 lib/eal/unix/eal_oops.c                   | 293 ++++++++++++++++++++++
 lib/eal/unix/meson.build                  |   1 +
 lib/eal/version.map                       |   4 +
 19 files changed, 576 insertions(+), 3 deletions(-)
 create mode 100644 app/test/test_oops.c
 create mode 100644 lib/eal/include/rte_oops.h
 create mode 100644 lib/eal/unix/eal_oops.c
  

Comments

Thomas Monjalon Sept. 21, 2021, 5:30 p.m. UTC | #1
06/09/2021 06:17, jerinj@marvell.com:
> It is handy to get detailed OOPS information like Linux kernel
> when DPDK application crashes without losing any of the features
> provided by coredump infrastructure by the OS.
> 
> This patch series introduces the APIs to handle OOPS in DPDK.

I don't understand how it is related to DPDK.
It looks like something to be handled freely by the application
without DPDK forcing anything.
What is the benefit for other DPDK features?
Which problem is it solving?
  
Jerin Jacob Sept. 21, 2021, 5:54 p.m. UTC | #2
On Tue, Sep 21, 2021 at 11:00 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 06/09/2021 06:17, jerinj@marvell.com:
> > It is handy to get detailed OOPS information like Linux kernel
> > when DPDK application crashes without losing any of the features
> > provided by coredump infrastructure by the OS.
> >
> > This patch series introduces the APIs to handle OOPS in DPDK.
>
> I don't understand how it is related to DPDK.

It abstracts the execution environment/architecture details (see "Arch info"
in the log [1]) so that the fault handler can capture additional debugging
information when a DPDK application faults, just as the kernel prints its
OOPS on a fault.

> It looks something to be handled freely by the application
> without DPDK forcing anything.

This is NOT forcing the application to use the DPDK OOPS handler; instead,
the default handler is used only if it is registered.

Even if the default handler is registered, it invokes the application's
handler if the application has registered a fault handler. So there is no
difference in behavior.

> What is the benefit for other DPDK features?

Could you clarify this question a bit more?

> Which problem is it solving?

A better debug trace on fault for DPDK applications, instead of faulting
with no information.


[1]

Backtrace:
----------
[  0x55e8b56d5cee]: test_oops_generate()+0x75
[  0x55e8b5459843]: unit_test_suite_runner()+0x1aa
[  0x55e8b56d605c]: test_oops()+0x13
[  0x55e8b544bdfc]: cmd_autotest_parsed()+0x55
[  0x55e8b6063a0d]: cmdline_parse()+0x319
[  0x55e8b6061dea]: cmdline_valid_buffer()+0x35
[  0x55e8b6066bd8]: rdline_char_in()+0xc48
[  0x55e8b606221c]: cmdline_in()+0x62
[  0x55e8b6062495]: cmdline_interact()+0x56
[  0x55e8b5459314]: main()+0x65e
[  0x7f54b25d2b25]: __libc_start_main()+0xd5
[  0x55e8b544bc9e]: _start()+0x2e

Arch info:
----------
R8 : 0x0000000000000000  R9 : 0x0000000000000000
R10: 0x00007f54b25b8b48  R11: 0x00007f54b25e7930
R12: 0x00007fffc695e610  R13: 0x0000000000000000
R14: 0x0000000000000000  R15: 0x0000000000000000
RAX: 0x0000000000000005  RBX: 0x0000000000000001
RCX: 0x00007f54b278a943  RDX: 0x3769043bf13a2594
RBP: 0x00007fffc6958340  RSP: 0x00007fffc6958330
RSI: 0x0000000000000000  RDI: 0x000055e8c4c1e380
RIP: 0x000055e8b56d5cee  EFL: 0x0000000000010246

Stack dump:
----------
0x7fffc6958330: 0x6000000
0x7fffc6958334: 0x0
0x7fffc6958338: 0x30cfeac5
0x7fffc695833c: 0x0
0x7fffc6958340: 0xe08395c6
0x7fffc6958344: 0xff7f0000
0x7fffc6958348: 0x439845b5
0x7fffc695834c: 0xe8550000
0x7fffc6958350: 0x0
0x7fffc6958354: 0xb000000
0x7fffc6958358: 0x20445bb9
0x7fffc695835c: 0xe8550000
0x7fffc6958360: 0x925506b6
0x7fffc6958364: 0x0
0x7fffc6958368: 0x0
0x7fffc695836c: 0x0

Code dump:
----------
0x55e8b56d5cee: 0xc7000000
0x55e8b56d5cf2: 0xeb12
0x55e8b56d5cf6: 0xfb6054b
0x55e8b56d5cfa: 0x87540f84
0x55e8b56d5cfe: 0xc07407b8
0x55e8b56d5d02: 0x0
0x55e8b56d5d06: 0xeb05b8ff
0x55e8b56d5d0a: 0xffffffc9
0x55e8b56d5d0e: 0xc3554889
0x55e8b56d5d12: 0xe54881ec
0x55e8b56d5d16: 0xc0000000
0x55e8b56d5d1a: 0x89bd4cff
0x55e8b56d5d1e: 0xffff4889
0x55e8b56d5d22: 0xb540ffff
>
  
Thomas Monjalon Sept. 22, 2021, 7:34 a.m. UTC | #3
21/09/2021 19:54, Jerin Jacob:
> On Tue, Sep 21, 2021 at 11:00 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >
> > 06/09/2021 06:17, jerinj@marvell.com:
> > > It is handy to get detailed OOPS information like Linux kernel
> > > when DPDK application crashes without losing any of the features
> > > provided by coredump infrastructure by the OS.
> > >
> > > This patch series introduces the APIs to handle OOPS in DPDK.
> >
> > I don't understand how it is related to DPDK.
> 
> It abstracts the execution environment/architecture(See Arch Info in
> log)[1] details to capture
> details on fault handlers to enable additional details on fault from
> DPDK application for
> additional debugging information. Just like Kernel prints its OOPS on fault.

Not sure it is a good direction to achieve the same features as a kernel.
In recent years, the idea was to make DPDK a focused library.

> > It looks something to be handled freely by the application
> > without DPDK forcing anything.
> 
> This NOT enforcing application to use DPDK OOPS handler, instead, if
> registered then
> it uses the default handler.
> 
> Even if the default handler is registered it invokes the application
> handler if the application registers
> the fault handler. So there is not difference in behavior.

OK

> > What is the benefit for other DPDK features?
> 
> Could you clarify this question a bit more?

I mean is it used by other parts of DPDK, or just a standalone feature?

> > Which problem is it solving?
> 
> Better debug trace on fault for DPDK application. Instead of faulting
> with no information.

It does not look to be in the scope of DPDK, or I miss something.
  
Jerin Jacob Sept. 22, 2021, 8:03 a.m. UTC | #4
On Wed, Sep 22, 2021 at 1:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 21/09/2021 19:54, Jerin Jacob:
> > On Tue, Sep 21, 2021 at 11:00 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > >
> > > 06/09/2021 06:17, jerinj@marvell.com:
> > > > It is handy to get detailed OOPS information like Linux kernel
> > > > when DPDK application crashes without losing any of the features
> > > > provided by coredump infrastructure by the OS.
> > > >
> > > > This patch series introduces the APIs to handle OOPS in DPDK.
> > >
> > > I don't understand how it is related to DPDK.
> >
> > It abstracts the execution environment/architecture(See Arch Info in
> > log)[1] details to capture
> > details on fault handlers to enable additional details on fault from
> > DPDK application for
> > additional debugging information. Just like Kernel prints its OOPS on fault.
>
> Not sure it is a good direction to achieve the same features as a kernel.

I just gave an example, that kernel has this feature and DPDK does not have it.
And it is good for DPDK applications.

Any specific point where you think this feature is not good for DPDK
in-tree and out-of-tree applications?

> In recent years, the idea was to make DPDK a focused library.

Not sure how this feature deviates from that. See below, on
libunwind library usage.

>
> > > It looks something to be handled freely by the application
> > > without DPDK forcing anything.
> >
> > This NOT enforcing application to use DPDK OOPS handler, instead, if
> > registered then
> > it uses the default handler.
> >
> > Even if the default handler is registered it invokes the application
> > handler if the application registers
> > the fault handler. So there is not difference in behavior.
>
> OK
>
> > > What is the benefit for other DPDK features?
> >
> > Could you clarify this question a bit more?
>
> I mean is it used by other parts of DPDK, or just a standalone feature?

It is a standalone feature in EAL. It can produce a crash dump for any
internal library that segfaults. The default handler can be extended if
we need more information specific to DPDK libraries (for example, BPF).

>
> > > Which problem is it solving?
> >
> > Better debug trace on fault for DPDK application. Instead of faulting
> > with no information.
>
> It does not look to be in the scope of DPDK, or I miss something.

I think it is, like we have APIs for creating control threads in EAL.

Also, this feature uses libunwind as an optional dependency, so we are
not duplicating the effort of any other library; we are only integrating
everything, including the arch-specific bits, in EAL to provide a feature
that serves DPDK applications better.

>
>
  
Thomas Monjalon Sept. 22, 2021, 8:33 a.m. UTC | #5
22/09/2021 10:03, Jerin Jacob:
> On Wed, Sep 22, 2021 at 1:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > 21/09/2021 19:54, Jerin Jacob:
> > > On Tue, Sep 21, 2021 at 11:00 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > > 06/09/2021 06:17, jerinj@marvell.com:
> > > > > It is handy to get detailed OOPS information like Linux kernel
> > > > > when DPDK application crashes without losing any of the features
> > > > > provided by coredump infrastructure by the OS.
> > > > >
> > > > > This patch series introduces the APIs to handle OOPS in DPDK.
> > > >
> > > > I don't understand how it is related to DPDK.
> > >
> > > It abstracts the execution environment/architecture(See Arch Info in
> > > log)[1] details to capture
> > > details on fault handlers to enable additional details on fault from
> > > DPDK application for
> > > additional debugging information. Just like Kernel prints its OOPS on fault.
> >
> > Not sure it is a good direction to achieve the same features as a kernel.
> 
> I just gave an example, that kernel has this feature and DPDK does not have it.
> And it is good for DPDK applications.
> 
> Any specific point where you think this feature is not good for DPDK
> in-tree and out of tree applications?

Nothing specific. Just a fear that we make life more complex for some users,
because there are always bugs and unplanned side effects.

> > In recent years, the idea was to make DPDK a focused library.
> 
> Not sure how this feature is not deviating from that. See below, on
> libunwind library usage.
> 
> >
> > > > It looks something to be handled freely by the application
> > > > without DPDK forcing anything.
> > >
> > > This NOT enforcing application to use DPDK OOPS handler, instead, if
> > > registered then
> > > it uses the default handler.
> > >
> > > Even if the default handler is registered it invokes the application
> > > handler if the application registers
> > > the fault handler. So there is not difference in behavior.
> >
> > OK
> >
> > > > What is the benefit for other DPDK features?
> > >
> > > Could you clarify this question a bit more?
> >
> > I mean is it used by other parts of DPDK, or just a standalone feature?
> 
> Standalone feature in EAL. It can get a crash dump from any internal
> library if it segfaults.
> Default handler can be extended if we need more information specific
> to DPDK libraries if need
> (For example BPF etc)
> 
> >
> > > > Which problem is it solving?
> > >
> > > Better debug trace on fault for DPDK application. Instead of faulting
> > > with no information.
> >
> > It does not look to be in the scope of DPDK, or I miss something.
> 
> I think it is, like we have APIs for creating control threads in EAL.
> 
> Also, This feature is dependent on libunwind as an optional dependency.
> So we are not duplicating any other library effort just that integrating
> all together including arch specific bits in EAL to have a feature for
> better DPDK application usage.

That's a difficult decision. We need more opinions.
We may also discuss it in the techboard meeting today.
  
Jerin Jacob Sept. 22, 2021, 8:49 a.m. UTC | #6
On Wed, Sep 22, 2021 at 2:03 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 22/09/2021 10:03, Jerin Jacob:
> > On Wed, Sep 22, 2021 at 1:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > 21/09/2021 19:54, Jerin Jacob:
> > > > On Tue, Sep 21, 2021 at 11:00 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > > > 06/09/2021 06:17, jerinj@marvell.com:
> > > > > > It is handy to get detailed OOPS information like Linux kernel
> > > > > > when DPDK application crashes without losing any of the features
> > > > > > provided by coredump infrastructure by the OS.
> > > > > >
> > > > > > This patch series introduces the APIs to handle OOPS in DPDK.
> > > > >
> > > > > I don't understand how it is related to DPDK.
> > > >
> > > > It abstracts the execution environment/architecture(See Arch Info in
> > > > log)[1] details to capture
> > > > details on fault handlers to enable additional details on fault from
> > > > DPDK application for
> > > > additional debugging information. Just like Kernel prints its OOPS on fault.
> > >
> > > Not sure it is a good direction to achieve the same features as a kernel.
> >
> > I just gave an example, that kernel has this feature and DPDK does not have it.
> > And it is good for DPDK applications.
> >
> > Any specific point where you think this feature is not good for DPDK
> > in-tree and out of tree applications?
>
> No specific. Just a fear we make life more complex for some users,
> because there are always bugs and unplanned side effects.

OK. That's more of a non-technical concern.

I have provided an EAL switch to disable this feature, just as telemetry
has a disable option as an EAL argument. It can be used for this purpose.

>
> > > In recent years, the idea was to make DPDK a focused library.
> >
> > Not sure how this feature is not deviating from that. See below, on
> > libunwind library usage.
> >
> > >
> > > > > It looks something to be handled freely by the application
> > > > > without DPDK forcing anything.
> > > >
> > > > This NOT enforcing application to use DPDK OOPS handler, instead, if
> > > > registered then
> > > > it uses the default handler.
> > > >
> > > > Even if the default handler is registered it invokes the application
> > > > handler if the application registers
> > > > the fault handler. So there is not difference in behavior.
> > >
> > > OK
> > >
> > > > > What is the benefit for other DPDK features?
> > > >
> > > > Could you clarify this question a bit more?
> > >
> > > I mean is it used by other parts of DPDK, or just a standalone feature?
> >
> > Standalone feature in EAL. It can get a crash dump from any internal
> > library if it segfaults.
> > Default handler can be extended if we need more information specific
> > to DPDK libraries if need
> > (For example BPF etc)
> >
> > >
> > > > > Which problem is it solving?
> > > >
> > > > Better debug trace on fault for DPDK application. Instead of faulting
> > > > with no information.
> > >
> > > It does not look to be in the scope of DPDK, or I miss something.
> >
> > I think it is, like we have APIs for creating control threads in EAL.
> >
> > Also, This feature is dependent on libunwind as an optional dependency.
> > So we are not duplicating any other library effort just that integrating
> > all together including arch specific bits in EAL to have a feature for
> > better DPDK application usage.
>
> That's a difficult decision. We need more opinions.

Sure.

> We may also discuss it in the techboard meeting today.

Sure.

>
>