[v2,00/15] Unit tests fixes for CI

Message ID 1560580950-16754-1-git-send-email-david.marchand@redhat.com (mailing list archive)

David Marchand June 15, 2019, 6:42 a.m. UTC
This is a joint effort to make the unit tests ready for CI.
The first patches are fixes that I had accumulated.
The second part of the series focuses on skipping tests when some
requirements are not fulfilled, so that we can run them in a constrained
environment like the Travis virtual machines, which give us two cores and
do not have specific hw devices.
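
A minimal sketch of the skip-on-missing-requirement idea described above,
written in Python for illustration only (it is not code from the series).
It assumes the exit status 77, which Meson's test runner reports as a
skipped test; `run_or_skip`, `min_cores` and `available_cores` are
hypothetical names.

```python
import os

EXIT_OK = 0
EXIT_FAIL = 1
EXIT_SKIP = 77  # Meson reports a test that exits with 77 as "skipped"


def run_or_skip(test_body, min_cores, available_cores=None):
    """Run test_body only if enough cores are available, else report a skip."""
    if available_cores is None:
        # os.cpu_count() may return None when the count cannot be determined.
        available_cores = os.cpu_count() or 1
    if available_cores < min_cores:
        return EXIT_SKIP
    return EXIT_OK if test_body() else EXIT_FAIL


# Hypothetical usage: a test that needs 4 cores is skipped on a 2-core VM.
status = run_or_skip(lambda: True, min_cores=4, available_cores=2)
```

Reporting a skip instead of a failure keeps the CI result meaningful on
small machines: a red result then always points at a real problem.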

We are still not ready for enabling those tests in Travis.
At least, the following issues remain:
- some fixes on librte_acl have not been merged yet [1],
- the tests on --file-prefix are still failing, and have been isolated in a
  test that we could disable while waiting for the fixes,
- rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
  than 10s,
- librte_table unit test crashes on ipv6 [2],
- the "perf" tests are taking way too long for my taste,
- the shared build unit tests all fail when depending on mempool since
  the mempool drivers are not loaded.

1: http://patchwork.dpdk.org/project/dpdk/list/?series=4242
2: https://bugs.dpdk.org/show_bug.cgi?id=285

Changelog since v1:
- removed limit on 128 cores in rcu tests,
- reworked Michael's patch on the eal test and started splitting big tests
  into subtests: when a subtest fails, it does not impact the other subtests;
  plus, subtests are shorter to run, so it is easier to make them fit in 10s.
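
The benefit of splitting can be sketched as follows, in Python for
illustration only. This is an assumption about the general principle, not
the mechanism used in the series: `run_subtests`, `budget_s` and the
subtest names are hypothetical.

```python
import time


def run_subtests(subtests, budget_s=10.0):
    """Run each subtest independently; one failure does not stop the rest."""
    results = {}
    for name, func in subtests:
        start = time.monotonic()
        try:
            passed = bool(func())
        except Exception:
            passed = False  # a crashing subtest is recorded, not fatal
        elapsed = time.monotonic() - start
        results[name] = {"passed": passed, "within_budget": elapsed <= budget_s}
    return results


def _failing_subtest():
    raise RuntimeError("simulated failure")


results = run_subtests([
    ("first", lambda: True),
    ("second", _failing_subtest),
    ("third", lambda: True),
])
```

Here "third" still runs and passes although "second" fails, and each
subtest is checked against the 10s budget on its own.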

Comments/reviews welcome!
  

Comments

Bruce Richardson June 17, 2019, 10 a.m. UTC | #1
On Sat, Jun 15, 2019 at 08:42:15AM +0200, David Marchand wrote:
> This is a joint effort to make the unit tests ready for CI.
> The first patches are fixes that I had accumulated.
> Then the second part of the series focuses on skipping tests when some
> requirements are not fulfilled so that we can start them in a restrained
> environment like Travis virtual machines that gives us two cores and does
> not have specific hw devices.
> 
> We are still not ready for enabling those tests in Travis.
> At least, the following issues remain:
> - some fixes on librte_acl have not been merged yet [1],
> - the tests on --file-prefix are still ko, and have been isolated in a
>   test that we could disable while waiting for the fixes,
> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
>   than 10s,
> - librte_table unit test crashes on ipv6 [2],
> - the "perf" tests are taking way too long for my taste,
> - the shared build unit tests all fail when depending on mempool since
>   the mempool drivers are not loaded,
> 

For the autotest app shared builds, it is probably worthwhile linking in
all drivers explicitly to avoid issues like this.

/Bruce
  
David Marchand June 17, 2019, 10:46 a.m. UTC | #2
On Mon, Jun 17, 2019 at 12:02 PM Bruce Richardson <
bruce.richardson@intel.com> wrote:

> On Sat, Jun 15, 2019 at 08:42:15AM +0200, David Marchand wrote:
> > This is a joint effort to make the unit tests ready for CI.
> > The first patches are fixes that I had accumulated.
> > Then the second part of the series focuses on skipping tests when some
> > requirements are not fulfilled so that we can start them in a restrained
> > environment like Travis virtual machines that gives us two cores and does
> > not have specific hw devices.
> >
> > We are still not ready for enabling those tests in Travis.
> > At least, the following issues remain:
> > - some fixes on librte_acl have not been merged yet [1],
> > - the tests on --file-prefix are still ko, and have been isolated in a
> >   test that we could disable while waiting for the fixes,
> > - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
> >   than 10s,
> > - librte_table unit test crashes on ipv6 [2],
> > - the "perf" tests are taking way too long for my taste,
> > - the shared build unit tests all fail when depending on mempool since
> >   the mempool drivers are not loaded,
> >
>
> For the autotest app shared builds, it is probably worthwhile linking in
> all drivers explicitly to avoid issues like this.
>

Yes, I'll look into this.

While at it, do you know why the i40e and ixgbe drivers are linked to
app/test in meson?
  
Bruce Richardson June 17, 2019, 11:17 a.m. UTC | #3
On Mon, Jun 17, 2019 at 12:46:03PM +0200, David Marchand wrote:
>    On Mon, Jun 17, 2019 at 12:02 PM Bruce Richardson
>    <bruce.richardson@intel.com> wrote:
>
>      [...]
>
>    Yes, I'll look into this.
>    While at it, do you know why the i40e and ixgbe drivers are linked to
>    app/test in meson?

There are unit tests for the device-specific functions in those drivers, so
they need to be given at link time.

/Bruce
  
David Marchand June 17, 2019, 11:41 a.m. UTC | #4
On Mon, Jun 17, 2019 at 1:18 PM Bruce Richardson <bruce.richardson@intel.com>
wrote:

> On Mon, Jun 17, 2019 at 12:46:03PM +0200, David Marchand wrote:
> >    On Mon, Jun 17, 2019 at 12:02 PM Bruce Richardson
> >    <bruce.richardson@intel.com> wrote:
> >
> >      [...]
> >
> >    Yes, I'll look into this.
> >    While at it, do you know why the i40e and ixgbe drivers are linked to
> >    app/test in meson?
>
> There are unit tests for the device-specific functions in those drivers, so
> they need to be given at link time.
>

For testpmd, I can understand.

But I can't see code for driver-specific APIs in app/test.
It looks like a copy/paste error from when meson support was added.
  
Bruce Richardson June 17, 2019, 11:56 a.m. UTC | #5
On Mon, Jun 17, 2019 at 01:41:21PM +0200, David Marchand wrote:
>    On Mon, Jun 17, 2019 at 1:18 PM Bruce Richardson
>    <bruce.richardson@intel.com> wrote:
>
>      On Mon, Jun 17, 2019 at 12:46:03PM +0200, David Marchand wrote:
>      >    [...]
>      >    While at it, do you know why the i40e and ixgbe drivers are
>      >    linked to app/test in meson?
>      There are unit tests for the device-specific functions in those
>      drivers, so they need to be given at link time.
>
>    For testpmd, I can understand.
>    But I can't see code for driver specific apis in app/test.
>    It looks like a copy/paste error when adding meson support.
OK, could be. The simple question is: does it still build OK if you remove them?

/Bruce
  
David Marchand June 17, 2019, 1:44 p.m. UTC | #6
On Mon, Jun 17, 2019 at 1:57 PM Bruce Richardson <bruce.richardson@intel.com>
wrote:

> On Mon, Jun 17, 2019 at 01:41:21PM +0200, David Marchand wrote:
> >    On Mon, Jun 17, 2019 at 1:18 PM Bruce Richardson
> >    <bruce.richardson@intel.com> wrote:
> >
> >      [...]
> >      There are unit tests for the device-specific functions in those
> >      drivers, so they need to be given at link time.
> >
> >    For testpmd, I can understand.
> >    But I can't see code for driver specific apis in app/test.
> >    It looks like a copy/paste error when adding meson support.
> Ok, could be. Simple question is does it still build ok if you remove them?
>

It would have been strange if it did not build, since on the Makefile side
we do nothing.
Yes, it builds fine with meson without this.

I managed to get the same test results as with static builds by linking
the skeleton eventdev driver and the mempool sw drivers.
That should be enough.
  
Thomas Monjalon June 27, 2019, 8:36 p.m. UTC | #7
15/06/2019 08:42, David Marchand:
> This is a joint effort to make the unit tests ready for CI.

Applied, thanks

Remaining work below from your list:
[...]
> - the tests on --file-prefix are still ko, and have been isolated in a
>   test that we could disable while waiting for the fixes,
> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
>   than 10s,
> - librte_table unit test crashes on ipv6 [2],
> - the "perf" tests are taking way too long for my taste,
> - the shared build unit tests all fail when depending on mempool since
>   the mempool drivers are not loaded,
> 
> 2: https://bugs.dpdk.org/show_bug.cgi?id=285
  
Aaron Conole July 1, 2019, 4:04 p.m. UTC | #8
Thomas Monjalon <thomas@monjalon.net> writes:

> 15/06/2019 08:42, David Marchand:
>> This is a joint effort to make the unit tests ready for CI.
>
> Applied, thanks
>
> Remaining work below from your list:
> [...]
>> - the tests on --file-prefix are still ko, and have been isolated in a
>>   test that we could disable while waiting for the fixes,

Yes, I think it's good to do that for now.

>> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
>>   than 10s,

Occasionally the distributor test times out as well.  I've moved them as
part of a separate patch, that I'll post along with a bigger series to
enable the unit tests under travis.  Michael and I are leaning toward
introducing a new variable called RUN_TESTS which will do the docs and
unit testing since those combined would add quite a bit to the execution
time of each job (and feel free to bike shed the name, since the patches
aren't final).

>> - librte_table unit test crashes on ipv6 [2],

I guess we're waiting on a patch from Jananee (CC'd)?

>> - the "perf" tests are taking way too long for my taste,

Agreed, so I think we'll disable the perf tests in the CI environment
anyway.

>> - the shared build unit tests all fail when depending on mempool since
>>   the mempool drivers are not loaded,

I think Michael is working on a fix for this right now.

>> 2: https://bugs.dpdk.org/show_bug.cgi?id=285
  
Thomas Monjalon July 1, 2019, 4:22 p.m. UTC | #9
01/07/2019 18:04, Aaron Conole:
> Michael and I are leaning toward
> introducing a new variable called RUN_TESTS which will do the docs and
> unit testing since those combined would add quite a bit to the execution
> time of each job (and feel free to bike shed the name, since the patches
> aren't final).

Could you please send an RFC so we can discuss it before
you do the final patches?
  
David Marchand July 1, 2019, 4:45 p.m. UTC | #10
On Mon, Jul 1, 2019 at 6:04 PM Aaron Conole <aconole@redhat.com> wrote:

> >> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little
> more
> >>   than 10s,
>
> Occasionally the distributor test times out as well.  I've moved them as
> part of a separate patch, that I'll post along with a bigger series to
> enable the unit tests under travis.  Michael and I are leaning toward
> introducing a new variable called RUN_TESTS which will do the docs and
> unit testing since those combined would add quite a bit to the execution
> time of each job (and feel free to bike shed the name, since the patches
> aren't final).
>

Seeing how the distributor autotest usually takes less than a second to
complete, this sounds like a bug.
I don't think I caught this so far.


Yes, we need a variable to control this and select the targets that will do
the tests and/or build the doc.
About the name, RUN_TESTS is ok for me.

What do you want to make of this variable?
Have it as a simple boolean that enables everything? Or a selector with
strings like unit-tests+doc+perf-tests?
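
For illustration, the two options could coexist. The Python sketch below
accepts either a boolean-like value or a "+"-separated selector. The stage
names come from the example string above; everything else
(`parse_run_tests`, the accepted boolean spellings, the error on unknown
names) is an assumption, since the semantics of RUN_TESTS are still open.

```python
KNOWN_STAGES = ("unit-tests", "doc", "perf-tests")


def parse_run_tests(value):
    """Return the set of enabled stages for a RUN_TESTS-style value."""
    value = (value or "").strip().lower()
    if value in ("", "0", "false", "no"):
        return set()
    if value in ("1", "true", "yes"):
        return set(KNOWN_STAGES)  # boolean form: enable everything
    selected = {part for part in value.split("+") if part}
    unknown = selected - set(KNOWN_STAGES)
    if unknown:
        raise ValueError("unknown RUN_TESTS stage(s): " + ", ".join(sorted(unknown)))
    return selected
```

With this reading, RUN_TESTS=unit-tests+doc would run the unit tests and
build the docs while leaving the long perf tests out.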



> >> - librte_table unit test crashes on ipv6 [2],
>
> I guess we're waiting on a patch from Jananee (CC'd)?
>

Yep.
  
Michael Santana July 1, 2019, 6:07 p.m. UTC | #11
> On Mon, Jul 1, 2019 at 6:04 PM Aaron Conole <aconole@redhat.com> wrote:
>>
>> >> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
>> >>   than 10s,
>>
>> Occasionally the distributor test times out as well.  I've moved them as
>> part of a separate patch, that I'll post along with a bigger series to
>> enable the unit tests under travis.  Michael and I are leaning toward
>> introducing a new variable called RUN_TESTS which will do the docs and
>> unit testing since those combined would add quite a bit to the execution
>> time of each job (and feel free to bike shed the name, since the patches
>> aren't final).
>
>
> Seeing how the distributor autotest usually takes less than a second to complete, this sounds like a bug.
> I don't think I caught this so far.

So I actually ran into the distributor test timing out. I agree with
David that it is a bug in the test. Looking at the logs, that test
normally finishes in less than half a second, so running for 10 seconds
and timing out is a big jump in run time. When I hit the timeout,
I restarted the job and it finished with no problem.
The test fails every so often for no good reason and the logs [1] don't
really say much. I speculate that it is waiting for a resource to
become available or, in the worst case, is deadlocked. Since it only
fails every so often and passes when restarted, I don't think it's a
big deal; nevertheless, it's worth investing time in figuring out what's
wrong.

[1] https://api.travis-ci.com/v3/job/212335916/log.txt
>
>
> Yes, we need a variable to control this and select the targets that will do the tests and/or build the doc.
> About the name, RUN_TESTS is ok for me.
>
> What do you want to make of this variable?
> Have it as a simple boolean that enables everything? Or a selector with strings like unit-tests+doc+perf-tests?
>
>
>>
>> >> - librte_table unit test crashes on ipv6 [2],
>>
>> I guess we're waiting on a patch from Jananee (CC'd)?
>
>
> Yep.
>
>
> --
> David Marchand
  
Michael Santana July 9, 2019, 3:50 p.m. UTC | #12
On 7/1/19 2:07 PM, Michael Santana Francisco wrote:
>>
>>
>> On Mon, Jul 1, 2019 at 6:04 PM Aaron Conole <aconole@redhat.com> wrote:
>>>>> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
>>>>>    than 10s,
>>> Occasionally the distributor test times out as well.  I've moved them as
>>> part of a separate patch, that I'll post along with a bigger series to
>>> enable the unit tests under travis.  Michael and I are leaning toward
>>> introducing a new variable called RUN_TESTS which will do the docs and
>>> unit testing since those combined would add quite a bit to the execution
>>> time of each job (and feel free to bike shed the name, since the patches
>>> aren't final).
>>
>> Seeing how the distributor autotest usually takes less than a second to complete, this sounds like a bug.
>> I don't think I caught this so far.
> So I actually ran into the distributor test timing out. I agree with
> David in that it is a bug with the test. Looking at the logs that test
> normally finishes in less than 1/2 a second, so running to 10 seconds
> and timing out is a big jump in run time. I ran into the issue where
> it timedout, so I restarted the job and it finished no problem.
> The test fails every so often for no good reason and the logs[1] dont
> really say much. I speculate that it is waiting for a resource to
> become available or in the worse case a deadlock. Seeing that it only
> fails every so often and it passes when restarted I don't think it's a
> big deal, nevertheless it's worth investing time figuring out what's
> wrong
>
> [1] https://api.travis-ci.com/v3/job/212335916/log.txt

I investigated this test a little. CC'd David Hunt.

I was able to reproduce the problem on v19.08-rc1 with:

`while sudo sh -c "echo 'distributor_autotest' | ./build/app/test/dpdk-test"; do :; done`

It runs fine a couple of times, showing output and progress, but
then after a few seconds it just stops and produces no further output.
I let it sit there for a whole minute and nothing happened, so I attached
gdb to try to figure out what was going on. One thread seems to be stuck
in a while loop, see lib/librte_distributor/rte_distributor.c:310.
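
A bounded variant of the reproduction loop can be sketched in Python: it
reruns a command and reports the first run that exceeds a timeout instead
of hanging forever. This is an illustrative sketch only; `run_until_hang`
and its parameters are hypothetical names, and the demo at the end uses a
sleeping Python child process as a stand-in for the hanging test.

```python
import subprocess
import sys


def run_until_hang(cmd, timeout_s, max_runs, stdin_text=""):
    """Run cmd repeatedly; return the 1-based run number that hung, or None."""
    for attempt in range(1, max_runs + 1):
        try:
            subprocess.run(cmd, input=stdin_text, text=True,
                           capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising TimeoutExpired.
            return attempt
    return None


# Stand-in for a hanging test: a child that sleeps longer than the timeout.
hung_on = run_until_hang(
    [sys.executable, "-c", "import time; time.sleep(5)"],
    timeout_s=0.5, max_runs=3)
```

Applied to the case above, the command would be the dpdk-test binary with
"distributor_autotest\n" passed as `stdin_text`. Because the child is
killed on timeout, gdb would have to be attached before the timeout fires.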

I looked at the assembly code (layout asm, ni) and I saw these four 
lines below (which correspond to the while loop) being executed 
repeatedly and indefinitely. It looks like this thread is waiting for 
the variable bufptr64[0] to change state.

0xa064d0 <release+32>   pause
0xa064d2 <release+34>   mov    0x3840(%rdx),%rax
0xa064d9 <release+41>   test   $0x1,%al
0xa064db <release+43>   je     0xa064d0 <release+32>
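
Read as pseudocode, the four instructions above poll one 64-bit word and
keep looping while its lowest bit is clear. The Python sketch below mirrors
that control flow, with a spin limit added so that the example terminates;
`read_word`, `pause` and `max_spins` are illustrative names, not DPDK APIs.

```python
def spin_until_bit0_set(read_word, pause=lambda: None, max_spins=1_000_000):
    """Loop while bit 0 of the polled word is clear; return the spin count."""
    for spins in range(max_spins):
        pause()              # pause
        word = read_word()   # mov  0x3840(%rdx),%rax
        if word & 0x1:       # test $0x1,%al ; je jumps back when the bit is 0
            return spins
    raise TimeoutError("bit 0 never became set")
```

Without the limit, the loop can only end when another thread sets that
bit, which matches the observation that the thread spins indefinitely.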


While the first thread is waiting on bufptr64[0] to change state, there 
is another thread that is also stuck on another while loop on 
lib/librte_distributor/rte_distributor.c:53. It seems that this thread 
is stuck waiting for retptr64 to change state. Corresponding assembly 
being executed indefinitely:

0xa06de0 <rte_distributor_request_pkt_v1705+592> mov    0x38c0(%r8),%rax
0xa06de7 <rte_distributor_request_pkt_v1705+599> test   $0x1,%al
0xa06de9 <rte_distributor_request_pkt_v1705+601> je     0xa06bbd <rte_distributor_request_pkt_v1705+45>
0xa06def <rte_distributor_request_pkt_v1705+607> nop
0xa06df0 <rte_distributor_request_pkt_v1705+608> pause
0xa06df2 <rte_distributor_request_pkt_v1705+610> rdtsc
0xa06df4 <rte_distributor_request_pkt_v1705+612> mov    %rdx,%r10
0xa06df7 <rte_distributor_request_pkt_v1705+615> shl    $0x20,%r10
0xa06dfb <rte_distributor_request_pkt_v1705+619> mov    %eax,%eax
0xa06dfd <rte_distributor_request_pkt_v1705+621> or     %r10,%rax
0xa06e00 <rte_distributor_request_pkt_v1705+624> lea    0x64(%rax),%r10
0xa06e04 <rte_distributor_request_pkt_v1705+628> jmp    0xa06e12 <rte_distributor_request_pkt_v1705+642>
0xa06e06 <rte_distributor_request_pkt_v1705+630> nopw   %cs:0x0(%rax,%rax,1)
0xa06e10 <rte_distributor_request_pkt_v1705+640> pause
0xa06e12 <rte_distributor_request_pkt_v1705+642> rdtsc
0xa06e14 <rte_distributor_request_pkt_v1705+644> shl    $0x20,%rdx
0xa06e18 <rte_distributor_request_pkt_v1705+648> mov    %eax,%eax
0xa06e1a <rte_distributor_request_pkt_v1705+650> or     %rdx,%rax
0xa06e1d <rte_distributor_request_pkt_v1705+653> cmp    %rax,%r10
0xa06e20 <rte_distributor_request_pkt_v1705+656> ja     0xa06e10 <rte_distributor_request_pkt_v1705+640>
0xa06e22 <rte_distributor_request_pkt_v1705+658> jmp    0xa06de0 <rte_distributor_request_pkt_v1705+592>
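
The listing above can be read as a poll with a short busy-wait between
attempts: the loop exits when bit 0 of the polled word is clear, and
otherwise waits about 0x64 (100) timestamp-counter ticks before checking
again. A Python sketch of that reading follows; `read_word`, `read_tsc`
and `max_polls` are illustrative names, and the poll limit is added only
so that the example terminates.

```python
def wait_while_bit0_set(read_word, read_tsc, delay_cycles=100, max_polls=1000):
    """Loop while bit 0 of the polled word is set, busy-waiting between polls."""
    for polls in range(max_polls):
        if not (read_word() & 0x1):           # mov ; test $0x1,%al ; je <exit>
            return polls
        deadline = read_tsc() + delay_cycles  # rdtsc ; lea 0x64(%rax),%r10
        while read_tsc() < deadline:          # rdtsc ; cmp %rax,%r10 ; ja
            pass                              # pause
    raise TimeoutError("bit 0 never cleared")
```

Under this reading the second thread waits for a bit to be cleared, while
the first thread waits for a bit in a different word to be set.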


My guess is that these threads are interdependent, so each thread is
waiting for the other to change the state of its control variable.
I can't say for sure whether this is what is happening, or why these
variables don't change state, so I would like to ask someone who is
more familiar with this particular code to take a look.
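
The hypothesis in this paragraph, two threads each waiting for the other
to act first, can be illustrated with a small Python sketch. It only
models the suspected pattern and is not a confirmed diagnosis of
rte_distributor; `mutual_wait` and the flag names are hypothetical, and
the timeout exists only so that the example finishes.

```python
import threading


def mutual_wait(timeout_s=0.2):
    """Two threads each wait for the other's flag before setting their own."""
    flag_a, flag_b = threading.Event(), threading.Event()
    outcome = {}

    def worker(name, wait_for, then_set):
        if wait_for.wait(timeout_s):  # a real spin loop has no timeout
            then_set.set()
            outcome[name] = "progressed"
        else:
            outcome[name] = "stuck"

    threads = [
        threading.Thread(target=worker, args=("a", flag_b, flag_a)),
        threading.Thread(target=worker, args=("b", flag_a, flag_b)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome
```

Neither thread ever sets its flag, so both report "stuck". With spin
loops that never time out, this would appear as a hang like the one
observed in gdb.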

>>
>> Yes, we need a variable to control this and select the targets that will do the tests and/or build the doc.
>> About the name, RUN_TESTS is ok for me.
>>
>> What do you want to make of this variable?
>> Have it as a simple boolean that enables everything? Or a selector with strings like unit-tests+doc+perf-tests?
>>
>>
>>>>> - librte_table unit test crashes on ipv6 [2],
>>> I guess we're waiting on a patch from Jananee (CC'd)?
>>
>> Yep.
>>
>>
>> --
>> David Marchand
  
David Marchand July 10, 2019, 8:18 a.m. UTC | #13
On Tue, Jul 9, 2019 at 5:50 PM Michael Santana Francisco <
msantana@redhat.com> wrote:

> [...]
>
> My guess is that these threads are interdependent, so one thread is
> waiting for the other thread to change the state of the control
> variable. I can't say for sure if this is what is happening or why the
> these variables don't change state, so I would like ask someone who is
> more familiar with this particular code to take a look
>

Ah cool, thanks for the analysis.
Can you create a bz with this description and assign it to the
librte_distributor maintainer?