Message ID: 1560580950-16754-1-git-send-email-david.marchand@redhat.com (mailing list archive)
From: David Marchand <david.marchand@redhat.com>
To: dev@dpdk.org
Cc: thomas@monjalon.net, aconole@redhat.com, msantana@redhat.com
Date: Sat, 15 Jun 2019 08:42:15 +0200
Subject: [dpdk-dev] [PATCH v2 00/15] Unit tests fixes for CI
In-Reply-To: <1559638792-8608-1-git-send-email-david.marchand@redhat.com>
Series: Unit tests fixes for CI
Message
David Marchand
June 15, 2019, 6:42 a.m. UTC
This is a joint effort to make the unit tests ready for CI.
The first patches are fixes that I had accumulated.
Then the second part of the series focuses on skipping tests when some
requirements are not fulfilled, so that we can start them in a restrained
environment like Travis virtual machines, which give us two cores and do
not have specific hw devices.

We are still not ready for enabling those tests in Travis.
At least, the following issues remain:
- some fixes on librte_acl have not been merged yet [1],
- the tests on --file-prefix are still ko, and have been isolated in a
  test that we could disable while waiting for the fixes,
- rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
  than 10s,
- librte_table unit test crashes on ipv6 [2],
- the "perf" tests are taking way too long for my taste,
- the shared build unit tests all fail when depending on mempool since
  the mempool drivers are not loaded.

1: http://patchwork.dpdk.org/project/dpdk/list/?series=4242
2: https://bugs.dpdk.org/show_bug.cgi?id=285

Changelog since v1:
- removed the limit of 128 cores in the rcu tests,
- reworked Michael's patch on the eal test and started splitting big
  tests into subtests: when a subtest fails, it does not impact the other
  subtests; plus, subtests are shorter to run, so it is easier to make
  them fit in 10s.

Comments/reviews welcome!
Comments
On Sat, Jun 15, 2019 at 08:42:15AM +0200, David Marchand wrote:
> This is a joint effort to make the unit tests ready for CI.
> [...]
> - the shared build unit tests all fail when depending on mempool since
>   the mempool drivers are not loaded,

For the autotest app shared builds, it is probably worthwhile linking in
all drivers explicitly to avoid issues like this.

/Bruce
On Mon, Jun 17, 2019 at 12:02 PM Bruce Richardson <bruce.richardson@intel.com> wrote:
> [...]
> For the autotest app shared builds, it is probably worthwhile linking in
> all drivers explicitly to avoid issues like this.

Yes, I'll look into this.
While at it, do you know why the i40e and ixgbe drivers are linked to
app/test in meson?
On Mon, Jun 17, 2019 at 12:46:03PM +0200, David Marchand wrote:
> [...]
> Yes, I'll look into this.
> While at it, do you know why the i40e and ixgbe drivers are linked to
> app/test in meson?

There are unit tests for the device-specific functions in those drivers,
so they need to be given at link time.

/Bruce
On Mon, Jun 17, 2019 at 1:18 PM Bruce Richardson <bruce.richardson@intel.com> wrote:
> [...]
> There are unit tests for the device-specific functions in those drivers,
> so they need to be given at link time.

For testpmd, I can understand.
But I can't see code for driver specific apis in app/test.
It looks like a copy/paste error when adding meson support.
On Mon, Jun 17, 2019 at 01:41:21PM +0200, David Marchand wrote:
> [...]
> For testpmd, I can understand.
> But I can't see code for driver specific apis in app/test.
> It looks like a copy/paste error when adding meson support.

Ok, could be. Simple question is does it still build ok if you remove them?

/Bruce
On Mon, Jun 17, 2019 at 1:57 PM Bruce Richardson <bruce.richardson@intel.com> wrote:
> [...]
> Ok, could be. Simple question is does it still build ok if you remove them?

It would have been strange if it did not build, since on the Makefile side
we do nothing. Yes, it builds fine with meson without this.

I managed to get the same test results as with static builds by linking in
the skeleton eventdev driver and the mempool sw drivers. Should be enough.
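The fix David describes amounts to listing the needed drivers as explicit link-time dependencies of the test binary. A hypothetical sketch of what this could look like in app/test/meson.build follows; the dependency names are assumptions for illustration, not the wording of the merged patch:

```meson
# Sketch: in a shared build, only drivers linked into the test binary
# are loaded at startup, so list the ones the unit tests need at
# runtime explicitly (static builds pull them in anyway).
deps += ['mempool_ring', 'mempool_stack', 'event_skeleton']
```

This makes the shared-build test binary behave like the static one for the mempool- and eventdev-dependent tests, without pulling in every driver.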
15/06/2019 08:42, David Marchand:
> This is a joint effort to make the unit tests ready for CI.

Applied, thanks

Remaining work below from your list:
[...]
> - the tests on --file-prefix are still ko, and have been isolated in a
>   test that we could disable while waiting for the fixes,
> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
>   than 10s,
> - librte_table unit test crashes on ipv6 [2],
> - the "perf" tests are taking way too long for my taste,
> - the shared build unit tests all fail when depending on mempool since
>   the mempool drivers are not loaded,
>
> 2: https://bugs.dpdk.org/show_bug.cgi?id=285
Thomas Monjalon <thomas@monjalon.net> writes:
> 15/06/2019 08:42, David Marchand:
>> This is a joint effort to make the unit tests ready for CI.
>
> Applied, thanks
>
> Remaining work below from your list:
> [...]
>> - the tests on --file-prefix are still ko, and have been isolated in a
>> test that we could disable while waiting for the fixes,

Yes, I think it's good to do that for now.

>> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
>> than 10s,

Occasionally the distributor test times out as well. I've moved them as
part of a separate patch that I'll post along with a bigger series to
enable the unit tests under travis. Michael and I are leaning toward
introducing a new variable called RUN_TESTS which will do the docs and
unit testing, since those combined would add quite a bit to the execution
time of each job (and feel free to bike shed the name, since the patches
aren't final).

>> - librte_table unit test crashes on ipv6 [2],

I guess we're waiting on a patch from Jananee (CC'd)?

>> - the "perf" tests are taking way too long for my taste,

Agreed, so I think we'll disable the perf tests in the CI environment anyway.

>> - the shared build unit tests all fail when depending on mempool since
>> the mempool drivers are not loaded,

I think Michael is working on a fix for this right now.

>> 2: https://bugs.dpdk.org/show_bug.cgi?id=285
01/07/2019 18:04, Aaron Conole:
> Michael and I are leaning toward
> introducing a new variable called RUN_TESTS which will do the docs and
> unit testing since those combined would add quite a bit to the execution
> time of each job (and feel free to bike shed the name, since the patches
> aren't final).

Please would you like to send an RFC so we can discuss it before you do
the final patches?
On Mon, Jul 1, 2019 at 6:04 PM Aaron Conole <aconole@redhat.com> wrote:
> >> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little
> >> more than 10s,
>
> Occasionally the distributor test times out as well. [...]

Seeing how the distributor autotest usually takes less than a second to
complete, this sounds like a bug. I don't think I caught this so far.

Yes, we need a variable to control this and select the targets that will
do the tests and/or build the doc. About the name, RUN_TESTS is ok for me.

What do you want to make of this variable? Have it as a simple boolean
that enables everything? Or a selector with strings like
unit-tests+doc+perf-tests?

> >> - librte_table unit test crashes on ipv6 [2],
>
> I guess we're waiting on a patch from Jananee (CC'd)?

Yep.
> On Mon, Jul 1, 2019 at 6:04 PM Aaron Conole <aconole@redhat.com> wrote:
>> [...]
>> Occasionally the distributor test times out as well.
>
> Seeing how the distributor autotest usually takes less than a second to
> complete, this sounds like a bug.
> I don't think I caught this so far.

So I actually ran into the distributor test timing out. I agree with
David that it is a bug in the test. Looking at the logs, that test
normally finishes in less than half a second, so running to 10 seconds
and timing out is a big jump in run time. I ran into the issue where it
timed out, so I restarted the job and it finished no problem.
The test fails every so often for no good reason, and the logs [1] don't
really say much. I speculate that it is waiting for a resource to become
available or, in the worst case, hitting a deadlock. Seeing that it only
fails every so often and passes when restarted, I don't think it's a big
deal; nevertheless it's worth investing time figuring out what's wrong.

[1] https://api.travis-ci.com/v3/job/212335916/log.txt
On 7/1/19 2:07 PM, Michael Santana Francisco wrote:
> [...]
> So I actually ran into the distributor test timing out. I agree with
> David in that it is a bug with the test. [...]
>
> [1] https://api.travis-ci.com/v3/job/212335916/log.txt

I investigated this test a little bit. CC'd David Hunt.

I was able to reproduce the problem on v19.08-rc1 with:

`while sudo sh -c "echo 'distributor_autotest' | ./build/app/test/dpdk-test"; do :; done`

It runs fine a couple of times, showing output and progress, but then at
some point after a couple of seconds it just stops - no further output.
It just sits there. I let it sit for a whole minute and nothing
happened, so I attached gdb to try to figure out what is happening. One
thread seems to be stuck in a while loop, see
lib/librte_distributor/rte_distributor.c:310.

I looked at the assembly code (layout asm, ni) and saw the four lines
below (which correspond to the while loop) being executed repeatedly and
indefinitely. It looks like this thread is waiting for the variable
bufptr64[0] to change state.

0xa064d0 <release+32>  pause
0xa064d2 <release+34>  mov    0x3840(%rdx),%rax
0xa064d9 <release+41>  test   $0x1,%al
0xa064db <release+43>  je     0xa064d0 <release+32>

While the first thread is waiting on bufptr64[0] to change state,
another thread is stuck in a second while loop at
lib/librte_distributor/rte_distributor.c:53, apparently waiting for
retptr64 to change state. Corresponding assembly being executed
indefinitely:

0xa06de0 <rte_distributor_request_pkt_v1705+592>  mov    0x38c0(%r8),%rax
0xa06de7 <rte_distributor_request_pkt_v1705+599>  test   $0x1,%al
0xa06de9 <rte_distributor_request_pkt_v1705+601>  je     0xa06bbd <rte_distributor_request_pkt_v1705+45>
0xa06def <rte_distributor_request_pkt_v1705+607>  nop
0xa06df0 <rte_distributor_request_pkt_v1705+608>  pause
0xa06df2 <rte_distributor_request_pkt_v1705+610>  rdtsc
0xa06df4 <rte_distributor_request_pkt_v1705+612>  mov    %rdx,%r10
0xa06df7 <rte_distributor_request_pkt_v1705+615>  shl    $0x20,%r10
0xa06dfb <rte_distributor_request_pkt_v1705+619>  mov    %eax,%eax
0xa06dfd <rte_distributor_request_pkt_v1705+621>  or     %r10,%rax
0xa06e00 <rte_distributor_request_pkt_v1705+624>  lea    0x64(%rax),%r10
0xa06e04 <rte_distributor_request_pkt_v1705+628>  jmp    0xa06e12 <rte_distributor_request_pkt_v1705+642>
0xa06e06 <rte_distributor_request_pkt_v1705+630>  nopw   %cs:0x0(%rax,%rax,1)
0xa06e10 <rte_distributor_request_pkt_v1705+640>  pause
0xa06e12 <rte_distributor_request_pkt_v1705+642>  rdtsc
0xa06e14 <rte_distributor_request_pkt_v1705+644>  shl    $0x20,%rdx
0xa06e18 <rte_distributor_request_pkt_v1705+648>  mov    %eax,%eax
0xa06e1a <rte_distributor_request_pkt_v1705+650>  or     %rdx,%rax
0xa06e1d <rte_distributor_request_pkt_v1705+653>  cmp    %rax,%r10
0xa06e20 <rte_distributor_request_pkt_v1705+656>  ja     0xa06e10 <rte_distributor_request_pkt_v1705+640>
0xa06e22 <rte_distributor_request_pkt_v1705+658>  jmp    0xa06de0 <rte_distributor_request_pkt_v1705+592>

My guess is that these threads are interdependent: each is waiting for
the other to change the state of its control variable. I can't say for
sure that this is what is happening, or why these variables don't change
state, so I would like to ask someone who is more familiar with this
particular code to take a look.
On Tue, Jul 9, 2019 at 5:50 PM Michael Santana Francisco <msantana@redhat.com> wrote:
> [...]
> My guess is that these threads are interdependent: each is waiting for
> the other to change the state of its control variable. I can't say for
> sure that this is what is happening, or why these variables don't
> change state, so I would like to ask someone who is more familiar with
> this particular code to take a look.

Ah cool, thanks for the analysis.
Can you create a bz with this description and assign it to the
librte_distributor maintainer?