mbox series

[v5,00/15] fix distributor synchronization issues

Message ID 20201008052323.11547-1-l.wojciechow@partner.samsung.com (mailing list archive)
Headers
Series fix distributor synchronization issues |

Message

Lukasz Wojciechowski Oct. 8, 2020, 5:23 a.m. UTC
  During review and verification of the patch created by Sarosh Arif:
"test_distributor: prevent memory leakages from the pool" I found out
that running distributor unit tests multiple times in a row causes fails.
So I investigated all the issues I found.

There are few synchronization issues that might cause deadlocks
or corrupted data. They are fixed with this set of patches for both tests
and librte_distributor library.

---
v5:
* implement missing functionality in burst mode - worker shutdown
* fix shutdown test to always shutdown busy worker
* use atomic stores instead of barrier in tests clear_packet_count()
* reorder patches
* new patch 7: fix call to return_pkt in single mode
* new patch 11: replacing delays with spinlock on atomics in tests
* new patch 12: fix scalar matching algorithm
* new patch 13: new test with marking and checking every packet
* new patch 14: flush also in flight packets
* new patch 15: fix clearing returns buffer
* minor fixes in other patches

v4:
* adjust commit name prefixes app/test -> test/distributor:
* reorder patches
* use NULL oldpkt in rte_distributor_get_pkt() calls in tests

v3:
* add missing acked and tested by statements from v1

v2:
* assign NULL to freed mbufs in distributor test
* fix handshake check on legacy single distributor
     rte_distributor_return_pkt_single()
* add patch 7 passing NULL to legacy API calls if no bufs are returned
* add patch 8 fixing API documentation


Lukasz Wojciechowski (15):
  distributor: fix missing handshake synchronization
  distributor: fix handshake deadlock
  distributor: do not use oldpkt when not needed
  distributor: handle worker shutdown in burst mode
  test/distributor: fix shutdown of busy worker
  test/distributor: synchronize lcores statistics
  distributor: fix return pkt calls in single mode
  test/distributor: fix freeing mbufs
  test/distributor: collect return mbufs
  distributor: align API documentation with code
  test/distributor: replace delays with spin locks
  distributor: fix scalar matching
  test/distributor: add test with packets marking
  distributor: fix flushing in flight packets
  distributor: fix clearing returns buffer

 app/test/test_distributor.c                   | 321 ++++++++++++++----
 lib/librte_distributor/distributor_private.h  |   3 +
 lib/librte_distributor/rte_distributor.c      | 219 +++++++++---
 lib/librte_distributor/rte_distributor.h      |  23 +-
 .../rte_distributor_single.c                  |   4 +
 5 files changed, 447 insertions(+), 123 deletions(-)
  

Comments

David Marchand Oct. 8, 2020, 7:30 a.m. UTC | #1
On Thu, Oct 8, 2020 at 7:24 AM Lukasz Wojciechowski
<l.wojciechow@partner.samsung.com> wrote:
>
> During review and verification of the patch created by Sarosh Arif:
> "test_distributor: prevent memory leakages from the pool" I found out
> that running distributor unit tests multiple times in a row causes fails.
> So I investigated all the issues I found.
>
> There are few synchronization issues that might cause deadlocks
> or corrupted data. They are fixed with this set of patches for both tests
> and librte_distributor library.
>
> ---
> v5:
> * implement missing functionality in burst mode - worker shutdown
> * fix shutdown test to always shutdown busy worker
> * use atomic stores instead of barrier in tests clear_packet_count()
> * reorder patches
> * new patch 7: fix call to return_pkt in single mode
> * new patch 11: replacing delays with spinlock on atomics in tests
> * new patch 12: fix scalar matching algorithm
> * new patch 13: new test with marking and checking every packet
> * new patch 14: flush also in flight packets
> * new patch 15: fix clearing returns buffer
> * minor fixes in other patches

Thanks for working on it, Lukasz.
David, Honnappa, review please.
  
Lukasz Wojciechowski Oct. 8, 2020, 9:16 p.m. UTC | #2
W dniu 08.10.2020 o 09:30, David Marchand pisze:
> On Thu, Oct 8, 2020 at 7:24 AM Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com> wrote:
>> During review and verification of the patch created by Sarosh Arif:
>> "test_distributor: prevent memory leakages from the pool" I found out
>> that running distributor unit tests multiple times in a row causes fails.
>> So I investigated all the issues I found.
>>
>> There are few synchronization issues that might cause deadlocks
>> or corrupted data. They are fixed with this set of patches for both tests
>> and librte_distributor library.
>>
>> ---
>> v5:
>> * implement missing functionality in burst mode - worker shutdown
>> * fix shutdown test to always shutdown busy worker
>> * use atomic stores instead of barrier in tests clear_packet_count()
>> * reorder patches
>> * new patch 7: fix call to return_pkt in single mode
>> * new patch 11: replacing delays with spinlock on atomics in tests
>> * new patch 12: fix scalar matching algorithm
>> * new patch 13: new test with marking and checking every packet
>> * new patch 14: flush also in flight packets
>> * new patch 15: fix clearing returns buffer
>> * minor fixes in other patches
> Thanks for working on it, Lukasz.
Sorry for the delay, but it was much to solve and test.
> David, Honnappa, review please.
I'm here if you have any questions or suggestions
>
>
  
David Marchand Oct. 9, 2020, 12:53 p.m. UTC | #3
Hello Lukasz,

On Thu, Oct 8, 2020 at 11:17 PM Lukasz Wojciechowski
<l.wojciechow@partner.samsung.com> wrote:
> I'm here if you have any questions or suggestions

Unfortunately, I can see a timeout on the distributor autotest in Travis:
https://travis-ci.com/github/ovsrobot/dpdk/jobs/396703415#L1151

Can you have a look?
Btw, did you receive a notification about this from the robot?
  
Lukasz Wojciechowski Oct. 9, 2020, 9:41 p.m. UTC | #4
Hi David,

W dniu 09.10.2020 o 14:53, David Marchand pisze:
> Hello Lukasz,
>
> On Thu, Oct 8, 2020 at 11:17 PM Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com> wrote:
>> I'm here if you have any questions or suggestions
> Unfortunately, I can see a timeout on the distributor autotest in Travis:
> https://travis-ci.com/github/ovsrobot/dpdk/jobs/396703415#L1151
>
> Can you have a look?
I took a look and I don't know the cause of test hanging and timeout.
I run today more than 200000 iteration of distributor tests and didn't 
get a single failure or lock.
David Hunt run the series tests today also, when checking impact on 
performance and I guess he didn't got the issue.
@DavidHunt, Am I right?

The failure happened in only one configuration and tests were run by 
travis using different compilers, architecture, etc.

The test did not wrote anything on the stdout or stderr:
--- stdout ---
EAL: Probing VFIO support...
APP: HPET is not enabled, using TSC as default timer
RTE>>distributor_autotest
--- stderr ---
EAL: Detected 2 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No available hugepages reported in hugepages-1048576kB
-------
That's quite strange before the first test that is run: sanity_test, 
starts with printing information about the start.

Before that there is only the initialization code of the distributor 
structure and creation of mempool.

The only modifications I made to initialization of distributor structure 
was initialization of active and active sum fields of distributor:

     memset(d->active, 0, sizeof(d->active));
     d->activesum = 0;

That's seems not to be the reason.

I don't know what could be.


Is there a way to trigger travis job manually to see if the timeout 
reproduces ?

> Btw, did you receive a notification about this from the robot?
Yes, I got it.
But I interpreted it badly. I downloaded the log and start reading it up 
from the end and when I saw:

    Compiler stderr:^M
      /usr/bin/ld: cannot find -lvirt^M
    collect2: error: ld returned 1 exit status^M

  I thought that was it. Sorry for that.


BTW I'm going to publish v6 with changes suggested by Honnappa 
Nagarahalli (RELAXED memory mode) and David Hunt (indentations)


Best regards

Lukasz

>
>
  
Lukasz Wojciechowski Oct. 9, 2020, 11:25 p.m. UTC | #5
W dniu 09.10.2020 o 23:41, Lukasz Wojciechowski pisze:
>
> Hi David,
>
> W dniu 09.10.2020 o 14:53, David Marchand pisze:
>> Hello Lukasz,
>>
>> On Thu, Oct 8, 2020 at 11:17 PM Lukasz Wojciechowski
>> <l.wojciechow@partner.samsung.com>  wrote:
>>> I'm here if you have any questions or suggestions
>> Unfortunately, I can see a timeout on the distributor autotest in Travis:
>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/396703415#L1151
>>
>> Can you have a look?
> I took a look and I don't know the cause of test hanging and timeout.
> I run today more than 200000 iteration of distributor tests and didn't 
> get a single failure or lock.
> David Hunt run the series tests today also, when checking impact on 
> performance and I guess he didn't got the issue.
> @DavidHunt, Am I right?
>
> The failure happened in only one configuration and tests were run by 
> travis using different compilers, architecture, etc.
>
> The test did not wrote anything on the stdout or stderr:
> --- stdout ---
> EAL: Probing VFIO support...
> APP: HPET is not enabled, using TSC as default timer
> RTE>>distributor_autotest
> --- stderr ---
> EAL: Detected 2 lcore(s)
> EAL: Detected 1 NUMA nodes
> EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket
> EAL: Selected IOVA mode 'PA'
> EAL: No available hugepages reported in hugepages-1048576kB
> -------
> That's quite strange before the first test that is run: sanity_test, 
> starts with printing information about the start.
>
> Before that there is only the initialization code of the distributor 
> structure and creation of mempool.
>
> The only modifications I made to initialization of distributor 
> structure was initialization of active and active sum fields of 
> distributor:
>
>     memset(d->active, 0, sizeof(d->active));
>     d->activesum = 0;
>
> That's seems not to be the reason.
>
> I don't know what could be.
>
>
> Is there a way to trigger travis job manually to see if the timeout 
> reproduces ?
>
>> Btw, did you receive a notification about this from the robot?
> Yes, I got it.
> But I interpreted it badly. I downloaded the log and start reading it 
> up from the end and when I saw:
>
>     Compiler stderr:^M
>      /usr/bin/ld: cannot find -lvirt^M
>     collect2: error: ld returned 1 exit status^M
>
>  I thought that was it. Sorry for that.
>
>
> BTW I'm going to publish v6 with changes suggested by Honnappa 
> Nagarahalli (RELAXED memory mode) and David Hunt (indentations)
>

More bad news - same issue just appeared on travis for v6.
Good news we can reproduce it.

Is there a way to delegate a job for travis other way than sending a new 
patch version?

> Best regards
>
> Lukasz
>
> -- 
> Lukasz Wojciechowski
> Principal Software Engineer
>
> Samsung R&D Institute Poland
> Samsung Electronics
> Office +48 22 377 88 25
> l.wojciechow@partner.samsung.com
>
  
David Marchand Oct. 10, 2020, 8:12 a.m. UTC | #6
Hello Lukasz,

On Sat, Oct 10, 2020 at 1:26 AM Lukasz Wojciechowski
<l.wojciechow@partner.samsung.com> wrote:
> W dniu 09.10.2020 o 23:41, Lukasz Wojciechowski pisze:
> More bad news - same issue just appeared on travis for v6.
> Good news we can reproduce it.
>
> Is there a way to delegate a job for travis other way than sending a new patch version?

You just need to fork dpdk in github, then setup travis.
Travis will get triggered on push.
I can help offlist if needed.
  
David Marchand Oct. 10, 2020, 8:15 a.m. UTC | #7
On Sat, Oct 10, 2020 at 10:12 AM David Marchand
<david.marchand@redhat.com> wrote:
>
> Hello Lukasz,
>
> On Sat, Oct 10, 2020 at 1:26 AM Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com> wrote:
> > W dniu 09.10.2020 o 23:41, Lukasz Wojciechowski pisze:
> > More bad news - same issue just appeared on travis for v6.
> > Good news we can reproduce it.
> >
> > Is there a way to delegate a job for travis other way than sending a new patch version?
>
> You just need to fork dpdk in github, then setup travis.

Forgot to paste it:
https://docs.travis-ci.com/user/tutorial/#to-get-started-with-travis-ci-using-github

> Travis will get triggered on push.
> I can help offlist if needed.
  
Lukasz Wojciechowski Oct. 10, 2020, 4:27 p.m. UTC | #8
W dniu 10.10.2020 o 10:12, David Marchand pisze:
> Hello Lukasz,
>
> On Sat, Oct 10, 2020 at 1:26 AM Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com> wrote:
>> W dniu 09.10.2020 o 23:41, Lukasz Wojciechowski pisze:
>> More bad news - same issue just appeared on travis for v6.
>> Good news we can reproduce it.
>>
>> Is there a way to delegate a job for travis other way than sending a new patch version?
> You just need to fork dpdk in github, then setup travis.
> Travis will get triggered on push.
> I can help offlist if needed.

Thank you

I managed to reproduce the issue by stressing my machine's cpus and memory.

The issue was caused by slow start of worker threads, which didn't reach 
place where they request for packages, because of that
they were treated as not activated. The distributor thread didn't send 
any packets because of that fact, but waited in an infinite loop until 
packets are returned from workers.

I pushed v7 of series with additional patch fixing that by running 
rte_distributor_process() in a loop until it manages to send all packets 
to workers.

>
>