Message ID | 20201008052323.11547-1-l.wojciechow@partner.samsung.com (mailing list archive) |
---|---|
Headers |
From: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
Cc: dev@dpdk.org, l.wojciechow@partner.samsung.com
Date: Thu, 8 Oct 2020 07:23:08 +0200
Message-Id: <20201008052323.11547-1-l.wojciechow@partner.samsung.com>
In-Reply-To: <20200925224209.12173-1-l.wojciechow@partner.samsung.com>
References: <20200925224209.12173-1-l.wojciechow@partner.samsung.com>
Subject: [dpdk-dev] [PATCH v5 00/15] fix distributor synchronization issues
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Archive: <http://mails.dpdk.org/archives/dev/> |
Series |
fix distributor synchronization issues
|
|
Message
Lukasz Wojciechowski
Oct. 8, 2020, 5:23 a.m. UTC
During review and verification of the patch created by Sarosh Arif:
"test_distributor: prevent memory leakages from the pool" I found out
that running the distributor unit tests multiple times in a row causes
failures. So I investigated all the issues I found.

There are a few synchronization issues that might cause deadlocks
or corrupted data. They are fixed with this set of patches for both
the tests and the librte_distributor library.

---
v5:
* implement missing functionality in burst mode - worker shutdown
* fix shutdown test to always shutdown busy worker
* use atomic stores instead of barrier in tests clear_packet_count()
* reorder patches
* new patch 7: fix call to return_pkt in single mode
* new patch 11: replacing delays with spinlock on atomics in tests
* new patch 12: fix scalar matching algorithm
* new patch 13: new test with marking and checking every packet
* new patch 14: flush also in flight packets
* new patch 15: fix clearing returns buffer
* minor fixes in other patches

v4:
* adjust commit name prefixes app/test -> test/distributor:
* reorder patches
* use NULL oldpkt in rte_distributor_get_pkt() calls in tests

v3:
* add missing acked and tested by statements from v1

v2:
* assign NULL to freed mbufs in distributor test
* fix handshake check on legacy single distributor
  rte_distributor_return_pkt_single()
* add patch 7 passing NULL to legacy API calls if no bufs are returned
* add patch 8 fixing API documentation

Lukasz Wojciechowski (15):
  distributor: fix missing handshake synchronization
  distributor: fix handshake deadlock
  distributor: do not use oldpkt when not needed
  distributor: handle worker shutdown in burst mode
  test/distributor: fix shutdown of busy worker
  test/distributor: synchronize lcores statistics
  distributor: fix return pkt calls in single mode
  test/distributor: fix freeing mbufs
  test/distributor: collect return mbufs
  distributor: align API documentation with code
  test/distributor: replace delays with spin locks
  distributor: fix scalar matching
  test/distributor: add test with packets marking
  distributor: fix flushing in flight packets
  distributor: fix clearing returns buffer

 app/test/test_distributor.c                   | 321 ++++++++++++++----
 lib/librte_distributor/distributor_private.h  |   3 +
 lib/librte_distributor/rte_distributor.c      | 219 +++++++++---
 lib/librte_distributor/rte_distributor.h      |  23 +-
 .../rte_distributor_single.c                  |   4 +
 5 files changed, 447 insertions(+), 123 deletions(-)
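Two of the v5 changelog items, the atomic stores in clear_packet_count() and the replacement of fixed delays with spinning on atomics, can be illustrated with a small standalone sketch. It uses C11 <stdatomic.h> rather than DPDK's own primitives, and the names worker_stats, handled_packets, count_packets(), total_packet_count() and wait_for_packets(), as well as the bounded spin count, are assumptions made for illustration, not code taken from the patches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_WORKERS 4 /* hypothetical fixed size, chosen only for the sketch */

/* Hypothetical per-worker statistics; not the layout used by the real tests. */
struct worker_stat {
	atomic_uint handled_packets;
};

static struct worker_stat worker_stats[MAX_WORKERS];

/* Reset every counter with an atomic store instead of plain writes plus a barrier. */
static void clear_packet_count(void)
{
	for (unsigned int i = 0; i < MAX_WORKERS; i++)
		atomic_store_explicit(&worker_stats[i].handled_packets, 0,
				memory_order_relaxed);
}

/* Called on behalf of a worker: account for n handled packets. */
static void count_packets(unsigned int worker, unsigned int n)
{
	assert(worker < MAX_WORKERS);
	atomic_fetch_add_explicit(&worker_stats[worker].handled_packets, n,
			memory_order_relaxed);
}

/* Sum all counters using atomic loads. */
static unsigned int total_packet_count(void)
{
	unsigned int total = 0;

	for (unsigned int i = 0; i < MAX_WORKERS; i++)
		total += atomic_load_explicit(&worker_stats[i].handled_packets,
				memory_order_relaxed);
	return total;
}

/*
 * Poll the counters until at least `expected` packets have been counted,
 * instead of sleeping for a fixed time and hoping the workers are done.
 * The max_spins bound is an addition of this sketch so it cannot hang.
 */
static bool wait_for_packets(unsigned int expected, uint64_t max_spins)
{
	for (uint64_t i = 0; i < max_spins; i++) {
		if (total_packet_count() >= expected)
			return true;
	}
	return false;
}
```

The point of polling the counter is that the waiting thread proceeds as soon as the condition holds and never proceeds before it does, which a fixed delay cannot guarantee on a heavily loaded machine.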
Comments
On Thu, Oct 8, 2020 at 7:24 AM Lukasz Wojciechowski
<l.wojciechow@partner.samsung.com> wrote:
>
> During review and verification of the patch created by Sarosh Arif:
> "test_distributor: prevent memory leakages from the pool" I found out
> that running the distributor unit tests multiple times in a row causes
> failures. So I investigated all the issues I found.
>
> There are a few synchronization issues that might cause deadlocks
> or corrupted data. They are fixed with this set of patches for both
> the tests and the librte_distributor library.
>
> ---
> v5:
> * implement missing functionality in burst mode - worker shutdown
> * fix shutdown test to always shutdown busy worker
> * use atomic stores instead of barrier in tests clear_packet_count()
> * reorder patches
> * new patch 7: fix call to return_pkt in single mode
> * new patch 11: replacing delays with spinlock on atomics in tests
> * new patch 12: fix scalar matching algorithm
> * new patch 13: new test with marking and checking every packet
> * new patch 14: flush also in flight packets
> * new patch 15: fix clearing returns buffer
> * minor fixes in other patches

Thanks for working on it, Lukasz.

David, Honnappa, review please.
On 08.10.2020 at 09:30, David Marchand wrote:
> On Thu, Oct 8, 2020 at 7:24 AM Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com> wrote:
>> During review and verification of the patch created by Sarosh Arif:
>> "test_distributor: prevent memory leakages from the pool" I found out
>> that running the distributor unit tests multiple times in a row causes
>> failures. So I investigated all the issues I found.
>>
>> There are a few synchronization issues that might cause deadlocks
>> or corrupted data. They are fixed with this set of patches for both
>> the tests and the librte_distributor library.
>>
>> ---
>> v5:
>> * implement missing functionality in burst mode - worker shutdown
>> * fix shutdown test to always shutdown busy worker
>> * use atomic stores instead of barrier in tests clear_packet_count()
>> * reorder patches
>> * new patch 7: fix call to return_pkt in single mode
>> * new patch 11: replacing delays with spinlock on atomics in tests
>> * new patch 12: fix scalar matching algorithm
>> * new patch 13: new test with marking and checking every packet
>> * new patch 14: flush also in flight packets
>> * new patch 15: fix clearing returns buffer
>> * minor fixes in other patches
> Thanks for working on it, Lukasz.
Sorry for the delay, but there was a lot to solve and test.
> David, Honnappa, review please.
I'm here if you have any questions or suggestions.
Hello Lukasz,
On Thu, Oct 8, 2020 at 11:17 PM Lukasz Wojciechowski
<l.wojciechow@partner.samsung.com> wrote:
> I'm here if you have any questions or suggestions
Unfortunately, I can see a timeout on the distributor autotest in Travis:
https://travis-ci.com/github/ovsrobot/dpdk/jobs/396703415#L1151
Can you have a look?
Btw, did you receive a notification about this from the robot?
Hi David,

On 09.10.2020 at 14:53, David Marchand wrote:
> Hello Lukasz,
>
> On Thu, Oct 8, 2020 at 11:17 PM Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com> wrote:
>> I'm here if you have any questions or suggestions
> Unfortunately, I can see a timeout on the distributor autotest in Travis:
> https://travis-ci.com/github/ovsrobot/dpdk/jobs/396703415#L1151
>
> Can you have a look?
I took a look and I don't know the cause of the test hanging and timing out.
Today I ran more than 200000 iterations of the distributor tests and didn't
get a single failure or lock-up.
David Hunt also ran the series tests today, when checking the impact on
performance, and I guess he didn't hit the issue.
@DavidHunt, am I right?

The failure happened in only one configuration, and the tests were run by
Travis using different compilers, architectures, etc.

The test did not write anything to stdout or stderr:
--- stdout ---
EAL: Probing VFIO support...
APP: HPET is not enabled, using TSC as default timer
RTE>>distributor_autotest
--- stderr ---
EAL: Detected 2 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No available hugepages reported in hugepages-1048576kB
-------
That's quite strange, because the first test that is run, sanity_test,
starts by printing information about its start.

Before that there is only the initialization code of the distributor
structure and the creation of the mempool.

The only modification I made to the initialization of the distributor
structure was the initialization of the active and activesum fields of
the distributor:

    memset(d->active, 0, sizeof(d->active));
    d->activesum = 0;

That doesn't seem to be the reason.

I don't know what it could be.

Is there a way to trigger a Travis job manually to see if the timeout
reproduces?
> Btw, did you receive a notification about this from the robot?
Yes, I got it.
But I misinterpreted it. I downloaded the log and started reading it
from the end, and when I saw:

Compiler stderr:^M
/usr/bin/ld: cannot find -lvirt^M
collect2: error: ld returned 1 exit status^M

I thought that was it. Sorry for that.

BTW I'm going to publish v6 with the changes suggested by Honnappa
Nagarahalli (RELAXED memory mode) and David Hunt (indentations).

Best regards

Lukasz
On 09.10.2020 at 23:41, Lukasz Wojciechowski wrote:
>
> Hi David,
>
> On 09.10.2020 at 14:53, David Marchand wrote:
>> Hello Lukasz,
>>
>> On Thu, Oct 8, 2020 at 11:17 PM Lukasz Wojciechowski
>> <l.wojciechow@partner.samsung.com> wrote:
>>> I'm here if you have any questions or suggestions
>> Unfortunately, I can see a timeout on the distributor autotest in Travis:
>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/396703415#L1151
>>
>> Can you have a look?
> I took a look and I don't know the cause of the test hanging and timing out.
> Today I ran more than 200000 iterations of the distributor tests and didn't
> get a single failure or lock-up.
> David Hunt also ran the series tests today, when checking the impact on
> performance, and I guess he didn't hit the issue.
> @DavidHunt, am I right?
>
> The failure happened in only one configuration, and the tests were run by
> Travis using different compilers, architectures, etc.
>
> The test did not write anything to stdout or stderr:
> --- stdout ---
> EAL: Probing VFIO support...
> APP: HPET is not enabled, using TSC as default timer
> RTE>>distributor_autotest
> --- stderr ---
> EAL: Detected 2 lcore(s)
> EAL: Detected 1 NUMA nodes
> EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket
> EAL: Selected IOVA mode 'PA'
> EAL: No available hugepages reported in hugepages-1048576kB
> -------
> That's quite strange, because the first test that is run, sanity_test,
> starts by printing information about its start.
>
> Before that there is only the initialization code of the distributor
> structure and the creation of the mempool.
>
> The only modification I made to the initialization of the distributor
> structure was the initialization of the active and activesum fields of
> the distributor:
>
>     memset(d->active, 0, sizeof(d->active));
>     d->activesum = 0;
>
> That doesn't seem to be the reason.
>
> I don't know what it could be.
>
> Is there a way to trigger a Travis job manually to see if the timeout
> reproduces?
>
>> Btw, did you receive a notification about this from the robot?
> Yes, I got it.
> But I misinterpreted it. I downloaded the log and started reading it
> from the end, and when I saw:
>
> Compiler stderr:^M
> /usr/bin/ld: cannot find -lvirt^M
> collect2: error: ld returned 1 exit status^M
>
> I thought that was it. Sorry for that.
>
> BTW I'm going to publish v6 with the changes suggested by Honnappa
> Nagarahalli (RELAXED memory mode) and David Hunt (indentations).
>
More bad news: the same issue just appeared on Travis for v6.
The good news is that we can reproduce it.

Is there a way to trigger a Travis job other than by sending a new patch version?
> Best regards
>
> Lukasz
>
> --
> Lukasz Wojciechowski
> Principal Software Engineer
>
> Samsung R&D Institute Poland
> Samsung Electronics
> Office +48 22 377 88 25
> l.wojciechow@partner.samsung.com
Hello Lukasz,

On Sat, Oct 10, 2020 at 1:26 AM Lukasz Wojciechowski
<l.wojciechow@partner.samsung.com> wrote:
> On 09.10.2020 at 23:41, Lukasz Wojciechowski wrote:
> More bad news: the same issue just appeared on Travis for v6.
> The good news is that we can reproduce it.
>
> Is there a way to trigger a Travis job other than by sending a new patch version?

You just need to fork dpdk on GitHub, then set up Travis.
Travis will get triggered on push.
I can help offlist if needed.
On Sat, Oct 10, 2020 at 10:12 AM David Marchand
<david.marchand@redhat.com> wrote:
>
> Hello Lukasz,
>
> On Sat, Oct 10, 2020 at 1:26 AM Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com> wrote:
> > On 09.10.2020 at 23:41, Lukasz Wojciechowski wrote:
> > More bad news: the same issue just appeared on Travis for v6.
> > The good news is that we can reproduce it.
> >
> > Is there a way to trigger a Travis job other than by sending a new patch version?
>
> You just need to fork dpdk on GitHub, then set up Travis.

Forgot to paste it:
https://docs.travis-ci.com/user/tutorial/#to-get-started-with-travis-ci-using-github

> Travis will get triggered on push.
> I can help offlist if needed.
On 10.10.2020 at 10:12, David Marchand wrote:
> Hello Lukasz,
>
> On Sat, Oct 10, 2020 at 1:26 AM Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com> wrote:
>> On 09.10.2020 at 23:41, Lukasz Wojciechowski wrote:
>> More bad news: the same issue just appeared on Travis for v6.
>> The good news is that we can reproduce it.
>>
>> Is there a way to trigger a Travis job other than by sending a new patch version?
> You just need to fork dpdk on GitHub, then set up Travis.
> Travis will get triggered on push.
> I can help offlist if needed.
Thank you.

I managed to reproduce the issue by stressing my machine's CPUs and memory.

The issue was caused by a slow start of the worker threads: they had not
yet reached the point where they request packets, so they were treated as
not activated. Because of that, the distributor thread didn't send any
packets, but waited in an infinite loop for packets to be returned from
the workers.

I pushed v7 of the series with an additional patch that fixes this by
running rte_distributor_process() in a loop until it manages to send all
packets to the workers.
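The failure mode and the v7 fix described above can be modelled with a small single-threaded sketch. It does not use the real librte_distributor API: struct mock_distributor, mock_process() and send_all() are hypothetical stand-ins, the mock assumes the process call reports how many packets it accepted, and the "worker becomes active after a few calls" behaviour only simulates a slow-starting worker thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the distributor state; not the DPDK structure. */
struct mock_distributor {
	unsigned int calls_until_active; /* simulated worker start-up delay */
	unsigned int delivered;          /* packets handed to the worker so far */
};

/*
 * Mock of a process call: packets are accepted only once the worker is
 * active. Returns the number of packets accepted, which is 0 while the
 * worker is still starting up.
 */
static unsigned int mock_process(struct mock_distributor *d, unsigned int num)
{
	if (d->calls_until_active > 0) {
		d->calls_until_active--;
		return 0;
	}
	d->delivered += num;
	return num;
}

/*
 * Keep calling the process function until every packet has been accepted,
 * rather than calling it once and then waiting for returns that can never
 * arrive. max_calls is an addition of this sketch so the loop is bounded.
 * Returns the number of packets actually sent.
 */
static unsigned int send_all(struct mock_distributor *d, unsigned int num,
		unsigned int max_calls)
{
	unsigned int sent = 0;

	assert(d != NULL);
	for (unsigned int call = 0; call < max_calls && sent < num; call++)
		sent += mock_process(d, num - sent);
	return sent;
}
```

With a start-up delay of three calls, a single mock_process() call accepts nothing, which corresponds to the hang seen on the loaded CI machine, while send_all() keeps retrying until the worker is active and all packets are delivered.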