From patchwork Thu Apr 13 20:58:35 2017
X-Patchwork-Submitter: Qinglai Xiao
X-Patchwork-Id: 23633
Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
From: jigsaw
To: Adrien Mazarguil
Cc: "dev@dpdk.org"
Date: Thu, 13 Apr 2017 23:58:35 +0300
In-Reply-To: <20170413145910.GG3790@6wind.com>
References: <20170413145910.GG3790@6wind.com>
List-Id: DPDK patches and discussions

Hi Adrien,

Thanks for your comment.

LOCK/UNLOCK may be called by the user application only. There are several
reasons.

1. If the lib calls LOCK, the user application may be punished
unexpectedly. Consider what happens if the Rx burst function takes the
LOCK on core #1, and the mbuf is then handed over from core #1 to core #2
through an enqueue/dequeue operation, as below (a sketch of this case
follows the list):

   Rx(1) --> Enqueue(1) --> Dequeue(2)
   LOCKED                   Panic!

Core #2 panics, because the mbuf is still owned by core #1 and was never
released.

2. Rx and Tx usually both work in burst mode, combined with prefetching.
LOCK and UNLOCK cannot work well with prefetch, because they require data
access to the mbuf header.

3. Critical sections tend to be small.
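Here is a sketch of case 1 (untested, hypothetical code; the ring handoff
and the dump call are mere placeholders for whatever the application does):

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define BURST_SIZE 32

/* core #1: receive, then hand the mbufs over through a ring; imagine
 * rte_eth_rx_burst() called RTE_MBUF_LOCK() on each mbuf internally */
static void
rx_core(uint8_t portid, uint16_t queueid, struct rte_ring *ring)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t nb_rx = rte_eth_rx_burst(portid, queueid, pkts, BURST_SIZE);

	/* the locks are still held by core #1 at this point */
	rte_ring_enqueue_burst(ring, (void **)pkts, nb_rx);
}

/* core #2: dequeue and touch the packet data */
static void
worker_core(struct rte_ring *ring)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	unsigned int i;
	unsigned int nb = rte_ring_dequeue_burst(ring, (void **)pkts,
						 BURST_SIZE);

	for (i = 0; i < nb; i++)
		/* any access to the data panics: buf_addr is NULL and the
		 * saved pointer sits in core #1's cache, never released */
		rte_pktmbuf_dump(stdout, pkts[i], 64);
}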
Here we (the user application) need to find a balance: a longer critical
section gives better protection, but a longer duration also leads to more
complicated business logic. Hence we urge the user application to use
LOCK/UNLOCK with common sense about critical sections.

Take examples/l3fwd for example. A proper LOCK/UNLOCK pair is shown below
(an excerpt from the diff at the bottom of this mail):

==========================================
 * then send them straightway.
@@ -492,8 +495,10 @@ send_packets_multi(struct lcore_conf *qconf, struct rte_mbuf **pkts_burst,
 		if (likely(pn != BAD_PORT))
 			send_packetsx4(qconf, pn, pkts_burst + j, k);
 		else
-			for (m = j; m != j + k; m++)
+			for (m = j; m != j + k; m++) {
+				RTE_MBUF_UNLOCK(pkts_burst[m]);
 				rte_pktmbuf_free(pkts_burst[m]);
+			}
 	}
 }
==========================================

Note that a data race may or may not have a visible consequence. If two
cores unknowingly process the same mbuf at different times, they may never
notice it; but if two cores access the same mbuf at the same physical
time, the consequence is usually visible (a crash). We are not seeking a
solution that captures every potential data race; we seek a solution that
captures data races happening simultaneously on two or more cores.
Therefore we do not need to extend the borders of locking as widely as
possible; we apply locking only while we are focused on processing a
single mbuf.

In a real-life application a core spends quite some time on each mbuf,
typically from a few hundred to a few thousand cycles, and usually no more
than a handful of mbufs are involved at once. This is an ideal use case
for locking mbufs.

I agree that the race detection shall not be compiled in by default, since
mtod is called often and every mtod implies a visit to the per-core cache.
Further, recursive calls of LOCK/UNLOCK shall be supported as well. But I
don't think the refcnt logic should be taken into account; the two are
orthogonal designs, IMO. ***** Please correct me if I am wrong here. *****
Neither is LOCK/UNLOCK aware of mbuf alloc/free, for the same reason. That
is why I said LOCK/UNLOCK needs to survive the mbuf's allocation-time
initialization.

Of course we need to support locking multiple mbufs at the same time. For
each core we preserve, say, 8 slots, which work exactly like a
direct-mapped cache. That is, we use 3 bits of the mbuf address to locate
its slot. If the slot is already occupied, we do an eviction: the new mbuf
takes the place of the old one, and the old one is then UNLOCKed,
unfortunately.
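A rough, untested sketch of such a slot table is below; all names
(lock_cache, store/load/purge_per_core_cache) are tentative:

#include <stdint.h>
#include <rte_atomic.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define MBUF_LOCK_SLOTS 8 /* 8 slots => 3 index bits */

struct mbuf_lock_slot {
	struct rte_mbuf *m;	/* NULL while the slot is free */
	void *buf_addr;		/* saved pointer while m is locked */
};

/* one small direct-mapped table per lcore */
static struct mbuf_lock_slot lock_cache[RTE_MAX_LCORE][MBUF_LOCK_SLOTS];

/* take 3 bits of the mbuf address; mbufs are cache-line aligned, so the
 * low 6 bits carry no information */
static inline struct mbuf_lock_slot *
slot_of(const struct rte_mbuf *m)
{
	uintptr_t idx = ((uintptr_t)m >> 6) & (MBUF_LOCK_SLOTS - 1);

	return &lock_cache[rte_lcore_id()][idx];
}

/* remember m's buf_addr; on collision, evict the previous occupant by
 * restoring its pointer, which silently unlocks it */
static inline void
store_per_core_cache(struct rte_mbuf *m, void *buf_addr)
{
	struct mbuf_lock_slot *slot = slot_of(m);

	if (slot->m != NULL && slot->m != m) {
		slot->m->buf_addr = slot->buf_addr;
		rte_mb();
	}
	slot->m = m;
	slot->buf_addr = buf_addr;
}

/* the saved pointer if m is locked by this core, otherwise NULL */
static inline void *
load_per_core_cache(const struct rte_mbuf *m)
{
	struct mbuf_lock_slot *slot = slot_of(m);

	return slot->m == m ? slot->buf_addr : NULL;
}

/* forget about m without touching the mbuf itself */
static inline void
purge_per_core_cache(const struct rte_mbuf *m)
{
	struct mbuf_lock_slot *slot = slot_of(m);

	if (slot->m == m)
		slot->m = NULL;
}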
Honestly I have not yet tried this approach in a real-life application.
But I have been thinking about the problem of data race detection for a
long time, and I find that the restrictions and requirements make this
solution the only viable one. Hundreds of papers have been published in
the field of data race detection, but even the lightest of the options
carries at least a 5x performance penalty, not to mention the space
complexity, which makes them inapplicable in practice.

Again, please, anyone who has the same painful experience of data race
bugs, share your concerns with me. It would be nice to come up with some
practical device to address this problem. I believe 6WIND and other IP
stack vendors must share the same feeling and opinion.

thx & rgds,
Qinglai

On Thu, Apr 13, 2017 at 5:59 PM, Adrien Mazarguil
<adrien.mazarguil@6wind.com> wrote:

> Hi Qinglai,
>
> On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:
> > Hi,
> >
> > I have a proposal for mbuf race condition detection and I would like
> > to get your opinions before committing any patch.
> >
> > A race condition is the worst bug I can think of, as it causes crashes
> > long after the first crime scene, and the consequence is unpredictable
> > and difficult to understand.
> >
> > Besides, race conditions are very difficult to reproduce. Usually we
> > get a coredump from the live network, but the coredump itself delivers
> > nearly zero info. We just know that the mbuf is somehow broken and was
> > perhaps overwritten by another core at any moment.
> >
> > There are tools such as Valgrind and ThreadSanitizer to capture this
> > fault, but the overhead brought by these tools is too high for them to
> > be deployed in performance tests, not to mention in the live network.
> > Yet race conditions rarely happen under low pressure.
> >
> > Even worse, the tools mentioned are not capable of capturing this
> > scenario even in theory, because they require explicit calls to
> > locking primitives such as pthread mutexes; the semantics are not
> > understood by the tools without explicit lock/unlock.
> >
> > With these known restrictions, I have a proposal roughly as below.
> >
> > The idea is to ask the coder to do explicit lock/unlock on each mbuf
> > that matters. The pseudo code is as below:
> >
> > RTE_MBUF_LOCK(m);
> > /* use mbuf as usual */
> > ...
> > RTE_MBUF_UNLOCK(m);
> >
> > During the protected critical section, only the core which holds the
> > lock can access the mbuf; the other cores, if they dare to use the
> > mbuf, will be punished by a segfault.
> >
> > Since the proposal shall be feasible at least in performance tests,
> > the overhead of locking and punishment must be small, and access to
> > the mbuf must remain as usual.
> >
> > Hence, RTE_MBUF_LOCK actually does the following (pseudo code):
> >
> > RTE_MBUF_LOCK(m)
> > {
> >     store_per_core_cache(m, m->buf_addr);
> >     m->buf_addr = NULL;
> >     mb(); /* memory barrier */
> > }
> >
> > And RTE_MBUF_UNLOCK is simply the reverse:
> >
> > RTE_MBUF_UNLOCK(m)
> > {
> >     purge_per_core_cache(m);
> >     m->buf_addr = load_per_core_cache(m);
> >     mb();
> > }
> >
> > As we can see, the idea is to rewrite m->buf_addr with NULL and store
> > it in a per-core cache. The other cores, while trying to access the
> > mbuf during the critical section, will certainly be punished.
> >
> > Then of course we need to keep the owner core working. This is done by
> > making a change to mtod, as below:
> >
> > #define rte_pktmbuf_mtod_offset(m, t, o) \
> >     ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + \
> >          (m)->data_off + (o)))
> >
> > The per-core cache of mbufs works like a key-value database organized
> > as a direct-mapped cache. If an eviction happens (a newly arriving
> > mbuf must kick out an old one), then we restore the buf_addr of the
> > evicted mbuf. Of course RTE_MBUF_UNLOCK will then take care of this
> > situation.
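[ For concreteness: with the slot table sketched earlier in this mail, the
pseudo code above could take roughly the following shape. This is an
untested sketch; note that unlike the pseudo code, it loads the saved
pointer before purging the slot, and leaves buf_addr alone if the mbuf was
already evicted (and thus already restored). ]

#define RTE_MBUF_LOCK(m) do {				\
	store_per_core_cache((m), (m)->buf_addr);	\
	(m)->buf_addr = NULL;				\
	rte_mb(); /* memory barrier */			\
} while (0)

#define RTE_MBUF_UNLOCK(m) do {				\
	void *saved_ = load_per_core_cache(m);		\
	if (saved_ != NULL)				\
		(m)->buf_addr = saved_;			\
	purge_per_core_cache(m);			\
	rte_mb();					\
} while (0)

/* owner-core access stays intact: exactly one of buf_addr and the cached
 * pointer is non-NULL, so their sum is always the real buffer address */
#define rte_pktmbuf_mtod_offset(m, t, o)			\
	((t)((char *)((uintptr_t)(m)->buf_addr +		\
		      (uintptr_t)load_per_core_cache(m)) +	\
	     (m)->data_off + (o)))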
> >
> > Note that we expect LOCK/UNLOCK to work even when the mbuf is freed
> > and allocated, since we don't trust the refcnt. A double-free is
> > actually very rare, but a data race can occur without any double-free.
> > Therefore we need to survive the allocation of the mbuf, that is,
> > rte_pktmbuf_init, which resets the buf_addr.
> >
> > Further, the other dereferences of buf_addr in rte_mbuf.h need to be
> > revised. But that is not a big deal (I hope).
> >
> > The runtime overhead of this implementation is very light, and it can
> > probably be turned on even in the live network. And it is not
> > intrusive at all: no change is needed in struct rte_mbuf; we just need
> > O(N) space complexity (where N is the number of cores) and O(1)
> > runtime complexity.
> >
> > Alright, that is basically what is in my mind. Before any patch is
> > provided, or any perf analysis is made, I would like to know the risks
> > from a design point of view. Please leave your feedback.
>
> Your proposal makes sense but I'm not sure where developers should call
> RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user
> applications only? Is it to be used internally by DPDK as well? Does it
> include PMDs?
>
> I think any overhead outside of debugging mode, as minimal as it is, is
> too much of a "punishment" for well behaving applications. The cost of a
> memory barrier can be extremely high and this is one of the reasons DPDK
> processes mbufs as bulks every time it gets the chance. RTE_MBUF_LOCK()
> and RTE_MBUF_UNLOCK() should thus exist under some compilation option
> disabled by default, and be thought of as an additional debugging tool.
>
> Also the implementation would probably be more expensive than your
> suggestion; in my opinion the reference count must be taken into account
> and/or RTE_MBUF_LOCK() must be recursive, because this is how mbufs
> work. Freeing mbufs multiple times is perfectly valid in many cases.
>
> How can one lock several mbufs at once, by the way?
>
> Since it affects performance, this can only make sense as a debugging
> tool developers can use when they encounter some corruption issue they
> cannot identify, somewhat like poisoning buffers before they are freed.
> Not sure it's worth the trouble.
>
> Regards,
>
> --
> Adrien Mazarguil
> 6WIND

==========================================

diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h b/examples/l3fwd/l3fwd_em_hlm_sse.h
index 7714a20..9db0190 100644
--- a/examples/l3fwd/l3fwd_em_hlm_sse.h
+++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
@@ -299,6 +299,15 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 
 	for (j = 0; j < n; j += 8) {
 
+		RTE_MBUF_LOCK(pkts_burst[j]);
+		RTE_MBUF_LOCK(pkts_burst[j+1]);
+		RTE_MBUF_LOCK(pkts_burst[j+2]);
+		RTE_MBUF_LOCK(pkts_burst[j+3]);
+		RTE_MBUF_LOCK(pkts_burst[j+4]);
+		RTE_MBUF_LOCK(pkts_burst[j+5]);
+		RTE_MBUF_LOCK(pkts_burst[j+6]);
+		RTE_MBUF_LOCK(pkts_burst[j+7]);
+
 		uint32_t pkt_type =
 			pkts_burst[j]->packet_type &
 			pkts_burst[j+1]->packet_type &
@@ -333,8 +342,10 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		}
 	}
 
-	for (; j < nb_rx; j++)
+	for (; j < nb_rx; j++) {
+		RTE_MBUF_LOCK(pkts_burst[j]);
 		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
+	}
 
 	send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);

diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
index 1afa1f0..2938558 100644
--- a/examples/l3fwd/l3fwd_sse.h
+++ b/examples/l3fwd/l3fwd_sse.h
@@ -322,6 +322,9 @@ send_packetsx4(struct lcore_conf *qconf, uint8_t port, struct rte_mbuf *m[],
 
 	len = qconf->tx_mbufs[port].len;
 
+	for (j = 0; j < num; ++j)
+		RTE_MBUF_UNLOCK(m[j]);
+
 	/*
 	 * If TX buffer for that queue is empty, and we have enough packets,