From patchwork Thu Apr 13 20:58:35 2017
X-Patchwork-Submitter: Qinglai Xiao
X-Patchwork-Id: 23633
Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
From: jigsaw
To: Adrien Mazarguil
Cc: "dev@dpdk.org"
Date: Thu, 13 Apr 2017 23:58:35 +0300
In-Reply-To: <20170413145910.GG3790@6wind.com>
References: <20170413145910.GG3790@6wind.com>
List-Id: DPDK patches and discussions

Hi Adrien,

Thanks for your comment.

LOCK/UNLOCK may be called by the user application only. There are several
reasons.

1. If the lib calls LOCK, the user application may be punished
unexpectedly. Consider what happens if the Rx burst function takes the
LOCK on core #1, and the mbuf is then handed over from core #1 to core #2
through an enqueue/dequeue operation, as below (a sketch of this case
follows the list):

   Rx(1) --> Enqueue(1) --> Dequeue(2)
   LOCKED                   Panic!

Core #2 panics, because the mbuf is still owned by core #1 and was never
released.

2. Rx and Tx usually both work in burst mode, combined with prefetching.
LOCK and UNLOCK cannot work well with prefetch, because they require data
access to the mbuf header.

3. Critical sections tend to be small.
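Here is a sketch of case 1 (untested, hypothetical code; the ring handoff
and the dump call are mere placeholders for whatever the application does):

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define BURST_SIZE 32

/* core #1: receive, then hand the mbufs over through a ring; imagine
 * rte_eth_rx_burst() called RTE_MBUF_LOCK() on each mbuf internally */
static void
rx_core(uint8_t portid, uint16_t queueid, struct rte_ring *ring)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t nb_rx = rte_eth_rx_burst(portid, queueid, pkts, BURST_SIZE);

	/* the locks are still held by core #1 at this point */
	rte_ring_enqueue_burst(ring, (void **)pkts, nb_rx);
}

/* core #2: dequeue and touch the packet data */
static void
worker_core(struct rte_ring *ring)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	unsigned int i;
	unsigned int nb = rte_ring_dequeue_burst(ring, (void **)pkts,
						 BURST_SIZE);

	for (i = 0; i < nb; i++)
		/* any access to the data panics: buf_addr is NULL and the
		 * saved pointer sits in core #1's cache, never released */
		rte_pktmbuf_dump(stdout, pkts[i], 64);
}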
Here we (the user application) need to find a balance: a longer critical
section gives better protection, but a longer duration also leads to more
complicated business logic. Hence we urge the user application to use
LOCK/UNLOCK with common sense about critical sections.

Take examples/l3fwd for example. A proper LOCK/UNLOCK pair is shown below
(an excerpt from the diff at the bottom of this mail):

==========================================
 * then send them straightway.
@@ -492,8 +495,10 @@ send_packets_multi(struct lcore_conf *qconf, struct rte_mbuf **pkts_burst,
 		if (likely(pn != BAD_PORT))
 			send_packetsx4(qconf, pn, pkts_burst + j, k);
 		else
-			for (m = j; m != j + k; m++)
+			for (m = j; m != j + k; m++) {
+				RTE_MBUF_UNLOCK(pkts_burst[m]);
 				rte_pktmbuf_free(pkts_burst[m]);
+			}
 	}
 }
==========================================

Note that a data race may or may not have a visible consequence. If two
cores unknowingly process the same mbuf at different times, they may never
notice it; but if two cores access the same mbuf at the same physical
time, the consequence is usually visible (a crash). We are not seeking a
solution that captures every potential data race; we seek a solution that
captures data races happening simultaneously on two or more cores.
Therefore we do not need to extend the borders of locking as widely as
possible; we apply locking only while we are focused on processing a
single mbuf.

In a real-life application a core spends quite some time on each mbuf,
typically from a few hundred to a few thousand cycles, and usually no more
than a handful of mbufs are involved at once. This is an ideal use case
for locking mbufs.

I agree that the race detection shall not be compiled in by default, since
mtod is called often and every mtod implies a visit to the per-core cache.
Further, recursive calls of LOCK/UNLOCK shall be supported as well. But I
don't think the refcnt logic should be taken into account; the two are
orthogonal designs, IMO. ***** Please correct me if I am wrong here. *****
Neither is LOCK/UNLOCK aware of mbuf alloc/free, for the same reason. That
is why I said LOCK/UNLOCK needs to survive the mbuf's allocation-time
initialization.

Of course we need to support locking multiple mbufs at the same time. For
each core we preserve, say, 8 slots, which work exactly like a
direct-mapped cache. That is, we use 3 bits of the mbuf address to locate
its slot. If the slot is already occupied, we do an eviction: the new mbuf
takes the place of the old one, and the old one is then UNLOCKed,
unfortunately.
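A rough, untested sketch of such a slot table is below; all names
(lock_cache, store/load/purge_per_core_cache) are tentative:

#include <stdint.h>
#include <rte_atomic.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define MBUF_LOCK_SLOTS 8 /* 8 slots => 3 index bits */

struct mbuf_lock_slot {
	struct rte_mbuf *m;	/* NULL while the slot is free */
	void *buf_addr;		/* saved pointer while m is locked */
};

/* one small direct-mapped table per lcore */
static struct mbuf_lock_slot lock_cache[RTE_MAX_LCORE][MBUF_LOCK_SLOTS];

/* take 3 bits of the mbuf address; mbufs are cache-line aligned, so the
 * low 6 bits carry no information */
static inline struct mbuf_lock_slot *
slot_of(const struct rte_mbuf *m)
{
	uintptr_t idx = ((uintptr_t)m >> 6) & (MBUF_LOCK_SLOTS - 1);

	return &lock_cache[rte_lcore_id()][idx];
}

/* remember m's buf_addr; on collision, evict the previous occupant by
 * restoring its pointer, which silently unlocks it */
static inline void
store_per_core_cache(struct rte_mbuf *m, void *buf_addr)
{
	struct mbuf_lock_slot *slot = slot_of(m);

	if (slot->m != NULL && slot->m != m) {
		slot->m->buf_addr = slot->buf_addr;
		rte_mb();
	}
	slot->m = m;
	slot->buf_addr = buf_addr;
}

/* the saved pointer if m is locked by this core, otherwise NULL */
static inline void *
load_per_core_cache(const struct rte_mbuf *m)
{
	struct mbuf_lock_slot *slot = slot_of(m);

	return slot->m == m ? slot->buf_addr : NULL;
}

/* forget about m without touching the mbuf itself */
static inline void
purge_per_core_cache(const struct rte_mbuf *m)
{
	struct mbuf_lock_slot *slot = slot_of(m);

	if (slot->m == m)
		slot->m = NULL;
}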
Honestly I have not yet tried this approach in a real-life application.
But I have been thinking about the problem of data race detection for a
long time, and I find that the restrictions and requirements make this
solution the only viable one. Hundreds of papers have been published in
the field of data race detection, but even the lightest of the options
carries at least a 5x performance penalty, not to mention the space
complexity, which makes them inapplicable in practice.

Again, please, anyone who has the same painful experience of data race
bugs, share your concerns with me. It would be nice to come up with some
practical device to address this problem. I believe 6WIND and other IP
stack vendors must share the same feeling and opinion.

thx & rgds,
Qinglai

On Thu, Apr 13, 2017 at 5:59 PM, Adrien Mazarguil
<adrien.mazarguil@6wind.com> wrote:

> Hi Qinglai,
>
> On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:
> > Hi,
> >
> > I have a proposal for mbuf race condition detection and I would like
> > to get your opinions before committing any patch.
> >
> > A race condition is the worst bug I can think of, as it causes crashes
> > long after the first crime scene, and the consequence is unpredictable
> > and difficult to understand.
> >
> > Besides, race conditions are very difficult to reproduce. Usually we
> > get a coredump from the live network, but the coredump itself delivers
> > nearly zero info. We just know that the mbuf is somehow broken and was
> > perhaps overwritten by another core at any moment.
> >
> > There are tools such as Valgrind and ThreadSanitizer to capture this
> > fault, but the overhead brought by these tools is too high for them to
> > be deployed in performance tests, not to mention in the live network.
> > Yet race conditions rarely happen under low pressure.
> >
> > Even worse, the tools mentioned are not capable of capturing this
> > scenario even in theory, because they require explicit calls to
> > locking primitives such as pthread mutexes; the semantics are not
> > understood by the tools without explicit lock/unlock.
> >
> > With these known restrictions, I have a proposal roughly as below.
> >
> > The idea is to ask the coder to do explicit lock/unlock on each mbuf
> > that matters. The pseudo code is as below:
> >
> > RTE_MBUF_LOCK(m);
> > /* use mbuf as usual */
> > ...
> > RTE_MBUF_UNLOCK(m);
> >
> > During the protected critical section, only the core which holds the
> > lock can access the mbuf; the other cores, if they dare to use the
> > mbuf, will be punished by a segfault.
> >
> > Since the proposal shall be feasible at least in performance tests,
> > the overhead of locking and punishment must be small, and access to
> > the mbuf must remain as usual.
> >
> > Hence, RTE_MBUF_LOCK actually does the following (pseudo code):
> >
> > RTE_MBUF_LOCK(m)
> > {
> >     store_per_core_cache(m, m->buf_addr);
> >     m->buf_addr = NULL;
> >     mb(); /* memory barrier */
> > }
> >
> > And RTE_MBUF_UNLOCK is simply the reverse:
> >
> > RTE_MBUF_UNLOCK(m)
> > {
> >     purge_per_core_cache(m);
> >     m->buf_addr = load_per_core_cache(m);
> >     mb();
> > }
> >
> > As we can see, the idea is to rewrite m->buf_addr with NULL and store
> > it in a per-core cache. The other cores, while trying to access the
> > mbuf during the critical section, will certainly be punished.
> >
> > Then of course we need to keep the owner core working. This is done by
> > making a change to mtod, as below:
> >
> > #define rte_pktmbuf_mtod_offset(m, t, o) \
> >     ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + \
> >          (m)->data_off + (o)))
> >
> > The per-core cache of mbufs works like a key-value database organized
> > as a direct-mapped cache. If an eviction happens (a newly arriving
> > mbuf must kick out an old one), then we restore the buf_addr of the
> > evicted mbuf. Of course RTE_MBUF_UNLOCK will then take care of this
> > situation.
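[ For concreteness: with the slot table sketched earlier in this mail, the
pseudo code above could take roughly the following shape. This is an
untested sketch; note that unlike the pseudo code, it loads the saved
pointer before purging the slot, and leaves buf_addr alone if the mbuf was
already evicted (and thus already restored). ]

#define RTE_MBUF_LOCK(m) do {				\
	store_per_core_cache((m), (m)->buf_addr);	\
	(m)->buf_addr = NULL;				\
	rte_mb(); /* memory barrier */			\
} while (0)

#define RTE_MBUF_UNLOCK(m) do {				\
	void *saved_ = load_per_core_cache(m);		\
	if (saved_ != NULL)				\
		(m)->buf_addr = saved_;			\
	purge_per_core_cache(m);			\
	rte_mb();					\
} while (0)

/* owner-core access stays intact: exactly one of buf_addr and the cached
 * pointer is non-NULL, so their sum is always the real buffer address */
#define rte_pktmbuf_mtod_offset(m, t, o)			\
	((t)((char *)((uintptr_t)(m)->buf_addr +		\
		      (uintptr_t)load_per_core_cache(m)) +	\
	     (m)->data_off + (o)))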
> >
> > Note that we expect LOCK/UNLOCK to work even when the mbuf is freed
> > and allocated, since we don't trust the refcnt. A double-free is
> > actually very rare, but a data race can occur without any double-free.
> > Therefore we need to survive the allocation of the mbuf, that is,
> > rte_pktmbuf_init, which resets the buf_addr.
> >
> > Further, the other dereferences of buf_addr in rte_mbuf.h need to be
> > revised. But that is not a big deal (I hope).
> >
> > The runtime overhead of this implementation is very light, and it can
> > probably be turned on even in the live network. And it is not
> > intrusive at all: no change is needed in struct rte_mbuf; we just need
> > O(N) space complexity (where N is the number of cores) and O(1)
> > runtime complexity.
> >
> > Alright, that is basically what is in my mind. Before any patch is
> > provided, or any perf analysis is made, I would like to know the risks
> > from a design point of view. Please leave your feedback.
>
> Your proposal makes sense but I'm not sure where developers should call
> RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user
> applications only? Is it to be used internally by DPDK as well? Does it
> include PMDs?
>
> I think any overhead outside of debugging mode, as minimal as it is, is
> too much of a "punishment" for well behaving applications. The cost of a
> memory barrier can be extremely high and this is one of the reasons DPDK
> processes mbufs as bulks every time it gets the chance. RTE_MBUF_LOCK()
> and RTE_MBUF_UNLOCK() should thus exist under some compilation option
> disabled by default, and be thought of as an additional debugging tool.
>
> Also the implementation would probably be more expensive than your
> suggestion; in my opinion the reference count must be taken into account
> and/or RTE_MBUF_LOCK() must be recursive, because this is how mbufs
> work. Freeing mbufs multiple times is perfectly valid in many cases.
>
> How can one lock several mbufs at once, by the way?
>
> Since it affects performance, this can only make sense as a debugging
> tool developers can use when they encounter some corruption issue they
> cannot identify, somewhat like poisoning buffers before they are freed.
> Not sure it's worth the trouble.
>
> Regards,
>
> --
> Adrien Mazarguil
> 6WIND

==========================================

diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h b/examples/l3fwd/l3fwd_em_hlm_sse.h
index 7714a20..9db0190 100644
--- a/examples/l3fwd/l3fwd_em_hlm_sse.h
+++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
@@ -299,6 +299,15 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 
 	for (j = 0; j < n; j += 8) {
 
+		RTE_MBUF_LOCK(pkts_burst[j]);
+		RTE_MBUF_LOCK(pkts_burst[j+1]);
+		RTE_MBUF_LOCK(pkts_burst[j+2]);
+		RTE_MBUF_LOCK(pkts_burst[j+3]);
+		RTE_MBUF_LOCK(pkts_burst[j+4]);
+		RTE_MBUF_LOCK(pkts_burst[j+5]);
+		RTE_MBUF_LOCK(pkts_burst[j+6]);
+		RTE_MBUF_LOCK(pkts_burst[j+7]);
+
 		uint32_t pkt_type =
 			pkts_burst[j]->packet_type &
 			pkts_burst[j+1]->packet_type &
@@ -333,8 +342,10 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		}
 	}
 
-	for (; j < nb_rx; j++)
+	for (; j < nb_rx; j++) {
+		RTE_MBUF_LOCK(pkts_burst[j]);
 		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
+	}
 
 	send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);

diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
index 1afa1f0..2938558 100644
--- a/examples/l3fwd/l3fwd_sse.h
+++ b/examples/l3fwd/l3fwd_sse.h
@@ -322,6 +322,9 @@ send_packetsx4(struct lcore_conf *qconf, uint8_t port, struct rte_mbuf *m[],
 
 	len = qconf->tx_mbufs[port].len;
 
+	for (j = 0; j < num; ++j)
+		RTE_MBUF_UNLOCK(m[j]);
+
 	/*
 	 * If TX buffer for that queue is empty, and we have enough packets,