diff mbox

[dpdk-dev] Proposal of mbuf Race Condition Detection

Message ID CAHVfvh4j6UdLBqRP7PJmhhSwy+nZd262B9bG9gG=TxDYSKj9QA@mail.gmail.com (mailing list archive)
State Not Applicable, archived
Headers show

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation success Compilation OK

Commit Message

Qinglai Xiao April 13, 2017, 8:58 p.m. UTC
Hi Adrien,

Thanks for your comment.

The LOCK/UNLOCK may be called by the user application only. There are
several reasons.

1. If the lib calls LOCK, the user application may be punished
unexpectedly. Consider what happens if the Rx burst function calls LOCK
on core #1, and the mbuf is then handed over from core #1 to core #2
through an enqueue/dequeue operation, as below:

Rx(1) --> Enqueue(1) --> Dequeue(2)
LOCKED                       Panic!

Core #2 will then panic, because the mbuf is owned by core #1 and was
never released.

2. Rx and Tx both usually work in burst mode, combined with prefetch
operations. Meanwhile, LOCK and UNLOCK cannot work well with prefetch,
because they require data access to the mbuf header.

3. The critical section tends to be small. Here we (the user
application) need to find a balance: with a longer critical section we
can have better-protected code; however, a longer duration leads to
more complicated business logic.

Hence, we urge user applications to use LOCK/UNLOCK with the usual
common sense about critical sections.

Take examples/l3fwd as an example. A proper LOCK/UNLOCK pairing is
shown below:

==========================================
diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h b/examples/l3fwd/l3fwd_em_hlm_sse.h
index 7714a20..9db0190 100644
--- a/examples/l3fwd/l3fwd_em_hlm_sse.h
+++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
@@ -299,6 +299,15 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,

        for (j = 0; j < n; j += 8) {

+               RTE_MBUF_LOCK(pkts_burst[j]);
+               RTE_MBUF_LOCK(pkts_burst[j+1]);
+               RTE_MBUF_LOCK(pkts_burst[j+2]);
+               RTE_MBUF_LOCK(pkts_burst[j+3]);
+               RTE_MBUF_LOCK(pkts_burst[j+4]);
+               RTE_MBUF_LOCK(pkts_burst[j+5]);
+               RTE_MBUF_LOCK(pkts_burst[j+6]);
+               RTE_MBUF_LOCK(pkts_burst[j+7]);
+
                uint32_t pkt_type =
                        pkts_burst[j]->packet_type &
                        pkts_burst[j+1]->packet_type &
@@ -333,8 +342,10 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
                }
        }

-       for (; j < nb_rx; j++)
+       for (; j < nb_rx; j++) {
+               RTE_MBUF_LOCK(pkts_burst[j]);
                dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
+       }

        send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);

diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
index 1afa1f0..2938558 100644
--- a/examples/l3fwd/l3fwd_sse.h
+++ b/examples/l3fwd/l3fwd_sse.h
@@ -322,6 +322,9 @@ send_packetsx4(struct lcore_conf *qconf, uint8_t port, struct rte_mbuf *m[],

        len = qconf->tx_mbufs[port].len;

+       for (j = 0; j < num; ++j)
+               RTE_MBUF_UNLOCK(m[j]);
+
        /*
         * If TX buffer for that queue is empty, and we have enough packets,
         * then send them straightway.
@@ -492,8 +495,10 @@ send_packets_multi(struct lcore_conf *qconf, struct rte_mbuf **pkts_burst,
                if (likely(pn != BAD_PORT))
                        send_packetsx4(qconf, pn, pkts_burst + j, k);
                else
-                       for (m = j; m != j + k; m++)
+                       for (m = j; m != j + k; m++) {
+                               RTE_MBUF_UNLOCK(pkts_burst[m]);
                                rte_pktmbuf_free(pkts_burst[m]);
+                       }

        }
 }
==========================================

Note that a data race may or may not have a visible consequence. If two
cores unknowingly process the same mbuf at different times, they may
never notice it; but if two cores access the same mbuf at the same
physical time, the consequence is usually visible (a crash). We do not
seek a solution that captures even potential data races; instead, we
seek a solution that captures data races happening simultaneously on
two or more cores. Therefore, we do not need to extend the scope of
locking as wide as possible. We will apply locking only while we are
focusing on processing a single mbuf.

In a real-life application, a core will spend quite some time on each
mbuf. The interval ranges from a few hundred cycles to a few thousand
cycles, and usually no more than a handful of mbufs are involved. This
is an ideal use case for locking mbufs.

I agree that the race detection shall not be compiled in by default,
since mtod is called often, and every mtod implies a visit to the local
cache. Further, recursive calls of LOCK/UNLOCK shall be supported as
well. But I don't think the refcnt logic should be taken into account;
these two are orthogonal designs, IMO. ***** Please correct me if I am
wrong here. *****

Neither is LOCK/UNLOCK aware of mbuf alloc/free, for the same reason.
That is why I said LOCK/UNLOCK needs to survive mbuf allocation and
initialization.

Of course we need to support locking multiple mbufs at the same time.
For each core, we will then reserve, say, 8 slots. This works exactly
like a direct-mapped cache: we can use 3 bits of the mbuf address to
locate its slot. If the slot is already occupied, we do an eviction;
that is, the new mbuf takes the place of the old one, and the old one
is, unfortunately, UNLOCKed.

Honestly, I have not yet tried this approach in a real-life
application. But I have been thinking over the problem of data race
detection for a long time, and I found that the restrictions and
requirements make this solution the only viable one. There are hundreds
of papers published in the field of data race detection, but the
lightest of the options carries at least a 5x performance penalty, not
to mention the space complexity, making them inapplicable in practice.

Again, please, if anyone has the same painful experience with data race
bugs, share your concerns with me. It would be nice to come up with
some practical device to address this problem. I believe 6WIND and
other IP stack vendors must share the same feeling and opinion.

thx &
rgds,

Qinglai



On Thu, Apr 13, 2017 at 5:59 PM, Adrien Mazarguil
<adrien.mazarguil@6wind.com> wrote:

> Hi Qinglai,
>
> On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:
> > Hi,
> >
> > I have a proposal for mbuf race condition detection and I would like
> > to get your opinions before committing any patch.
> >
> > A race condition is the worst bug I can think of, as it causes
> > crashes long after the first crime scene, and the consequences are
> > unpredictable and difficult to understand.
> >
> > Besides, a race condition is very difficult to reproduce. Usually we
> > get a coredump from the live network, but the coredump itself
> > delivers nearly zero info. We just know that the mbuf is somehow
> > broken, and that it was perhaps overwritten by another core at some
> > moment.
> >
> > There are tools such as Valgrind and ThreadSanitizer to capture this
> > fault. But the overhead brought by the tools is too high for them to
> > be deployed in performance tests, not to mention in the live
> > network. And race conditions rarely happen under low pressure.
> >
> > Even worse, even in theory, the tools mentioned are not capable of
> > capturing this scenario, because they require explicit calls to
> > locking primitives such as pthread mutexes. This is because the
> > semantics are not understood by the tools without explicit
> > lock/unlock.
> >
> > With such known restrictions, I have a proposal roughly as below.
> >
> > The idea is to ask the coder to do explicit lock/unlock on each mbuf
> > that matters. The pseudo code is as below:
> >
> >     RTE_MBUF_LOCK(m);
> >     /* use mbuf as usual */
> >     ...
> >     RTE_MBUF_UNLOCK(m);
> >
> > During the protected critical section, only the core which holds the
> > lock can access the mbuf; other cores, if they dare to use the mbuf,
> > will be punished by a segfault.
> >
> > Since the proposal shall be feasible at least in performance tests,
> > the overhead of locking and punishment must be small, and access to
> > the mbuf must remain as usual.
> >
> > Hence, the RTE_MBUF_LOCK is actually doing the following (pseudo code):
> >
> > RTE_MBUF_LOCK(m)
> > {
> >     store_per_core_cache(m, m->buf_addr);
> >     m->buf_addr = NULL;
> >     mb(); // memory barrier
> > }
> >
> > And RTE_MBUF_UNLOCK is simply the reverse:
> >
> > RTE_MBUF_UNLOCK(m)
> > {
> >     purge_per_core_cache(m);
> >     m->buf_addr = load_per_core_cache(m);
> >     mb();
> > }
> >
> > As we can see, the idea is to re-write m->buf_addr with NULL and
> > store it in a per-core cache. The other cores, while trying to
> > access the mbuf during the critical section, will certainly be
> > punished.
> >
> > Then of course we need to keep the owner core working. This is done by
> > making changes to
> > mtod, as below:
> >
> > #define rte_pktmbuf_mtod_offset(m, t, o) \
> >     ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + (m)->data_off +
> > (o)))
> >
> > The per-core cache of mbufs works like a key-value database with
> > direct mapping. If an eviction happens (a newly arriving mbuf must
> > kick out an old one), then we restore the buf_addr of the evicted
> > mbuf. Of course, RTE_MBUF_UNLOCK will then take care of such a
> > situation.
> >
> > Note that we expect the LOCK/UNLOCK to work even when the mbuf is
> > freed and allocated, since we don't trust the refcnt. A double-free
> > is actually very rare, but a data race can occur without a
> > double-free. Therefore, we need to survive the allocation of the
> > mbuf, that is, rte_pktmbuf_init, which resets the buf_addr.
> >
> > Further, other dereferences of buf_addr in rte_mbuf.h need to be
> > revised. But it is not a big deal (I hope).
> >
> > The runtime overhead of this implementation is very lightweight, and
> > it probably can be turned on even in a live network. And it is not
> > intrusive at all: no change is needed in struct rte_mbuf; we just
> > need O(N) space complexity (where N is the number of cores) and O(1)
> > runtime complexity.
> >
> > Alright, that is basically what is on my mind. Before any patch is
> > provided, or any perf analysis is made, I would like to know the
> > risks from a design point of view. Please leave your feedback.
>
> Your proposal makes sense but I'm not sure where developers should call
> RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user applications
> only? Is it to be used internally by DPDK as well? Does it include PMDs?
>
> I think any overhead outside of debugging mode, as minimal as it is, is too
> much of a "punishment" for well behaving applications. The cost of a memory
> barrier can be extremely high and this is one of the reasons DPDK processes
> mbufs as bulks every time it gets the chance. RTE_MBUF_LOCK() and
> RTE_MBUF_UNLOCK() should thus exist under some compilation option disabled
> by default, and be thought of as an additional debugging tool.
>
> Also the implementation would probably be more expensive than your
> suggestion, in my opinion the reference count must be taken into account
> and/or RTE_MBUF_LOCK() must be recursive, because this is how mbufs
> work. Freeing mbufs multiple times is perfectly valid in many cases.
>
> How can one lock several mbufs at once by the way?
>
> Since it affects performance, this can only make sense as a debugging tool
> developers can use when they encounter some corruption issue they cannot
> identify, somewhat like poisoning buffers before they are freed. Not sure
> it's worth the trouble.
>
> Regards,
>
> --
> Adrien Mazarguil
> 6WIND
>

Comments

Ananyev, Konstantin April 13, 2017, 11:19 p.m. UTC | #1
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of jigsaw
> Sent: Thursday, April 13, 2017 9:59 PM
> To: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
>
> [... full quote of the message above snipped ...]
I agree with Adrien here - it doesn't look worth it:
From one side, it adds quite a lot of overhead and complexity to the mbuf code.
From the other side, its usage seems pretty limited - there would be a lot of
cases where it wouldn't be able to detect the race condition anyway.
So my opinion is NACK.
BTW, if your app is intentionally trying to do read/write to the same mbufs simultaneously from different threads
without some explicit synchronization, then most likely there is something wrong with it.
Konstantin
Stephen Hemminger April 14, 2017, 12:03 a.m. UTC | #2
On Thu, 13 Apr 2017 23:19:45 +0000
"Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:

> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of jigsaw
> > Sent: Thursday, April 13, 2017 9:59 PM
> > To: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
> > 
> > Hi Adrien,
> > 
> > Thanks for your comment.
> > 
> > The LOCK/UNLOCK may be called by user application only. There are several
> > reasons.
> > 
> > 1. If the lib calls LOCK, user application may be punished unexpectedly.
> > Consider what if the Rx burst function calls the LOCK in core #1, and then
> > the mbuf is handed over from
> > core #1 to core #2 through a enqueue/dequeue operation, as below:
> > 
> > Rx(1) --> Enqueue(1) --> Dequeue(2)
> > LOCKED                       Panic!
> > 
> > The core #2 will then panic because the mbuf is owned by core #1 without
> > being released.
> > 
> > 2. Rx and Tx both usually works in a burst mode, combined with prefetch
> > operation. Meanwhile
> > LOCK and UNLOCK cannot work well with prefetch, because it requires data
> > access of mbuf header.
> > 
> > 3. The critical session tends to be small. Here we (user application) need
> > to find a balance: with longer interval of critical
> > section, we can have more secured code; however, longer duration leads to
> > more complicated business
> > logic.
> > 
> > Hence, we urge user application to use LOCK/UNLOCK with the common sense of
> > critical sections.
> > 
> > Take the examples/l3fwd for example. A proper LOCK/UNLOCK pair is shown
> > below:
> > 
> > ==========================================
> > diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h
> > b/examples/l3fwd/l3fwd_em_hlm_sse.h
> > index 7714a20..9db0190 100644
> > --- a/examples/l3fwd/l3fwd_em_hlm_sse.h
> > +++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
> > @@ -299,6 +299,15 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf
> > **pkts_burst,
> > 
> >         for (j = 0; j < n; j += 8) {
> > 
> > +            RTE_MBUF_LOCK(pkts_burst[j]);
> > +            RTE_MBUF_LOCK(pkts_burst[j+1]);
> > +            RTE_MBUF_LOCK(pkts_burst[j+2]);
> > +            RTE_MBUF_LOCK(pkts_burst[j+3]);
> > +            RTE_MBUF_LOCK(pkts_burst[j+4]);
> > +            RTE_MBUF_LOCK(pkts_burst[j+5]);
> > +            RTE_MBUF_LOCK(pkts_burst[j+6]);
> > +            RTE_MBUF_LOCK(pkts_burst[j+7]);
> > +
> >                 uint32_t pkt_type =
> >                         pkts_burst[j]->packet_type &
> >                         pkts_burst[j+1]->packet_type &
> > @@ -333,8 +342,10 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf
> > **pkts_burst,
> >                 }
> >         }
> > 
> > -       for (; j < nb_rx; j++)
> > +      for (; j < nb_rx; j++) {
> > +              RTE_MBUF_LOCK(pkts_burst[j]);
> >                 dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
> > +      }
> > 
> >         send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
> > 
> > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > index 1afa1f0..2938558 100644
> > --- a/examples/l3fwd/l3fwd_sse.h
> > +++ b/examples/l3fwd/l3fwd_sse.h
> > @@ -322,6 +322,9 @@ send_packetsx4(struct lcore_conf *qconf, uint8_t port,
> > struct rte_mbuf *m[],
> > 
> >         len = qconf->tx_mbufs[port].len;
> > 
> > +      for (j = 0; j < num; ++j)
> > +            RTE_MBUF_UNLOCK(m);
> > +
> >         /*
> >          * If TX buffer for that queue is empty, and we have enough packets,
> >          * then send them straightway.
> > @@ -492,8 +495,10 @@ send_packets_multi(struct lcore_conf *qconf, struct
> > rte_mbuf **pkts_burst,
> >                 if (likely(pn != BAD_PORT))
> >                         send_packetsx4(qconf, pn, pkts_burst + j, k);
> >                 else
> > -                       for (m = j; m != j + k; m++)
> > +                      for (m = j; m != j + k; m++) {
> > +                              RTE_MBUF_UNLOCK(pkts_burst[m]);
> >                                 rte_pktmbuf_free(pkts_burst[m]);
> > +                      }
> > 
> >         }
> >  }
> > ==========================================
> > 
> > Note that data race may or may not have visible consequence. If two cores
> > unconsciously process same mbuf
> > at different time, they may never notice it; but if two cores access same
> > mbuf at the same physical time, the
> > consequence is usually visible (crash). We don't seek for a solution that
> > captures even potential data race;
> > instead, we seek for a solution that can capture data race that happens
> > simultaneously in two or more cores.
> > Therefore, we do not need to extend the border of locking as wide as
> > possible. We will apply locking only when
> > we are focusing on a single mbuf processing.
> > 
> > In a real life application, a core will spend quite some time on each mbuf.
> > The interval ranges from a few hundred
> > cycles to a few thousand cycles. And usually not more than a handful of
> > mbuf are involved. This is a ideal use case
> > for locking mbuf.
> > 
> > I agree that the race detection shall not be compiled by default, since
> > mtod is often called, and every mtod implies a
> > visit to local cache. Further, recursive call of LOCK/UNLOCK shall be
> > supported as well. But I don't think refcnt logic
> > should be taken into account; these two are orthogonal designs, IMO. *****
> > Pls correct me if I am wrong here. *****
> > 
> > Nether does LOCK/UNLOCK is aware of mbuf alloc/free, for same reason. That
> > is why I said LOCK/UNLOCK needs to
> > survive mbuf alloc initialization.
> > 
> > Of course we need to support locking multiple mbufs at the same time. For
> > each core, we will then preserve, say, 8 slots.
> > It works exactly like a direct mapped cacheline. That is, we can use 4bits
> > from the mbuf address to locate its cacheline.
> > If the cacheline has been occupied, we do an eviction; that is, the new
> > mbuf will take the place of the old one. The old one is
> > then UNLOCKed, unfortunately.
> > 
> > Honestly I have not yet tried this approach in real life application. But I
> > have been thinking over the problem of data race detection
> > for a long time, and I found the restriction and requirement makes this
> > solution the only viable one. There are hundreds of papers
> > published in the field on data race condition detection, but the lightest
> > of the options has at least 5x of performance
> > penalty, not to mention the space complexity, making it not applicable in
> > practice.
> > 
> > Again, pls, anyone has same painful experience of data race bugs, share
> > with me your concerns. It would be nice to come up with
> > some practical device to address this problem. I believe 6Wind and other IP
> > stack vendors must share the same feeling and opinion.
> > 
> > thx &
> > rgds,
> > 
> > Qinglai
> > 
> > 
> > 
> > On Thu, Apr 13, 2017 at 5:59 PM, Adrien Mazarguil <  
> > adrien.mazarguil@6wind.com> wrote:  
> >   
> > > Hi Qinglai,
> > >
> > > On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:  
> > > > Hi,
> > > >
> > > > I have a proposal for mbuf race condition detection and I would like to  
> > > get  
> > > > your opinions before
> > > > committing any patch.
> > > >
> > > > Race condition is the worst bug I can think of; as it causes crashing  
> > > long  
> > > > after the first crime scene,
> > > > and the consequence is unpredictable and difficult to understand.
> > > >
> > > > Besides, race condition is very difficult to reproduce. Usually we get a
> > > > coredump from live network,
> > > > but the coredump itself delivers nearly zero info. We just know that the
> > > > mbuf is somehow broken, and
> > > > it is perhaps overwritten by another core at any moment.
> > > >
> > > > There are tools such as Valgrind and ThreadSanitizer to capture this  
> > > fault.  
> > > > But the overhead brought
> > > > by the tools are too high to be deployed in performance test, not to
> > > > mention in the live network. But
> > > > race condition rarely happens under low pressure.
> > > >
> > > > Even worse, even in theory, the tools mentioned are not capable of
> > > > capturing the scenario, because
> > > > they requires explicit calls on locking primitives such as pthread mutex.
> > > > This is because the semantics
> > > > are not understood by the tools without explicit lock/unlock.
> > > >
> > > > With such known restrictions, I have a proposal roughly as below.
> > > >
> > > > The idea is to ask coder to do explicit lock/unlock on each mbuf that
> > > > matters. The pseudo code is as below:
> > > >
> > > >     RTE_MBUF_LOCK(m);
> > > >     /* use mbuf as usual */
> > > >     ...
> > > >     RTE_MBUF_UNLOCK(m);
> > > >
> > > > During the protected critical section, only the core which holds the lock
> > > > can access the mbuf; while
> > > > other cores, if they dare to use mbuf, they will be punished by segfault.
> > > >
> > > > Since the proposal shall be feasible at least in performance test, the
> > > > overhead of locking and
> > > > punishment must be small. And the access to mbuf must be as usual.
> > > >
> > > > Hence, the RTE_MBUF_LOCK is actually doing the following (pseudo code):
> > > >
> > > > RTE_MBUF_LOCK(m)
> > > > {
> > > >     store_per_core_cache(m, m->buf_addr);
> > > >     m->buf_addr = NULL;
> > > >     mb(); // memory barrier
> > > > }
> > > >
> > > > And RTE_MBUF_UNLOCK is simply the reverse:
> > > >
> > > > RTE_MBUF_UNLOCK(m)
> > > > {
> > > >     purge_per_core_cache(m);
> > > >     m->buf_addr = load_per_core_cache(m);
> > > >     mb();
> > > > }
> > > >
> > > > As we can see, the idea is to re-write m->buf_addr with NULL, and
> > > > store it in a per-core
> > > > cache. The other cores, while trying to access the mbuf during critical
> > > > section, will be certainly
> > > > punished.
> > > >
> > > > Then of course we need to keep the owner core working. This is done by
> > > > making changes to
> > > > mtod, as below:
> > > >
> > > > #define rte_pktmbuf_mtod_offset(m, t, o) \
> > > >     ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + (m)->data_off +
> > > > (o)))
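[Editor's illustration] The LOCK/UNLOCK and mtod pseudo code above can be sketched concretely. The following is a minimal single-threaded sketch under assumed names — mini_mbuf, mbuf_lock, mbuf_unlock, mbuf_mtod and the one-entry cache are hypothetical stand-ins, not DPDK API. The key invariant: exactly one of m->buf_addr and the cached value is non-NULL, so their sum is always the real buffer address and mtod keeps working for the owner core, while any other core dereferencing buf_addr hits NULL.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed-down stand-in for struct rte_mbuf (illustrative only). */
struct mini_mbuf {
	void    *buf_addr;  /* virtual address of the data buffer */
	uint16_t data_off;  /* offset of the packet data in the buffer */
};

/* One cached entry per core; a single entry for this one-thread demo. */
static struct { struct mini_mbuf *key; void *saved; } cache;

static void mbuf_lock(struct mini_mbuf *m)
{
	cache.key = m;                  /* store_per_core_cache(m, buf_addr) */
	cache.saved = m->buf_addr;
	m->buf_addr = NULL;             /* foreign cores now fault on access */
	atomic_thread_fence(memory_order_seq_cst);  /* mb() */
}

static void mbuf_unlock(struct mini_mbuf *m)
{
	m->buf_addr = cache.saved;      /* restore the real buffer address */
	cache.key = NULL;               /* purge_per_core_cache(m) */
	cache.saved = NULL;
	atomic_thread_fence(memory_order_seq_cst);
}

/* Owner-core mtod: add back whatever the cache holds for this mbuf.
 * Unlocked: buf_addr is real, cached value is 0.
 * Locked:   buf_addr is NULL, cached value is the real address. */
static void *mbuf_mtod(struct mini_mbuf *m)
{
	uintptr_t cached = (cache.key == m) ? (uintptr_t)cache.saved : 0;

	return (char *)((uintptr_t)m->buf_addr + cached) + m->data_off;
}
```

A quick usage check: lock the mbuf, confirm buf_addr is now NULL (the "punishment" for foreign cores), confirm mtod still resolves to the same data pointer for the owner, then unlock and confirm the address is restored.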
> > > >
> > > > The per-core cache of mbuf works like a key-value database, which works
> > > > like a direct-mapping
> > > > cache. If an eviction happens (a newly arriving mbuf must kick out an old
> > > > one), then we restore
> > > > the buf_addr of the evicted mbuf. Of course the RTE_MBUF_UNLOCK will then
> > > > take care of such
> > > > situation.
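[Editor's illustration] The direct-mapped cache with eviction described above can be sketched as a small per-core slot table. The 8-slot size, the 64-byte mbuf alignment, and all names here are illustrative assumptions, not DPDK code. On a collision the old occupant is evicted, i.e. silently unlocked by restoring its saved buf_addr — exactly the "restore the buf_addr of the evicted mbuf" behaviour from the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NB_SLOTS 8  /* assumed slot count, power of two */

struct mini_mbuf { void *buf_addr; };

/* Direct-mapped table; in real life there would be one per core. */
static struct { struct mini_mbuf *key; void *saved; } slots[NB_SLOTS];

/* Index by a few address bits; low 6 bits dropped assuming
 * cache-line (64-byte) aligned mbufs. */
static unsigned slot_idx(const struct mini_mbuf *m)
{
	return ((uintptr_t)m >> 6) & (NB_SLOTS - 1);
}

static void mbuf_lock(struct mini_mbuf *m)
{
	unsigned i = slot_idx(m);

	if (slots[i].key == m)
		return;                           /* already locked here */
	if (slots[i].key != NULL)                 /* collision: evict */
		slots[i].key->buf_addr = slots[i].saved;
	slots[i].key = m;
	slots[i].saved = m->buf_addr;
	m->buf_addr = NULL;                       /* punish foreign access */
}

static void mbuf_unlock(struct mini_mbuf *m)
{
	unsigned i = slot_idx(m);

	if (slots[i].key == m) {                  /* may have been evicted */
		m->buf_addr = slots[i].saved;
		slots[i].key = NULL;
		slots[i].saved = NULL;
	}
}
```

The unfortunate side effect the author mentions is visible here: once evicted, an mbuf is no longer protected even though the application believes it still holds the lock.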
> > > >
> > > > Note that we expect the LOCK/UNLOCK to work even when the mbuf is
> > > > freed and allocated, since
> > > > we don't trust the refcnt. A double-free is actually very rare; but data
> > > > race can occur without double-free. Therefore, we need to survive the
> > > > allocation of the mbuf, that is rte_pktmbuf_init, which resets the
> > > > buf_addr.
> > > >
> > > > Further, other dereferences of buf_addr in rte_mbuf.h need to be revised.
> > > > But it is not a big deal (I hope).
> > > >
> > > > The runtime overhead of this implementation is very lightweight, and
> > > > can probably be turned
> > > > on even in a live network. And it is not intrusive at all: no change is
> > > > needed in struct rte_mbuf; we just
> > > > need an O(N) space complexity (where N is the number of cores), and
> > > > O(1) runtime
> > > > complexity.
> > > >
> > > > Alright, that is basically what is in my mind. Before any patch is
> > > > provided, or any perf analysis is made, I would like to know what the
> > > > risks are from a design point of view. Please leave your feedback.
> > >
> > > Your proposal makes sense but I'm not sure where developers should call
> > > RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user applications
> > > only? Is it to be used internally by DPDK as well? Does it include PMDs?
> > >
> > > I think any overhead outside of debugging mode, as minimal as it is, is too
> > > much of a "punishment" for well behaving applications. The cost of a memory
> > > barrier can be extremely high and this is one of the reasons DPDK processes
> > > mbufs as bulks every time it gets the chance. RTE_MBUF_LOCK() and
> > > RTE_MBUF_UNLOCK() should thus exist under some compilation option disabled
> > > by default, and be thought of as an additional debugging tool.
> > >
> > > Also the implementation would probably be more expensive than your
> > > suggestion, in my opinion the reference count must be taken into account
> > > and/or RTE_MBUF_LOCK() must be recursive, because this is how mbufs
> > > work. Freeing mbufs multiple times is perfectly valid in many cases.
> > >
> > > How can one lock several mbufs at once by the way?
> > >
> > > Since it affects performance, this can only make sense as a debugging tool
> > > developers can use when they encounter some corruption issue they cannot
> > > identify, somewhat like poisoning buffers before they are freed. Not sure
> > > it's worth the trouble.  
> 
> I agree with Adrien here - it doesn't look worth it:
> From one side it adds quite a lot of overhead and complexity to the mbuf code.
> From the other side its usage seems pretty limited - there would be a lot of
> cases where it wouldn't be able to detect the race condition anyway.
> So my opinion is NACK.
> BTW, if your app is intentionally trying to do read/write to the same mbufs simultaneously from different threads
> without some explicit synchronization, then most likely there is something wrong with it.
> Konstantin

This is a one-off debug thing. If you have these kinds of issues, you really need to rethink
your threading design. The examples have a clear thread model. If you follow the documentation
about the process model, and don't try to reinvent your own, then this kind of hand-holding
is not necessary.

NAK as well.
Qinglai Xiao April 14, 2017, 8:51 a.m. UTC | #3
Hi all,

Thanks for the comments. It seems the proposal is not fit for wide usage in
applications.

thx &
rgds,
-qinglai

On Fri, Apr 14, 2017 at 3:03 AM, Stephen Hemminger <
stephen@networkplumber.org> wrote:

> [...]
diff mbox

Patch

==========================================
diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h
b/examples/l3fwd/l3fwd_em_hlm_sse.h
index 7714a20..9db0190 100644
--- a/examples/l3fwd/l3fwd_em_hlm_sse.h
+++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
@@ -299,6 +299,15 @@  l3fwd_em_send_packets(int nb_rx, struct rte_mbuf
**pkts_burst,

        for (j = 0; j < n; j += 8) {

+            RTE_MBUF_LOCK(pkts_burst[j]);
+            RTE_MBUF_LOCK(pkts_burst[j+1]);
+            RTE_MBUF_LOCK(pkts_burst[j+2]);
+            RTE_MBUF_LOCK(pkts_burst[j+3]);
+            RTE_MBUF_LOCK(pkts_burst[j+4]);
+            RTE_MBUF_LOCK(pkts_burst[j+5]);
+            RTE_MBUF_LOCK(pkts_burst[j+6]);
+            RTE_MBUF_LOCK(pkts_burst[j+7]);
+
                uint32_t pkt_type =
                        pkts_burst[j]->packet_type &
                        pkts_burst[j+1]->packet_type &
@@ -333,8 +342,10 @@  l3fwd_em_send_packets(int nb_rx, struct rte_mbuf
**pkts_burst,
                }
        }

-       for (; j < nb_rx; j++)
+      for (; j < nb_rx; j++) {
+              RTE_MBUF_LOCK(pkts_burst[j]);
                dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
+      }

        send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);

diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
index 1afa1f0..2938558 100644
--- a/examples/l3fwd/l3fwd_sse.h
+++ b/examples/l3fwd/l3fwd_sse.h
@@ -322,6 +322,9 @@  send_packetsx4(struct lcore_conf *qconf, uint8_t port,
struct rte_mbuf *m[],

        len = qconf->tx_mbufs[port].len;

+      for (j = 0; j < num; ++j)
+            RTE_MBUF_UNLOCK(m[j]);
+
        /*
         * If TX buffer for that queue is empty, and we have enough packets,