Message ID | 20171010095636.4507-1-hejianet@gmail.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers |
From: Jia He <hejianet@gmail.com>
To: olivier.matz@6wind.com
Cc: dev@dpdk.org, Jia He <hejianet@gmail.com>, jia.he@hxt-semitech.com, jie2.liu@hxt-semitech.com, bing.zhao@hxt-semitech.com
Date: Tue, 10 Oct 2017 17:56:36 +0800
Message-Id: <20171010095636.4507-1-hejianet@gmail.com>
Subject: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
List-Id: DPDK patches and discussions <dev.dpdk.org>
Checks
Context | Check | Description |
---|---|---|
ci/checkpatch | warning | coding style issues |
ci/Intel-compilation | fail | Compilation issues |
Commit Message
Jia He
Oct. 10, 2017, 9:56 a.m. UTC
Before this patch, in __rte_ring_move_cons_head():

    ...
    do {
        /* Restore n as it may change every loop */
        n = max;

        *old_head = r->cons.head;                 /* 1st load */
        const uint32_t prod_tail = r->prod.tail;  /* 2nd load */

On weakly ordered architectures (PowerPC, ARM), the 2nd load might be reordered before the 1st load, which makes *entries bigger than we wanted. This nasty reordering messes enqueue/dequeue up:

    cpu1 (producer)        cpu2 (consumer)        cpu3 (consumer)
                           load r->prod.tail
    in enqueue:
    load r->cons.tail
    load r->prod.head

    store r->prod.tail

                                                  load r->cons.head
                                                  load r->prod.tail
                                                  ...
                                                  store r->cons.{head,tail}
                           load r->cons.head

Then r->cons.head will be bigger than prod_tail, making *entries very big.

After this patch, the stale cons.head will be recalculated after a failure of rte_atomic32_cmpset().

There is no such issue on x86 CPUs, because x86 has a strong memory ordering model.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: jia.he@hxt-semitech.com
Signed-off-by: jie2.liu@hxt-semitech.com
Signed-off-by: bing.zhao@hxt-semitech.com
---
 lib/librte_ring/rte_ring.h | 8 ++++++++
 1 file changed, 8 insertions(+)
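To see why the stale value is so damaging: the ring computes the available entry count with an unsigned 32-bit subtraction, so a prod.tail observed "in the past" relative to cons.head wraps around to a huge count. A minimal, self-contained illustration (the concrete values are made up; only the subtraction mirrors the ring code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Illustrative values for the cpu2 interleaving above: prod.tail
         * was loaded early (reordered) and is stale; cons.head was loaded
         * late, after other consumers had advanced it past prod_tail. */
        uint32_t prod_tail = 2;
        uint32_t cons_head = 5;

        /* The same unsigned subtraction the ring uses to compute how many
         * entries are available -- it wraps modulo 2^32. */
        uint32_t entries = prod_tail - cons_head;

        printf("entries = %u\n", entries); /* 4294967293: "very big" */
        return 0;
    }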
Comments
Hi,

On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> Before this patch:
> In __rte_ring_move_cons_head()
> [...]
> There is no such issue on x86 CPUs, because x86 has a strong memory ordering model.
>
> Signed-off-by: Jia He <hejianet@gmail.com>
> Signed-off-by: jia.he@hxt-semitech.com
> Signed-off-by: jie2.liu@hxt-semitech.com
> Signed-off-by: bing.zhao@hxt-semitech.com
>
> ---
>  lib/librte_ring/rte_ring.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index 5e9b3b7..15c72e2 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>  		n = max;
>  
>  		*old_head = r->prod.head;
> +
> +		/* load of prod.tail can't be reordered before cons.head */
> +		rte_smp_rmb();
> +
>  		const uint32_t cons_tail = r->cons.tail;
>  		/*
>  		 * The subtraction is done between two unsigned 32bits value
> @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		n = max;
>  
>  		*old_head = r->cons.head;
> +
> +		/* load of prod.tail can't be reordered before cons.head */
> +		rte_smp_rmb();
> +
>  		const uint32_t prod_tail = r->prod.tail;
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> --
> 2.7.4

The explanation convinces me.

However, since it's in a critical path, it would be good to have other opinions. This patch reminds me of this discussion, which was also related to a memory barrier, but in another place:
http://dpdk.org/ml/archives/dev/2016-July/043765.html
Lead to that patch: http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
But finally reverted: http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3

Konstantin, Jerin, do you have any comment?
On Thu, 12 Oct 2017 17:53:51 +0200
Olivier MATZ <olivier.matz@6wind.com> wrote:

> Hi,
>
> On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> [...]
>
> The explanation convinces me.
>
> However, since it's in a critical path, it would be good to have other
> opinions. [...]
>
> Konstantin, Jerin, do you have any comment?

On strongly ordered architectures like x86, rte_smp_rmb() turns into a compiler_barrier(). Compiler barriers generate no instructions; they just tell the compiler not to move data-flow optimizations across them. Since the pointers are mostly volatile, it shouldn't even change the generated code.
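For reference, a rough sketch of how rte_smp_rmb() degrades per architecture; this paraphrases DPDK's per-arch rte_atomic headers of that era, and the exact macro bodies may differ by release:

    /* x86: loads are not reordered with other loads, so a pure compiler
     * barrier suffices -- it emits no machine instruction at all. */
    #define rte_compiler_barrier() do { asm volatile ("" : : : "memory"); } while (0)
    #define rte_smp_rmb() rte_compiler_barrier()

    /* arm64: loads may be reordered with other loads, so a real fence is
     * needed. "dmb ishld" orders earlier loads before later loads/stores
     * within the inner shareable domain -- this is the fast-path cost the
     * arm64 maintainers are concerned about. */
    #define rte_smp_rmb() asm volatile ("dmb ishld" : : : "memory")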
Hi guys,

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Thursday, October 12, 2017 4:54 PM
> Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
> [...]
> The explanation convinces me.
>
> However, since it's in a critical path, it would be good to have other
> opinions. This patch reminds me of this discussion, which was also related to
> a memory barrier, but in another place:
> http://dpdk.org/ml/archives/dev/2016-July/043765.html
> Lead to that patch: http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
> But finally reverted: http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3
>
> Konstantin, Jerin, do you have any comment?

For IA, as rte_smp_rmb() is just a compiler barrier, that patch shouldn't make any difference, but I can't see how read reordering would screw things up here... Probably just me; the arm or ppc guys could explain what the problem would be if, say, cons.tail were read before prod.head in __rte_ring_move_prod_head().

I wonder: is there a simple test case to reproduce the problem (on arm or ppc)? Probably a new test case for the rte_ring autotest is needed, or is it possible to reproduce it with an existing one?

Konstantin
> -----Original Message-----
> Date: Thu, 12 Oct 2017 17:05:50 +0000
> From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
> To: Olivier MATZ <olivier.matz@6wind.com>, Jia He <hejianet@gmail.com>
> Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when
>  doing enqueue/dequeue
> [...]
> For IA, as rte_smp_rmb() is just a compiler barrier, that patch shouldn't make
> any difference, but I can't see how read reordering would screw things up here...
> Probably just me; the arm or ppc guys could explain what the problem would be
> if, say, cons.tail were read before prod.head in __rte_ring_move_prod_head().
> I wonder: is there a simple test case to reproduce the problem (on arm or ppc)?
> Probably a new test case for the rte_ring autotest is needed, or is it possible to
> reproduce it with an existing one?

On the same lines,

Jia He, jie2.liu, bing.zhao,

Is this patch based on code review, or did you see this issue on an actual arm/ppc target? arm64 will have a performance impact with this change.

> Konstantin
Hi guys,

We found this issue when we ran mbuf_autotest. It failed on an aarch64 platform. I am not sure if it can be reproduced on other platforms.

Regards,
Jie Liu

-----Original Message-----
From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
Sent: October 13, 2017 1:06
To: Olivier MATZ <olivier.matz@6wind.com>; Jia He <hejianet@gmail.com>
Cc: dev@dpdk.org; He, Jia <jia.he@hxt-semitech.com>; Liu, Jie2 <jie2.liu@hxt-semitech.com>; Zhao, Bing <bing.zhao@hxt-semitech.com>; jerin.jacob@caviumnetworks.com
Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

> [...]
> I wonder: is there a simple test case to reproduce the problem (on arm or ppc)?
> Probably a new test case for the rte_ring autotest is needed, or is it possible to
> reproduce it with an existing one?
> Konstantin
Hi Jerin,

On 10/13/2017 1:23 AM, Jerin Jacob wrote:
> [...]
> On the same lines,
>
> Jia He, jie2.liu, bing.zhao,
>
> Is this patch based on code review, or did you see this issue on an actual
> arm/ppc target? arm64 will have a performance impact with this change.

Based on mbuf_autotest, the rte_panic will be invoked in seconds.

PANIC in test_refcnt_iter():
(lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
1: [./test(rte_dump_stack+0x38) [0x58d868]]
Aborted (core dumped)

Cheers,
Jia
Hi,

On 10/13/2017 9:02 AM, Jia He wrote:
> Hi Jerin
>
> On 10/13/2017 1:23 AM, Jerin Jacob wrote:
>> [...]
>> On the same lines,
>>
>> Jia He, jie2.liu, bing.zhao,
>>
>> Is this patch based on code review, or did you see this issue on an actual
>> arm/ppc target? arm64 will have a performance impact with this change.

Sorry, I missed one important piece of information: our platform is an aarch64 server with 46 CPUs. If we reduce the number of CPUs involved, the bug occurs less frequently.

Yes, the memory barrier impacts performance, but correctness is more important, isn't it ;-) Maybe we can find some other lightweight barrier here?

Cheers,
Jia

> Based on mbuf_autotest, the rte_panic will be invoked in seconds.
>
> PANIC in test_refcnt_iter():
> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
> 1: [./test(rte_dump_stack+0x38) [0x58d868]]
> Aborted (core dumped)
> -----Original Message-----
> Date: Fri, 13 Oct 2017 09:16:31 +0800
> From: Jia He <hejianet@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>
> Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when
>  doing enqueue/dequeue
> [...]
> Sorry, I missed one important piece of information: our platform is an aarch64
> server with 46 CPUs.

Is this an OOO (out-of-order execution) aarch64 CPU implementation?

> If we reduce the number of CPUs involved, the bug occurs less frequently.
>
> Yes, the memory barrier impacts performance, but correctness is more important,
> isn't it ;-)

Yes.

> Maybe we can find some other lightweight barrier here?

Yes. Regarding a lightweight barrier, arm64 has native support for acquire and release semantics, which is exposed through gcc as architecture-agnostic functions:
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
http://preshing.com/20130922/acquire-and-release-fences/

Good to know:
1) How much overhead does this patch add on your platform? Just relative numbers are enough.
2) As a prototype, does changing to acquire and release semantics reduce the overhead on your platform?

Reference FreeBSD ring/DPDK-style ring implementation through acquire and release semantics:
https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c

I will also spend some cycles on this.
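As a concrete starting point for such a prototype (a sketch only, over a simplified stand-in struct; this is not the code that was merged into DPDK), the explicit rte_smp_rmb() can be replaced by an acquire-ordered load of the head index using the gcc builtins mentioned above:

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-in for the relevant rte_ring fields; the real
     * struct differs (cache-line padding, watermark, etc.). */
    struct headtail { volatile uint32_t head; volatile uint32_t tail; };
    struct ring { struct headtail prod; struct headtail cons; };

    static inline uint32_t
    move_cons_head(struct ring *r, uint32_t n, uint32_t *old_head)
    {
        uint32_t prod_tail, entries;
        bool ok;

        do {
            /* Acquire load: the subsequent load of prod.tail cannot be
             * hoisted above this load -- the same ordering the patch's
             * rte_smp_rmb() enforces, but scoped to this one access. */
            *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
            prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);

            /* Unsigned 32-bit subtraction, as in the ring code. */
            entries = prod_tail - *old_head;
            if (n > entries)
                return 0; /* not enough entries to dequeue */

            /* On failure, loop back and reload a fresh cons.head. */
            ok = __atomic_compare_exchange_n(&r->cons.head, old_head,
                    *old_head + n, false /* strong */,
                    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
        } while (!ok);

        return n;
    }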
Hi guys,

If needed, we can provide a simplified UT test case and ring patch to reproduce this. We can then capture information proving that in an SP+MC scenario, when the tail and head pointer values are still smaller than the ring size, some lcore thread makes a faulty judgment and lets the consumer head (CH) pass the producer tail (PT).

Thanks

BR. Bing

-----Original Message-----
From: Liu, Jie2
Sent: October 13, 2017 8:25
To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Olivier MATZ <olivier.matz@6wind.com>; dev@dpdk.org; jerin.jacob@caviumnetworks.com
Cc: He, Jia <jia.he@hxt-semitech.com>; Zhao, Bing <bing.zhao@hxt-semitech.com>; Jia He <hejianet@gmail.com>
Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

> Hi guys,
> We found this issue when we ran mbuf_autotest. It failed on an aarch64 platform.
> I am not sure if it can be reproduced on other platforms.
> Regards,
> Jie Liu
> [...]
> -----Original Message-----
> Date: Fri, 13 Oct 2017 02:12:58 +0000
> From: "Zhao, Bing" <bing.zhao@hxt-semitech.com>
> To: "Liu, Jie2" <jie2.liu@hxt-semitech.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>, Olivier MATZ <olivier.matz@6wind.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "jerin.jacob@caviumnetworks.com"
>  <jerin.jacob@caviumnetworks.com>
> Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when
>  doing enqueue/dequeue
>
> Hi guys,

Hi Bing,

> If needed, we can provide a simplified UT test case and ring patch to reproduce
> this. We can then capture information proving that in an SP+MC scenario, when
> the tail and head pointer values are still smaller than the ring size, some lcore
> thread makes a faulty judgment and lets the consumer head (CH) pass the
> producer tail (PT).

Yes. Please provide the simplified UT test case and ring patch.

> Thanks
>
> BR. Bing
> [...]
On 10/13/2017 07:19, Jerin Jacob wrote:
> [...]
> > Maybe we can find some other lightweight barrier here?
>
> Yes. Regarding a lightweight barrier, arm64 has native support for acquire and
> release semantics, which is exposed through gcc as architecture-agnostic functions:
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> http://preshing.com/20130922/acquire-and-release-fences/
>
> Good to know:
> 1) How much overhead does this patch add on your platform? Just relative
> numbers are enough.
> 2) As a prototype, does changing to acquire and release semantics reduce
> the overhead on your platform?

+1, can you try what ODP does in the link mentioned below?

> Reference FreeBSD ring/DPDK-style ring implementation through acquire
> and release semantics:
> https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
>
> I will also spend some cycles on this.
Hi,

On 10/13/2017 3:33 PM, Jianbo Liu wrote:
> [...]
> +1, can you try what ODP does in the link mentioned below?

Sure, please see the result:

root@server:~/odp/test/linux-generic/ring# ./ring_main
HW time counter freq: 20000000 hz
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishm.c:880:_odp_ishm_reserve():No huge pages, fall back to normal pages. check: /proc/sys/vm/nr_hugepages.
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
PKTIO: initialized loop interface.
PKTIO: initialized ipc interface.
PKTIO: initialized socket mmap, use export ODP_PKTIO_DISABLE_SOCKET_MMAP=1 to disable.
PKTIO: initialized socket mmsg, use export ODP_PKTIO_DISABLE_SOCKET_MMSG=1 to disable.
ODP API version: 1.15.0
ODP implementation name: "odp-linux"
ODP implementation version: "odp-linux" 1.15.0-0 (v1.15.0) 1.15.0.0

  CUnit - A unit testing framework for C - Version 2.1-3
  http://cunit.sourceforge.net/

Suite: ring basic
  Test: ring_test_basic_create ... pktio/ring.c:177:_ring_create():Requested size is invalid, must be power of 2, and do not exceed the size limit 268435455
passed
  Test: ring_test_basic_burst ... passed
  Test: ring_test_basic_bulk ... passed
  Test: ring_test_basic_watermark ... passed
Suite: ring stress
  Test: ring_test_stress_1_1_producer_consumer ... passed
  Test: ring_test_stress_1_N_producer_consumer ... passed
  Test: ring_test_stress_N_1_producer_consumer ... passed
  Test: ring_test_stress_N_M_producer_consumer ...
<the test case has hung here for half an hour>

Cheers,
Jia
Kuusisaari, Juhamatti (Infinera - FI/Espoo)
Oct. 16, 2017, 10:51 a.m. UTC | #14
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Liu, Jie2
> Sent: Friday, October 13, 2017 3:25 AM
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
>
> Hi guys,
> We found this issue when we ran mbuf_autotest. It failed on an aarch64
> platform. I am not sure if it can be reproduced on other platforms.
> Regards,
> Jie Liu
> [...]
> > I wonder: is there a simple test case to reproduce the problem (on arm or ppc)?
> > Probably a new test case for the rte_ring autotest is needed, or is it possible
> > to reproduce it with an existing one?
> > Konstantin

Hi,

I think this is a real problem here. We have fixed it so that we read both head and tail atomically, as otherwise you may hit exactly the problem you are seeing. Works only on 64-bit, of course.

Thanks for pointing out the revert; I was unaware of it:
http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3

Nevertheless, the original fix was as such correct on a higher level; both PPC and ARM just happened to use a strong enough default read barrier, which already guaranteed sufficient ordering. As we optimize cycles here, I am of course fine with the revert as long as we all remember what was going on there.

BR,
--
Juhamatti
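For readers unfamiliar with the trick Juhamatti describes, here is a sketch of what "read both head and tail atomically" can look like. This is an assumption about their out-of-tree fix, not DPDK code, and it presumes a little-endian 64-bit machine with the two indices adjacent (DPDK's real ring deliberately keeps prod and cons on separate cache lines):

    #include <stdint.h>

    /* Hypothetical layout: the two 32-bit indices share one aligned
     * 64-bit word so they can be fetched in a single load. */
    struct ht {
        uint32_t head;
        uint32_t tail;
    } __attribute__((aligned(8)));

    static inline void
    read_head_tail(const struct ht *p, uint32_t *head, uint32_t *tail)
    {
        /* One 64-bit atomic load: both halves are observed at the same
         * instant, so the pair can never be mutually inconsistent.
         * (Strict-aliasing caveats are ignored in this sketch.) */
        uint64_t v = __atomic_load_n((const uint64_t *)p, __ATOMIC_RELAXED);

        *head = (uint32_t)v;          /* low half on little-endian */
        *tail = (uint32_t)(v >> 32);  /* high half */
    }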
Hi Jia, > > Hi > > > On 10/13/2017 9:02 AM, Jia He Wrote: > > Hi Jerin > > > > > > On 10/13/2017 1:23 AM, Jerin Jacob Wrote: > >> -----Original Message----- > >>> Date: Thu, 12 Oct 2017 17:05:50 +0000 > >>> > [...] > >> On the same lines, > >> > >> Jia He, jie2.liu, bing.zhao, > >> > >> Is this patch based on code review or do you saw this issue on any of > >> the > >> arm/ppc target? arm64 will have performance impact with this change. > sorry, miss one important information > Our platform is an aarch64 server with 46 cpus. > If we reduced the involved cpu numbers, the bug occurred less frequently. > > Yes, mb barrier impact the performance, but correctness is more > important, isn't it ;-) > Maybe we can find any other lightweight barrier here? > > Cheers, > Jia > > Based on mbuf_autotest, the rte_panic will be invoked in seconds. > > > > PANIC in test_refcnt_iter(): > > (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free > > 1: [./test(rte_dump_stack+0x38) [0x58d868]] > > Aborted (core dumped) > > So is it only reproducible with the mbuf refcnt test? Could it be reproduced with some 'pure' ring test (no mempools/mbufs refcnt, etc.)? The reason I am asking - in that test we also have mbuf refcnt updates (that's what that test was created for) and we are doing some optimizations there too to avoid excessive atomic updates. BTW, if the problem is not reproducible without mbuf refcnt, may I suggest extending the test with: - add a check that the enqueue() operation was successful - walk through the pool and check/printf the refcnt of each mbuf, as sketched below. Hopefully that would give us some extra information about what is going wrong here. Konstantin
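For illustration, Konstantin's pool-walk suggestion could look something like the sketch below - a hypothetical debug helper (not part of the existing test) built on rte_mempool_obj_iter(); the helper name and its placement in the test are assumptions, not part of the test today.

    #include <stdio.h>
    #include <rte_mempool.h>
    #include <rte_mbuf.h>

    /* Hypothetical debug helper: print the refcnt of every mbuf managed
     * by the pool (allocated or free), to spot leaked references. */
    static void
    dump_mbuf_refcnt(struct rte_mempool *mp, void *opaque,
                     void *obj, unsigned obj_idx)
    {
            struct rte_mbuf *m = obj;

            (void)mp;
            (void)opaque;
            printf("mbuf %u: refcnt=%u\n", obj_idx, rte_mbuf_refcnt_read(m));
    }

    /* usage, e.g. at the end of test_refcnt_iter():
     *      rte_mempool_obj_iter(refcnt_pool, dump_mbuf_refcnt, NULL);
     */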
Hi, On 2017/10/19 18:02, Ananyev, Konstantin wrote: > > Hi Jia, > >> >> Hi >> >> >> On 10/13/2017 9:02 AM, Jia He Wrote: >>> Hi Jerin >>> >>> >>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote: >>>> -----Original Message----- >>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000 >>>>> >> [...] >>>> On the same lines, >>>> >>>> Jia He, jie2.liu, bing.zhao, >>>> >>>> Is this patch based on code review or do you saw this issue on any of >>>> the >>>> arm/ppc target? arm64 will have performance impact with this change. >> sorry, miss one important information >> Our platform is an aarch64 server with 46 cpus. >> If we reduced the involved cpu numbers, the bug occurred less frequently. >> >> Yes, mb barrier impact the performance, but correctness is more >> important, isn't it ;-) >> Maybe we can find any other lightweight barrier here? >> >> Cheers, >> Jia >>> Based on mbuf_autotest, the rte_panic will be invoked in seconds. >>> >>> PANIC in test_refcnt_iter(): >>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free >>> 1: [./test(rte_dump_stack+0x38) [0x58d868]] >>> Aborted (core dumped) >>> > > So is it only reproducible with mbuf refcnt test? > Could it be reproduced with some 'pure' ring test > (no mempools/mbufs refcnt, etc.)? > The reason I am asking - in that test we also have mbuf refcnt updates > (that's what for that test was created) and we are doing some optimizations here too > to avoid excessive atomic updates. > BTW, if the problem is not reproducible without mbuf refcnt, > can I suggest to extend the test with: > - add a check that enqueue() operation was successful > - walk through the pool and check/printf refcnt of each mbuf. > Hopefully that would give us some extra information what is going wrong here. > Konstantin > > Currently, the issue has only been found in this case on the ARM platform; we are not sure how it behaves on the X86_64 platform. In another mail of this thread, we made a simple test based on this, captured some information, and pasted it there (I pasted the patch there :-)). And it seems that Juhamatti & Jacob found some reverting action several months ago. BR. Bing
> > Hi, > > On 2017/10/19 18:02, Ananyev, Konstantin wrote: > > > > Hi Jia, > > > >> > >> Hi > >> > >> > >> On 10/13/2017 9:02 AM, Jia He Wrote: > >>> Hi Jerin > >>> > >>> > >>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote: > >>>> -----Original Message----- > >>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000 > >>>>> > >> [...] > >>>> On the same lines, > >>>> > >>>> Jia He, jie2.liu, bing.zhao, > >>>> > >>>> Is this patch based on code review or do you saw this issue on any of > >>>> the > >>>> arm/ppc target? arm64 will have performance impact with this change. > >> sorry, miss one important information > >> Our platform is an aarch64 server with 46 cpus. > >> If we reduced the involved cpu numbers, the bug occurred less frequently. > >> > >> Yes, mb barrier impact the performance, but correctness is more > >> important, isn't it ;-) > >> Maybe we can find any other lightweight barrier here? > >> > >> Cheers, > >> Jia > >>> Based on mbuf_autotest, the rte_panic will be invoked in seconds. > >>> > >>> PANIC in test_refcnt_iter(): > >>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free > >>> 1: [./test(rte_dump_stack+0x38) [0x58d868]] > >>> Aborted (core dumped) > >>> > > > > So is it only reproducible with mbuf refcnt test? > > Could it be reproduced with some 'pure' ring test > > (no mempools/mbufs refcnt, etc.)? > > The reason I am asking - in that test we also have mbuf refcnt updates > > (that's what for that test was created) and we are doing some optimizations here too > > to avoid excessive atomic updates. > > BTW, if the problem is not reproducible without mbuf refcnt, > > can I suggest to extend the test with: > > - add a check that enqueue() operation was successful > > - walk through the pool and check/printf refcnt of each mbuf. > > Hopefully that would give us some extra information what is going wrong here. > > Konstantin > > > > > Currently, the issue is only found in this case here on the ARM > platform, not sure how it is going with the X86_64 platform I understand that it is only reproducible on arm so far. What I am asking - with dpdk is there any other way to reproduce it (on arm) except then running mbuf_autotest? Something really simple that not using mbuf/mempool etc? Just do dequeue/enqueue from multiple threads and check data integrity at the end? If not - what makes you think that the problem is precisely in rte_ring code? Why not in rte_mbuf let say? >. In another > mail of this thread, we've made a simple test based on this and captured > some information and I pasted there.(I pasted the patch there :-)) Are you talking about that one: http://dpdk.org/dev/patchwork/patch/30405/ ? It still uses test/test/test_mbuf.c..., but anyway I don't really understand how mbuf_autotest supposed to work with these changes: @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, rte_ring_enqueue(refcnt_mbuf_ring, m); } } - rte_pktmbuf_free(m); + // rte_pktmbuf_free(m); } @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, while (!rte_ring_empty(refcnt_mbuf_ring)) ; + if (NULL != m) { + if (1 != rte_mbuf_refcnt_read(m)) + printf("m ref is %u\n", rte_mbuf_refcnt_read(m)); + rte_pktmbuf_free(m); + } + /* check that all mbufs are back into mempool by now */ for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) { if ((i = rte_mempool_avail_count(refcnt_pool)) == n) { That means all your mbufs (except the last one) will still be allocated. So the test would fail - as it should, I think. 
> And > it seems that Juhamatti & Jacod found some reverting action several > months ago. Didn't get that one either. Konstantin
> -----Original Message----- > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin > Sent: Thursday, October 19, 2017 3:15 PM > To: Zhao, Bing <ilovethull@163.com>; Jia He <hejianet@gmail.com>; Jerin Jacob <jerin.jacob@caviumnetworks.com> > Cc: Olivier MATZ <olivier.matz@6wind.com>; dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com; bing.zhao@hxt- > semitech.com > Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue > > > > > Hi, > > > > On 2017/10/19 18:02, Ananyev, Konstantin wrote: > > > > > > Hi Jia, > > > > > >> > > >> Hi > > >> > > >> > > >> On 10/13/2017 9:02 AM, Jia He Wrote: > > >>> Hi Jerin > > >>> > > >>> > > >>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote: > > >>>> -----Original Message----- > > >>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000 > > >>>>> > > >> [...] > > >>>> On the same lines, > > >>>> > > >>>> Jia He, jie2.liu, bing.zhao, > > >>>> > > >>>> Is this patch based on code review or do you saw this issue on any of > > >>>> the > > >>>> arm/ppc target? arm64 will have performance impact with this change. > > >> sorry, miss one important information > > >> Our platform is an aarch64 server with 46 cpus. > > >> If we reduced the involved cpu numbers, the bug occurred less frequently. > > >> > > >> Yes, mb barrier impact the performance, but correctness is more > > >> important, isn't it ;-) > > >> Maybe we can find any other lightweight barrier here? > > >> > > >> Cheers, > > >> Jia > > >>> Based on mbuf_autotest, the rte_panic will be invoked in seconds. > > >>> > > >>> PANIC in test_refcnt_iter(): > > >>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free > > >>> 1: [./test(rte_dump_stack+0x38) [0x58d868]] > > >>> Aborted (core dumped) > > >>> > > > > > > So is it only reproducible with mbuf refcnt test? > > > Could it be reproduced with some 'pure' ring test > > > (no mempools/mbufs refcnt, etc.)? > > > The reason I am asking - in that test we also have mbuf refcnt updates > > > (that's what for that test was created) and we are doing some optimizations here too > > > to avoid excessive atomic updates. > > > BTW, if the problem is not reproducible without mbuf refcnt, > > > can I suggest to extend the test with: > > > - add a check that enqueue() operation was successful > > > - walk through the pool and check/printf refcnt of each mbuf. > > > Hopefully that would give us some extra information what is going wrong here. > > > Konstantin > > > > > > > > Currently, the issue is only found in this case here on the ARM > > platform, not sure how it is going with the X86_64 platform > > I understand that it is only reproducible on arm so far. > What I am asking - with dpdk is there any other way to reproduce it (on arm) > except then running mbuf_autotest? > Something really simple that not using mbuf/mempool etc? > Just do dequeue/enqueue from multiple threads and check data integrity at the end? > If not - what makes you think that the problem is precisely in rte_ring code? > Why not in rte_mbuf let say? Actually I reread your original mail and finally got your point. If I understand you correctly, the problem with read reordering here is that after we read prod.tail but before we read cons.head, both cons.head and prod.tail might be updated, but for us the prod.tail change might go unnoticed. 
As an example: time 0 (cons.head == 0, prod.tail == 0): prod_tail = r->prod.tail; /* due to read reordering */ /* prod_tail == 0 */ time 1 (cons.head == 5, prod.tail == 5): *old_head = r->cons.head; /* cons.head == 5 */ *entries = (prod_tail - *old_head); /* *entries == (0 - 5) == 0xFFFFFFFB */ And we will move our cons.head forward, even though there are no filled entries in the ring. Is that right? As a side note, I wonder why we have here: *entries = (prod_tail - *old_head); instead of: *entries = r->size + prod_tail - *old_head; ? Konstantin > > >. In another > > mail of this thread, we've made a simple test based on this and captured > > some information and I pasted there.(I pasted the patch there :-)) > > Are you talking about that one: > http://dpdk.org/dev/patchwork/patch/30405/ > ? > It still uses test/test/test_mbuf.c..., > but anyway I don't really understand how mbuf_autotest supposed > to work with these changes: > @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, > rte_ring_enqueue(refcnt_mbuf_ring, m); > } > } > - rte_pktmbuf_free(m); > + // rte_pktmbuf_free(m); > } > @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, > while (!rte_ring_empty(refcnt_mbuf_ring)) > ; > > + if (NULL != m) { > + if (1 != rte_mbuf_refcnt_read(m)) > + printf("m ref is %u\n", rte_mbuf_refcnt_read(m)); > + rte_pktmbuf_free(m); > + } > + > /* check that all mbufs are back into mempool by now */ > for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) { > if ((i = rte_mempool_avail_count(refcnt_pool)) == n) { > > That means all your mbufs (except the last one) will still be allocated. > So the test would fail - as it should, I think. > > > And > > it seems that Juhamatti & Jacod found some reverting action several > > months ago. > > Didn't get that one either. > Konstantin
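To make the failure arithmetic above concrete, here is a tiny standalone sketch (hypothetical values, mirroring the time 0/time 1 scenario) of how a stale prod_tail and a fresh cons.head produce a huge unsigned entry count:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            /* stale prod.tail read early due to reordering ("time 0") */
            uint32_t prod_tail = 0;
            /* fresh cons.head read later ("time 1") */
            uint32_t old_head = 5;

            /* same unsigned subtraction as __rte_ring_move_cons_head() */
            uint32_t entries = prod_tail - old_head;

            printf("entries = 0x%08X\n", entries); /* prints 0xFFFFFFFB */
            return 0;
    }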
On 10/20/2017 4:02 AM, Ananyev, Konstantin Wrote: > >> -----Original Message----- >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin >> Sent: Thursday, October 19, 2017 3:15 PM >> To: Zhao, Bing <ilovethull@163.com>; Jia He <hejianet@gmail.com>; Jerin Jacob <jerin.jacob@caviumnetworks.com> >> Cc: Olivier MATZ <olivier.matz@6wind.com>; dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com; bing.zhao@hxt- >> semitech.com >> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue >> >>> Hi, >>> >>> On 2017/10/19 18:02, Ananyev, Konstantin wrote: >>>> Hi Jia, >>>> >>>>> Hi >>>>> >>>>> >>>>> On 10/13/2017 9:02 AM, Jia He Wrote: >>>>>> Hi Jerin >>>>>> >>>>>> >>>>>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote: >>>>>>> -----Original Message----- >>>>>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000 >>>>>>>> >>>>> [...] >>>>>>> On the same lines, >>>>>>> >>>>>>> Jia He, jie2.liu, bing.zhao, >>>>>>> >>>>>>> Is this patch based on code review or do you saw this issue on any of >>>>>>> the >>>>>>> arm/ppc target? arm64 will have performance impact with this change. >>>>> sorry, miss one important information >>>>> Our platform is an aarch64 server with 46 cpus. >>>>> If we reduced the involved cpu numbers, the bug occurred less frequently. >>>>> >>>>> Yes, mb barrier impact the performance, but correctness is more >>>>> important, isn't it ;-) >>>>> Maybe we can find any other lightweight barrier here? >>>>> >>>>> Cheers, >>>>> Jia >>>>>> Based on mbuf_autotest, the rte_panic will be invoked in seconds. >>>>>> >>>>>> PANIC in test_refcnt_iter(): >>>>>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free >>>>>> 1: [./test(rte_dump_stack+0x38) [0x58d868]] >>>>>> Aborted (core dumped) >>>>>> >>>> So is it only reproducible with mbuf refcnt test? >>>> Could it be reproduced with some 'pure' ring test >>>> (no mempools/mbufs refcnt, etc.)? >>>> The reason I am asking - in that test we also have mbuf refcnt updates >>>> (that's what for that test was created) and we are doing some optimizations here too >>>> to avoid excessive atomic updates. >>>> BTW, if the problem is not reproducible without mbuf refcnt, >>>> can I suggest to extend the test with: >>>> - add a check that enqueue() operation was successful >>>> - walk through the pool and check/printf refcnt of each mbuf. >>>> Hopefully that would give us some extra information what is going wrong here. >>>> Konstantin >>>> >>>> >>> Currently, the issue is only found in this case here on the ARM >>> platform, not sure how it is going with the X86_64 platform >> I understand that it is only reproducible on arm so far. >> What I am asking - with dpdk is there any other way to reproduce it (on arm) >> except then running mbuf_autotest? >> Something really simple that not using mbuf/mempool etc? >> Just do dequeue/enqueue from multiple threads and check data integrity at the end? >> If not - what makes you think that the problem is precisely in rte_ring code? >> Why not in rte_mbuf let say? > Actually I reread your original mail and finally get your point. > If I understand you correctly the problem with read reordering here is that > after we read prot.tail but before we read cons.head > both cons.head and prod.tail might be updated, > but for us prod.tail change might be unnoticed. 
> As an example: > time 0 (cons.head == 0, prod.tail == 0): > prod_tail = r->prod.tail; /* due read reordering */ > /* prod_tail == 0 */ > > time 1 (cons.head ==5, prod.tail == 5): > *old_head = r->cons.head; > /* cons.head == 5 */ > *entries = (prod_tail - *old_head); > /* *entries == (0 - 5) == 0xFFFFFFFB */ > > And we will move our cons.head forward, even though there are no filled entries in the ring. > Is that right? Yes > As I side notice, I wonder why we have here: > *entries = (prod_tail - *old_head); > instead of: > *entries = r->size + prod_tail - *old_head; > ? Yes, I agree with you on this code line. But reordering will still mess things up even after this change (I have tested; it is still the same as before). I thought *entries was a gate to prevent the consumer from moving forward faster than the producer. In some cases it is correct that prod_tail is smaller than *old_head, due to the circular queue; in other cases it is incorrect because of read/read reordering. AFAICT, the root cause here is that the producer tail and consumer head are dependent on each other. Thus a memory barrier is necessary. Cheers, Jia > > Konstantin > >>> . In another >>> mail of this thread, we've made a simple test based on this and captured >>> some information and I pasted there.(I pasted the patch there :-)) >> Are you talking about that one: >> http://dpdk.org/dev/patchwork/patch/30405/ >> ? >> It still uses test/test/test_mbuf.c..., >> but anyway I don't really understand how mbuf_autotest supposed >> to work with these changes: >> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, >> rte_ring_enqueue(refcnt_mbuf_ring, m); >> } >> } >> - rte_pktmbuf_free(m); >> + // rte_pktmbuf_free(m); >> } >> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, >> while (!rte_ring_empty(refcnt_mbuf_ring)) >> ; >> >> + if (NULL != m) { >> + if (1 != rte_mbuf_refcnt_read(m)) >> + printf("m ref is %u\n", rte_mbuf_refcnt_read(m)); >> + rte_pktmbuf_free(m); >> + } >> + >> /* check that all mbufs are back into mempool by now */ >> for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) { >> if ((i = rte_mempool_avail_count(refcnt_pool)) == n) { >> >> That means all your mbufs (except the last one) will still be allocated. >> So the test would fail - as it should, I think. >> >>> And >>> it seems that Juhamatti & Jacod found some reverting action several >>> months ago. >> Didn't get that one either. >> Konstantin
-----Original Message----- > Date: Fri, 20 Oct 2017 09:57:58 +0800 > From: Jia He <hejianet@gmail.com> > To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing" > <ilovethull@163.com>, Jerin Jacob <jerin.jacob@caviumnetworks.com> > Cc: Olivier MATZ <olivier.matz@6wind.com>, "dev@dpdk.org" <dev@dpdk.org>, > "jia.he@hxt-semitech.com" <jia.he@hxt-semitech.com>, > "jie2.liu@hxt-semitech.com" <jie2.liu@hxt-semitech.com>, > "bing.zhao@hxt-semitech.com" <bing.zhao@hxt-semitech.com>, "Richardson, > Bruce" <bruce.richardson@intel.com> > Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod > loading when doing enqueue/dequeue > User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 > Thunderbird/52.4.0 > > > > On 10/20/2017 4:02 AM, Ananyev, Konstantin Wrote: > > > > > -----Original Message----- > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin > > > Sent: Thursday, October 19, 2017 3:15 PM > > > To: Zhao, Bing <ilovethull@163.com>; Jia He <hejianet@gmail.com>; Jerin Jacob <jerin.jacob@caviumnetworks.com> > > > Cc: Olivier MATZ <olivier.matz@6wind.com>; dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com; bing.zhao@hxt- > > > semitech.com > > > Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue > > > > > > > Hi, > > > > > > > > On 2017/10/19 18:02, Ananyev, Konstantin wrote: > > > > > Hi Jia, > > > > > > > > > > > Hi > > > > > > > > > > > > > > > > > > On 10/13/2017 9:02 AM, Jia He Wrote: > > > > > > > Hi Jerin > > > > > > > > > > > > > > > > > > > > > On 10/13/2017 1:23 AM, Jerin Jacob Wrote: > > > > > > > > -----Original Message----- > > > > > > > > > Date: Thu, 12 Oct 2017 17:05:50 +0000 > > > > > > > > > > > > > > > [...] > > > > > > > > On the same lines, > > > > > > > > > > > > > > > > Jia He, jie2.liu, bing.zhao, > > > > > > > > > > > > > > > > Is this patch based on code review or do you saw this issue on any of > > > > > > > > the > > > > > > > > arm/ppc target? arm64 will have performance impact with this change. > > > > > > sorry, miss one important information > > > > > > Our platform is an aarch64 server with 46 cpus. > > > > > > If we reduced the involved cpu numbers, the bug occurred less frequently. > > > > > > > > > > > > Yes, mb barrier impact the performance, but correctness is more > > > > > > important, isn't it ;-) > > > > > > Maybe we can find any other lightweight barrier here? > > > > > > > > > > > > Cheers, > > > > > > Jia > > > > > > > Based on mbuf_autotest, the rte_panic will be invoked in seconds. > > > > > > > > > > > > > > PANIC in test_refcnt_iter(): > > > > > > > (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free > > > > > > > 1: [./test(rte_dump_stack+0x38) [0x58d868]] > > > > > > > Aborted (core dumped) > > > > > > > > > > > > So is it only reproducible with mbuf refcnt test? > > > > > Could it be reproduced with some 'pure' ring test > > > > > (no mempools/mbufs refcnt, etc.)? > > > > > The reason I am asking - in that test we also have mbuf refcnt updates > > > > > (that's what for that test was created) and we are doing some optimizations here too > > > > > to avoid excessive atomic updates. > > > > > BTW, if the problem is not reproducible without mbuf refcnt, > > > > > can I suggest to extend the test with: > > > > > - add a check that enqueue() operation was successful > > > > > - walk through the pool and check/printf refcnt of each mbuf. 
> > > > > Hopefully that would give us some extra information what is going wrong here. > > > > > Konstantin > > > > > > > > > > Currently, the issue is only found in this case here on the ARM > > > > platform, not sure how it is going with the X86_64 platform > > > I understand that it is only reproducible on arm so far. > > > What I am asking - with dpdk is there any other way to reproduce it (on arm) > > > except then running mbuf_autotest? > > > Something really simple that not using mbuf/mempool etc? > > > Just do dequeue/enqueue from multiple threads and check data integrity at the end? > > > If not - what makes you think that the problem is precisely in rte_ring code? > > > Why not in rte_mbuf let say? > > Actually I reread your original mail and finally get your point. > > If I understand you correctly the problem with read reordering here is that > > after we read prot.tail but before we read cons.head > > both cons.head and prod.tail might be updated, > > but for us prod.tail change might be unnoticed. > > As an example: > > time 0 (cons.head == 0, prod.tail == 0): > > prod_tail = r->prod.tail; /* due read reordering */ > > /* prod_tail == 0 */ > > > > time 1 (cons.head ==5, prod.tail == 5): > > *old_head = r->cons.head; > > /* cons.head == 5 */ > > *entries = (prod_tail - *old_head); > > /* *entries == (0 - 5) == 0xFFFFFFFB */ > > > > And we will move our cons.head forward, even though there are no filled entries in the ring. > > Is that right? > Yes > > As I side notice, I wonder why we have here: > > *entries = (prod_tail - *old_head); > > instead of: > > *entries = r->size + prod_tail - *old_head; > > ? > Yes, I agree with you at this code line. > But reordering will still mess up things even after this change(I have > tested, still the same as before) > I thought the *entries is a door to prevent consumer from moving forward too > fast than the producer. > But in some cases, it is correct that prod_tail is smaller than *old_head > due to the cirular queue. > In other cases, it is incorrect because of read/read reordering. > > AFAICT, the root cause here is the producer tail and cosumer head are > dependant on each other. > Thus a memory barrier is neccessary. Yes. The barrier is necessary. In fact, upstream freebsd fixed this issue for arm64. The DPDK ring implementation is derived from freebsd's buf_ring.h. https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 I think the only outstanding issue is how to reduce the performance impact for arm64. I believe using acquire/release semantics instead of rte_smp_rmb() will reduce the performance overhead, like the similar ring implementations below, freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c Jia, 1) Can you verify that the use of acquire/release semantics fixes the problem on your platform? e.g. the use of atomic_load_acq* in the reference code. 2) If so, what is the overhead between acquire/release and plain smp_rmb() barriers? Based on that we need to decide which path to take. Note: This issue won't occur on all arm64 implementations; it occurs on arm64 implementations with OOO (out-of-order) execution. > > Cheers, > Jia > > > > Konstantin > > > > > > . 
In another > > > > mail of this thread, we've made a simple test based on this and captured > > > > some information and I pasted there.(I pasted the patch there :-)) > > > Are you talking about that one: > > > http://dpdk.org/dev/patchwork/patch/30405/ > > > ? > > > It still uses test/test/test_mbuf.c..., > > > but anyway I don't really understand how mbuf_autotest supposed > > > to work with these changes: > > > @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, > > > rte_ring_enqueue(refcnt_mbuf_ring, m); > > > } > > > } > > > - rte_pktmbuf_free(m); > > > + // rte_pktmbuf_free(m); > > > } > > > @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, > > > while (!rte_ring_empty(refcnt_mbuf_ring)) > > > ; > > > > > > + if (NULL != m) { > > > + if (1 != rte_mbuf_refcnt_read(m)) > > > + printf("m ref is %u\n", rte_mbuf_refcnt_read(m)); > > > + rte_pktmbuf_free(m); > > > + } > > > + > > > /* check that all mbufs are back into mempool by now */ > > > for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) { > > > if ((i = rte_mempool_avail_count(refcnt_pool)) == n) { > > > > > > That means all your mbufs (except the last one) will still be allocated. > > > So the test would fail - as it should, I think. > > > > > > > And > > > > it seems that Juhamatti & Jacod found some reverting action several > > > > months ago. > > > Didn't get that one either. > > > Konstantin >
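As a side illustration of the acquire idea Jerin raises: on arm64 an acquire load compiles to LDAR, and with GCC/Clang builtins the two loads could be sketched as below. This is only a sketch of the semantics under discussion, with a simplified stand-in struct, not the actual rte_ring code:

    #include <stdint.h>

    struct headtail {
            volatile uint32_t head;
            volatile uint32_t tail;
    };

    /* The acquire on the cons.head load forbids the later prod.tail
     * load from being reordered before it - the same ordering the
     * rte_smp_rmb() patch enforces with a full read barrier. */
    static inline uint32_t
    ring_count(const struct headtail *prod, const struct headtail *cons,
               uint32_t *old_head)
    {
            *old_head = __atomic_load_n(&cons->head, __ATOMIC_ACQUIRE);
            return __atomic_load_n(&prod->tail, __ATOMIC_ACQUIRE) - *old_head;
    }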
> >> > >>> Hi, > >>> > >>> On 2017/10/19 18:02, Ananyev, Konstantin wrote: > >>>> Hi Jia, > >>>> > >>>>> Hi > >>>>> > >>>>> > >>>>> On 10/13/2017 9:02 AM, Jia He Wrote: > >>>>>> Hi Jerin > >>>>>> > >>>>>> > >>>>>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote: > >>>>>>> -----Original Message----- > >>>>>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000 > >>>>>>>> > >>>>> [...] > >>>>>>> On the same lines, > >>>>>>> > >>>>>>> Jia He, jie2.liu, bing.zhao, > >>>>>>> > >>>>>>> Is this patch based on code review or do you saw this issue on any of > >>>>>>> the > >>>>>>> arm/ppc target? arm64 will have performance impact with this change. > >>>>> sorry, miss one important information > >>>>> Our platform is an aarch64 server with 46 cpus. > >>>>> If we reduced the involved cpu numbers, the bug occurred less frequently. > >>>>> > >>>>> Yes, mb barrier impact the performance, but correctness is more > >>>>> important, isn't it ;-) > >>>>> Maybe we can find any other lightweight barrier here? > >>>>> > >>>>> Cheers, > >>>>> Jia > >>>>>> Based on mbuf_autotest, the rte_panic will be invoked in seconds. > >>>>>> > >>>>>> PANIC in test_refcnt_iter(): > >>>>>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free > >>>>>> 1: [./test(rte_dump_stack+0x38) [0x58d868]] > >>>>>> Aborted (core dumped) > >>>>>> > >>>> So is it only reproducible with mbuf refcnt test? > >>>> Could it be reproduced with some 'pure' ring test > >>>> (no mempools/mbufs refcnt, etc.)? > >>>> The reason I am asking - in that test we also have mbuf refcnt updates > >>>> (that's what for that test was created) and we are doing some optimizations here too > >>>> to avoid excessive atomic updates. > >>>> BTW, if the problem is not reproducible without mbuf refcnt, > >>>> can I suggest to extend the test with: > >>>> - add a check that enqueue() operation was successful > >>>> - walk through the pool and check/printf refcnt of each mbuf. > >>>> Hopefully that would give us some extra information what is going wrong here. > >>>> Konstantin > >>>> > >>>> > >>> Currently, the issue is only found in this case here on the ARM > >>> platform, not sure how it is going with the X86_64 platform > >> I understand that it is only reproducible on arm so far. > >> What I am asking - with dpdk is there any other way to reproduce it (on arm) > >> except then running mbuf_autotest? > >> Something really simple that not using mbuf/mempool etc? > >> Just do dequeue/enqueue from multiple threads and check data integrity at the end? > >> If not - what makes you think that the problem is precisely in rte_ring code? > >> Why not in rte_mbuf let say? > > Actually I reread your original mail and finally get your point. > > If I understand you correctly the problem with read reordering here is that > > after we read prot.tail but before we read cons.head > > both cons.head and prod.tail might be updated, > > but for us prod.tail change might be unnoticed. > > As an example: > > time 0 (cons.head == 0, prod.tail == 0): > > prod_tail = r->prod.tail; /* due read reordering */ > > /* prod_tail == 0 */ > > > > time 1 (cons.head ==5, prod.tail == 5): > > *old_head = r->cons.head; > > /* cons.head == 5 */ > > *entries = (prod_tail - *old_head); > > /* *entries == (0 - 5) == 0xFFFFFFFB */ > > > > And we will move our cons.head forward, even though there are no filled entries in the ring. > > Is that right? > Yes Ok, thanks for your patience - it took me a while to understand what you guys are talking about here. 
> > As I side notice, I wonder why we have here: > > *entries = (prod_tail - *old_head); > > instead of: > > *entries = r->size + prod_tail - *old_head; > > ? > Yes, I agree with you at this code line. > But reordering will still mess up things even after this change(I have > tested, still the same as before) Yes, I am convinced now that rmb is necessary here :) As I said this is just a different thing I noticed while looking at the code. Probably should discuss it in a different thread. Konstantin > I thought the *entries is a door to prevent consumer from moving forward > too fast than the producer. > But in some cases, it is correct that prod_tail is smaller than > *old_head due to the cirular queue. > In other cases, it is incorrect because of read/read reordering. > > AFAICT, the root cause here is the producer tail and cosumer head are > dependant on each other. > Thus a memory barrier is neccessary. > > Cheers, > Jia > > > > > Konstantin > > > >>> . In another > >>> mail of this thread, we've made a simple test based on this and captured > >>> some information and I pasted there.(I pasted the patch there :-)) > >> Are you talking about that one: > >> http://dpdk.org/dev/patchwork/patch/30405/ > >> ? > >> It still uses test/test/test_mbuf.c..., > >> but anyway I don't really understand how mbuf_autotest supposed > >> to work with these changes: > >> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, > >> rte_ring_enqueue(refcnt_mbuf_ring, m); > >> } > >> } > >> - rte_pktmbuf_free(m); > >> + // rte_pktmbuf_free(m); > >> } > >> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, > >> while (!rte_ring_empty(refcnt_mbuf_ring)) > >> ; > >> > >> + if (NULL != m) { > >> + if (1 != rte_mbuf_refcnt_read(m)) > >> + printf("m ref is %u\n", rte_mbuf_refcnt_read(m)); > >> + rte_pktmbuf_free(m); > >> + } > >> + > >> /* check that all mbufs are back into mempool by now */ > >> for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) { > >> if ((i = rte_mempool_avail_count(refcnt_pool)) == n) { > >> > >> That means all your mbufs (except the last one) will still be allocated. > >> So the test would fail - as it should, I think. > >> > >>> And > >>> it seems that Juhamatti & Jacod found some reverting action several > >>> months ago. > >> Didn't get that one either. > >> Konstantin
Hi Jerin On 10/20/2017 1:43 PM, Jerin Jacob Wrote: > -----Original Message----- >> [...] >> dependant on each other. >> Thus a memory barrier is neccessary. > Yes. The barrier is necessary. > In fact, upstream freebsd fixed this issue for arm64. DPDK ring > implementation is derived from freebsd's buf_ring.h. > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 > > I think, the only outstanding issue is, how to reduce the performance > impact for arm64. I believe using accurate/release semantics instead > of rte_smp_rmb() will reduce the performance overhead like similar ring implementations below, > freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 > odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c > > Jia, > 1) Can you verify the use of accurate/release semantics fixes the problem in your > platform? like use of atomic_load_acq* in the reference code. > 2) If so, What is the overhead between accurate/release and plane smp_smb() > barriers. Based on that we need decide what path to take. I've tested 3 cases. The new 3rd case uses the load_acquire barrier (half barrier) you mentioned at the above link. The patch looks like: @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, /* Reset n to the initial burst count */ n = max; - *old_head = r->prod.head; - const uint32_t cons_tail = r->cons.tail; + *old_head = atomic_load_acq_32(&r->prod.head); + const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail); @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s /* Restore n as it may change every loop */ n = max; - *old_head = r->cons.head; - const uint32_t prod_tail = r->prod.tail; + *old_head = atomic_load_acq_32(&r->cons.head); + const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail) /* The subtraction is done between two unsigned 32bits value * (the result is always modulo 32 bits even if we have * cons_head > prod_tail). So 'entries' is always between 0 * and size(ring)-1. */ The half-barrier patch passed the functional test. 
As for the performance comparison on *arm64* (the debug patch is at http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see the test results below: [case 1] old code, no barrier ============================================ Performance counter stats for './test --no-huge -l 1-10': 689275.001200 task-clock (msec) # 9.771 CPUs utilized 6223 context-switches # 0.009 K/sec 10 cpu-migrations # 0.000 K/sec 653 page-faults # 0.001 K/sec 1721190914583 cycles # 2.497 GHz 3363238266430 instructions # 1.95 insn per cycle <not supported> branches 27804740 branch-misses # 0.00% of all branches 70.540618825 seconds time elapsed [case 2] full barrier with rte_smp_rmb() ============================================ Performance counter stats for './test --no-huge -l 1-10': 582557.895850 task-clock (msec) # 9.752 CPUs utilized 5242 context-switches # 0.009 K/sec 10 cpu-migrations # 0.000 K/sec 665 page-faults # 0.001 K/sec 1454360730055 cycles # 2.497 GHz 587197839907 instructions # 0.40 insn per cycle <not supported> branches 27799687 branch-misses # 0.00% of all branches 59.735582356 seconds time elapse [case 3] half barrier with load_acquire ============================================ Performance counter stats for './test --no-huge -l 1-10': 660758.877050 task-clock (msec) # 9.764 CPUs utilized 5982 context-switches # 0.009 K/sec 11 cpu-migrations # 0.000 K/sec 657 page-faults # 0.001 K/sec 1649875318044 cycles # 2.497 GHz 591583257765 instructions # 0.36 insn per cycle <not supported> branches 27994903 branch-misses # 0.00% of all branches 67.672855107 seconds time elapsed Please see the context-switches in the perf results. The test results sorted by running time are: full barrier < half barrier < no barrier. AFAICT, in this case the CPU reordering increases the chance of context switching and thereby increases the running time. Any ideas? Cheers, Jia > > Note: > This issue wont come in all the arm64 > implementation. it comes on arm64 > implementation with OOO(out of order) implementations. > > >> Cheers, >> Jia >> >>> Konstantin >>> >>>>> . In another >>>>> mail of this thread, we've made a simple test based on this and captured >>>>> some information and I pasted there.(I pasted the patch there :-)) >>>> Are you talking about that one: >>>> http://dpdk.org/dev/patchwork/patch/30405/ >>>> ? >>>> It still uses test/test/test_mbuf.c..., >>>> but anyway I don't really understand how mbuf_autotest supposed >>>> to work with these changes: >>>> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, >>>> rte_ring_enqueue(refcnt_mbuf_ring, m); >>>> } >>>> } >>>> - rte_pktmbuf_free(m); >>>> + // rte_pktmbuf_free(m); >>>> } >>>> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter, >>>> while (!rte_ring_empty(refcnt_mbuf_ring)) >>>> ; >>>> >>>> + if (NULL != m) { >>>> + if (1 != rte_mbuf_refcnt_read(m)) >>>> + printf("m ref is %u\n", rte_mbuf_refcnt_read(m)); >>>> + rte_pktmbuf_free(m); >>>> + } >>>> + >>>> /* check that all mbufs are back into mempool by now */ >>>> for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) { >>>> if ((i = rte_mempool_avail_count(refcnt_pool)) == n) { >>>> >>>> That means all your mbufs (except the last one) will still be allocated. >>>> So the test would fail - as it should, I think. >>>> >>>>> And >>>>> it seems that Juhamatti & Jacod found some reverting action several >>>>> months ago. >>>> Didn't get that one either. >>>> Konstantin
Kuusisaari, Juhamatti (Infinera - FI/Espoo)
Oct. 23, 2017, 9:05 a.m. UTC
Hi, > On 10/20/2017 1:43 PM, Jerin Jacob Wrote: > > -----Original Message----- > >> > [...] > >> dependant on each other. > >> Thus a memory barrier is neccessary. > > Yes. The barrier is necessary. > > In fact, upstream freebsd fixed this issue for arm64. DPDK ring > > implementation is derived from freebsd's buf_ring.h. > > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 > > > > I think, the only outstanding issue is, how to reduce the performance > > impact for arm64. I believe using accurate/release semantics instead > > of rte_smp_rmb() will reduce the performance overhead like similar > > ring implementations below, > > freebsd: > > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 > > odp: > > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio > > /ring.c > > > > Jia, > > 1) Can you verify the use of accurate/release semantics fixes the > > problem in your platform? like use of atomic_load_acq* in the reference > code. > > 2) If so, What is the overhead between accurate/release and plane > > smp_smb() barriers. Based on that we need decide what path to take. > I've tested 3 cases. The new 3rd case is to use the load_acquire barrier (half > barrier) you mentioned at above link. > The patch seems like: > @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int > is_sp, > /* Reset n to the initial burst count */ > n = max; > > - *old_head = r->prod.head; > - const uint32_t cons_tail = r->cons.tail; > + *old_head = atomic_load_acq_32(&r->prod.head); > + const uint32_t cons_tail = > atomic_load_acq_32(&r->cons.tail); > > @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int > is_s > /* Restore n as it may change every loop */ > n = max; > > - *old_head = r->cons.head; > - const uint32_t prod_tail = r->prod.tail; > + *old_head = atomic_load_acq_32(&r->cons.head); > + const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail) > /* The subtraction is done between two unsigned 32bits > value > * (the result is always modulo 32 bits even if we have > * cons_head > prod_tail). So 'entries' is always between 0 > * and size(ring)-1. */ > > The half barrier patch passed the fuctional test. 
> > As for the performance comparision on *arm64*(the debug patch is at > http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see > the test results > below: > > [case 1] old codes, no barrier > ============================================ > Performance counter stats for './test --no-huge -l 1-10': > > 689275.001200 task-clock (msec) # 9.771 CPUs utilized > 6223 context-switches # 0.009 K/sec > 10 cpu-migrations # 0.000 K/sec > 653 page-faults # 0.001 K/sec > 1721190914583 cycles # 2.497 GHz > 3363238266430 instructions # 1.95 insn per > cycle > <not supported> branches > 27804740 branch-misses # 0.00% of all > branches > > 70.540618825 seconds time elapsed > > [case 2] full barrier with rte_smp_rmb() > ============================================ > Performance counter stats for './test --no-huge -l 1-10': > > 582557.895850 task-clock (msec) # 9.752 CPUs utilized > 5242 context-switches # 0.009 K/sec > 10 cpu-migrations # 0.000 K/sec > 665 page-faults # 0.001 K/sec > 1454360730055 cycles # 2.497 GHz > 587197839907 instructions # 0.40 insn per > cycle > <not supported> branches > 27799687 branch-misses # 0.00% of all > branches > > 59.735582356 seconds time elapse > > [case 1] half barrier with load_acquire > ============================================ > Performance counter stats for './test --no-huge -l 1-10': > > 660758.877050 task-clock (msec) # 9.764 CPUs utilized > 5982 context-switches # 0.009 K/sec > 11 cpu-migrations # 0.000 K/sec > 657 page-faults # 0.001 K/sec > 1649875318044 cycles # 2.497 GHz > 591583257765 instructions # 0.36 insn per > cycle > <not supported> branches > 27994903 branch-misses # 0.00% of all > branches > > 67.672855107 seconds time elapsed > > Please see the context-switches in the perf results > test result sorted by time is: > full barrier < half barrier < no barrier > > AFAICT, in this case ,the cpu reordering will add the possibility for > context switching and > increase the running time. > > Any ideas? You could also try a case where you rearrange the rte_ring structure so that prod.head and cons.tail are parts of a union of a 64-bit variable and then read this 64-bit variable with one atomic read. I do not think that a half barrier is even needed here with this approach, as long as you can really read the 64-bit variable fully at once. This should speed things up. Cheers, -- Juhamatti > Cheers, > Jia > > > > > Note: > > This issue wont come in all the arm64 > > implementation. it comes on arm64 > > implementation with OOO(out of order) implementations. > > > > > >> Cheers, > >> Jia > >> > >>> Konstantin > >>> > >>>>> . In another > >>>>> mail of this thread, we've made a simple test based on this and > captured > >>>>> some information and I pasted there.(I pasted the patch there :-)) > >>>> Are you talking about that one: > >>>> http://dpdk.org/dev/patchwork/patch/30405/ > >>>> ? 
> >>>> It still uses test/test/test_mbuf.c..., > >>>> but anyway I don't really understand how mbuf_autotest supposed > >>>> to work with these changes: > >>>> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int > iter, > >>>> rte_ring_enqueue(refcnt_mbuf_ring, m); > >>>> } > >>>> } > >>>> - rte_pktmbuf_free(m); > >>>> + // rte_pktmbuf_free(m); > >>>> } > >>>> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned > int iter, > >>>> while (!rte_ring_empty(refcnt_mbuf_ring)) > >>>> ; > >>>> > >>>> + if (NULL != m) { > >>>> + if (1 != rte_mbuf_refcnt_read(m)) > >>>> + printf("m ref is %u\n", rte_mbuf_refcnt_read(m)); > >>>> + rte_pktmbuf_free(m); > >>>> + } > >>>> + > >>>> /* check that all mbufs are back into mempool by now */ > >>>> for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) { > >>>> if ((i = rte_mempool_avail_count(refcnt_pool)) == n) { > >>>> > >>>> That means all your mbufs (except the last one) will still be allocated. > >>>> So the test would fail - as it should, I think. > >>>> > >>>>> And > >>>>> it seems that Juhamatti & Jacod found some reverting action several > >>>>> months ago. > >>>> Didn't get that one either. > >>>> Konstantin > > -- > Cheers, > Jia
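For illustration, one possible shape of Juhamatti's union idea above - field and type names are hypothetical, and this is a sketch rather than a proposed patch:

    #include <stdint.h>

    /* Overlay the 32-bit pair so that both fields can be observed
     * with a single 64-bit load. */
    union ring_pos {
            uint64_t raw;
            struct {
                    uint32_t prod_head;
                    uint32_t cons_tail;
            };
    };

    static inline union ring_pos
    ring_snapshot(const union ring_pos *p)
    {
            union ring_pos s;

            /* One aligned 64-bit load is single-copy atomic on 64-bit
             * targets, so prod_head and cons_tail come from the same
             * instant; no read/read reordering between them is
             * possible. */
            s.raw = __atomic_load_n(&p->raw, __ATOMIC_RELAXED);
            return s;
    }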
On Mon, Oct 23, 2017 at 09:05:24AM +0000, Kuusisaari, Juhamatti wrote: > > Hi, > > > On 10/20/2017 1:43 PM, Jerin Jacob Wrote: > > > -----Original Message----- > > >> > > [...] > > >> dependant on each other. Thus a memory barrier is neccessary. > > > Yes. The barrier is necessary. In fact, upstream freebsd fixed > > > this issue for arm64. DPDK ring implementation is derived from > > > freebsd's buf_ring.h. > > > > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 > > > > > > I think, the only outstanding issue is, how to reduce the > > > performance impact for arm64. I believe using accurate/release > > > semantics instead of rte_smp_rmb() will reduce the performance > > > overhead like similar ring implementations below, freebsd: > > > > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 > > > odp: > > > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio > > > /ring.c > > > > > > Jia, 1) Can you verify the use of accurate/release semantics fixes > > > the problem in your platform? like use of atomic_load_acq* in the > > > reference > > code. > > > 2) If so, What is the overhead between accurate/release and plane > > > smp_smb() barriers. Based on that we need decide what path to > > > take. > > I've tested 3 cases. The new 3rd case is to use the load_acquire > > barrier (half barrier) you mentioned at above link. The patch seems > > like: @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring > > *r, int is_sp, /* Reset n to the initial burst count > > */ n = max; > > > > - *old_head = r->prod.head; - const > > uint32_t cons_tail = r->cons.tail; + *old_head = > > atomic_load_acq_32(&r->prod.head); + const uint32_t > > cons_tail = atomic_load_acq_32(&r->cons.tail); > > > > @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, > > int is_s /* Restore n as it may change every loop */ > > n = max; > > > > - *old_head = r->cons.head; - const > > uint32_t prod_tail = r->prod.tail; + *old_head = > > atomic_load_acq_32(&r->cons.head); + const uint32_t > > prod_tail = atomic_load_acq_32(&r->prod.tail) /* The > > subtraction is done between two unsigned 32bits value > > * (the result is always modulo 32 bits even if we > > have * cons_head > prod_tail). So 'entries' is > > always between 0 * and size(ring)-1. */ > > > > The half barrier patch passed the fuctional test. 
> > > > As for the performance comparision on *arm64*(the debug patch is at > > http://dpdk.org/ml/archives/dev/2017-October/079012.html), please > > see the test results below: > > > > [case 1] old codes, no barrier > > ============================================ Performance counter > > stats for './test --no-huge -l 1-10': > > > > 689275.001200 task-clock (msec) # 9.771 CPUs > > utilized 6223 context-switches # > > 0.009 K/sec 10 cpu-migrations # > > 0.000 K/sec 653 page-faults # > > 0.001 K/sec 1721190914583 cycles # > > 2.497 GHz 3363238266430 instructions # > > 1.95 insn per cycle <not supported> branches > > 27804740 branch-misses # 0.00% of all branches > > > > 70.540618825 seconds time elapsed > > > > [case 2] full barrier with rte_smp_rmb() > > ============================================ Performance counter > > stats for './test --no-huge -l 1-10': > > > > 582557.895850 task-clock (msec) # 9.752 CPUs > > utilized 5242 context-switches # > > 0.009 K/sec 10 cpu-migrations # > > 0.000 K/sec 665 page-faults # > > 0.001 K/sec 1454360730055 cycles # > > 2.497 GHz 587197839907 instructions # > > 0.40 insn per cycle <not supported> branches > > 27799687 branch-misses # 0.00% of all branches > > > > 59.735582356 seconds time elapse > > > > [case 1] half barrier with load_acquire > > ============================================ Performance counter > > stats for './test --no-huge -l 1-10': > > > > 660758.877050 task-clock (msec) # 9.764 CPUs > > utilized 5982 context-switches # > > 0.009 K/sec 11 cpu-migrations # > > 0.000 K/sec 657 page-faults # > > 0.001 K/sec 1649875318044 cycles # > > 2.497 GHz 591583257765 instructions # > > 0.36 insn per cycle <not supported> branches > > 27994903 branch-misses # 0.00% of all branches > > > > 67.672855107 seconds time elapsed > > > > Please see the context-switches in the perf results test result > > sorted by time is: full barrier < half barrier < no barrier > > > > AFAICT, in this case ,the cpu reordering will add the possibility > > for context switching and increase the running time. > > > > Any ideas? > > You could also try a case where you rearrange the rte_ring structure > so that prod.head and cons.tail are parts of an union of a 64-bit > variable and then read this 64-bit variable with one atomic read. I do > not think that half barrier is even needed here with this approach, as > long as you can really read the 64-bit variable fully at once. This > should speed up. > > Cheers, -- Yes, that would work, but unfortunately it means mixing producer and consumer data onto the same cacheline, which may have other negative effects on performance (depending on arch/platform) due to cores invalidating each other's cachelines. /Bruce
-----Original Message----- > Date: Mon, 23 Oct 2017 16:49:01 +0800 > From: Jia He <hejianet@gmail.com> > To: Jerin Jacob <jerin.jacob@caviumnetworks.com> > Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing" > <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>, > "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com" > <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com" > <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com" > <bing.zhao@hxt-semitech.com>, "Richardson, Bruce" > <bruce.richardson@intel.com> > Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod > loading when doing enqueue/dequeue > User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 > Thunderbird/52.4.0 > > Hi Jerin > > > On 10/20/2017 1:43 PM, Jerin Jacob Wrote: > > -----Original Message----- > > > > [...] > > > dependant on each other. > > > Thus a memory barrier is neccessary. > > Yes. The barrier is necessary. > > In fact, upstream freebsd fixed this issue for arm64. DPDK ring > > implementation is derived from freebsd's buf_ring.h. > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 > > > > I think, the only outstanding issue is, how to reduce the performance > > impact for arm64. I believe using accurate/release semantics instead > > of rte_smp_rmb() will reduce the performance overhead like similar ring implementations below, > > freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 > > odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c > > > > Jia, > > 1) Can you verify the use of accurate/release semantics fixes the problem in your > > platform? like use of atomic_load_acq* in the reference code. > > 2) If so, What is the overhead between accurate/release and plane smp_smb() > > barriers. Based on that we need decide what path to take. > I've tested 3 cases. The new 3rd case is to use the load_acquire barrier > (half barrier) you mentioned > at above link. > The patch seems like: > @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, > /* Reset n to the initial burst count */ > n = max; > > - *old_head = r->prod.head; > - const uint32_t cons_tail = r->cons.tail; > + *old_head = atomic_load_acq_32(&r->prod.head); > + const uint32_t cons_tail = > atomic_load_acq_32(&r->cons.tail); > > @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s > /* Restore n as it may change every loop */ > n = max; > > - *old_head = r->cons.head; > - const uint32_t prod_tail = r->prod.tail; > + *old_head = atomic_load_acq_32(&r->cons.head); > + const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail) > /* The subtraction is done between two unsigned 32bits value > * (the result is always modulo 32 bits even if we have > * cons_head > prod_tail). So 'entries' is always between 0 > * and size(ring)-1. */ > > The half barrier patch passed the fuctional test. 
> > As for the performance comparision on *arm64*(the debug patch is at > http://dpdk.org/ml/archives/dev/2017-October/079012.html), please > see the test results below: > > [case 1] old codes, no barrier > ============================================ Performance counter > stats for './test --no-huge -l 1-10': > > 689275.001200 task-clock (msec) # 9.771 CPUs > utilized 6223 context-switches # > 0.009 K/sec 10 cpu-migrations # > 0.000 K/sec 653 page-faults # > 0.001 K/sec 1721190914583 cycles # > 2.497 GHz 3363238266430 instructions # > 1.95 insn per cycle <not supported> branches > 27804740 branch-misses # 0.00% of all branches > > 70.540618825 seconds time elapsed > > [case 2] full barrier with rte_smp_rmb() > ============================================ Performance counter > stats for './test --no-huge -l 1-10': > > 582557.895850 task-clock (msec) # 9.752 CPUs > utilized 5242 context-switches # > 0.009 K/sec 10 cpu-migrations # > 0.000 K/sec 665 page-faults # > 0.001 K/sec 1454360730055 cycles # > 2.497 GHz 587197839907 instructions # > 0.40 insn per cycle <not supported> branches > 27799687 branch-misses # 0.00% of all branches > > 59.735582356 seconds time elapse > > [case 1] half barrier with load_acquire > ============================================ Performance counter > stats for './test --no-huge -l 1-10': > > 660758.877050 task-clock (msec) # 9.764 CPUs > utilized 5982 context-switches # > 0.009 K/sec 11 cpu-migrations # > 0.000 K/sec 657 page-faults # > 0.001 K/sec 1649875318044 cycles # > 2.497 GHz 591583257765 instructions # > 0.36 insn per cycle <not supported> branches > 27994903 branch-misses # 0.00% of all branches > > 67.672855107 seconds time elapsed > > Please see the context-switches in the perf results test result > sorted by time is: full barrier < half barrier < no barrier > > AFAICT, in this case ,the cpu reordering will add the possibility > for context switching and increase the running time. > > Any ideas? Regarding the performance test, it is better to use the ring perf test case on _isolated_ cores to measure the impact on the number of enqueue/dequeue operations. example: ./build/app/test -c 0xff -n 4 >>ring_perf_autotest By default, arm64+dpdk will be using the el0 counter to measure the cycles. I think in your SoC it will be running at 50MHz or 100MHz. So, you can follow the scheme below to get accurate cycle measurements: See: http://dpdk.org/doc/guides/prog_guide/profile_app.html check: 44.2.2. High-resolution cycle counter
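For reference, a sketch of the two arm64 cycle sources the profiling guide distinguishes (GCC inline asm; enabling user-space PMU access requires kernel-side setup, as the guide notes):

    #include <stdint.h>

    /* Generic timer counter: always readable from EL0, but typically
     * runs at a fixed low frequency (e.g. 50/100 MHz). */
    static inline uint64_t
    read_cntvct(void)
    {
            uint64_t v;

            asm volatile("mrs %0, cntvct_el0" : "=r" (v));
            return v;
    }

    /* PMU cycle counter: counts real CPU cycles, but user-space access
     * must first be enabled from the kernel side. */
    static inline uint64_t
    read_pmccntr(void)
    {
            uint64_t v;

            asm volatile("mrs %0, pmccntr_el0" : "=r" (v));
            return v;
    }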
Hi Jerin On 10/23/2017 6:06 PM, Jerin Jacob Wrote: > -----Original Message----- >> Date: Mon, 23 Oct 2017 16:49:01 +0800 >> From: Jia He <hejianet@gmail.com> >> To: Jerin Jacob <jerin.jacob@caviumnetworks.com> >> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing" >> <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>, >> "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com" >> <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com" >> <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com" >> <bing.zhao@hxt-semitech.com>, "Richardson, Bruce" >> <bruce.richardson@intel.com> >> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod >> loading when doing enqueue/dequeue >> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 >> Thunderbird/52.4.0 >> >> Hi Jerin >> >> >> On 10/20/2017 1:43 PM, Jerin Jacob Wrote: >>> -----Original Message----- >> [...] >>>> dependant on each other. >>>> Thus a memory barrier is neccessary. >>> Yes. The barrier is necessary. >>> In fact, upstream freebsd fixed this issue for arm64. DPDK ring >>> implementation is derived from freebsd's buf_ring.h. >>> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 >>> >>> I think, the only outstanding issue is, how to reduce the performance >>> impact for arm64. I believe using accurate/release semantics instead >>> of rte_smp_rmb() will reduce the performance overhead like similar ring implementations below, >>> freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 >>> odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c >>> >>> Jia, >>> 1) Can you verify the use of accurate/release semantics fixes the problem in your >>> platform? like use of atomic_load_acq* in the reference code. >>> 2) If so, What is the overhead between accurate/release and plane smp_smb() >>> barriers. Based on that we need decide what path to take. >> I've tested 3 cases. The new 3rd case is to use the load_acquire barrier >> (half barrier) you mentioned >> at above link. >> The patch seems like: >> @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp, >> /* Reset n to the initial burst count */ >> n = max; >> >> - *old_head = r->prod.head; >> - const uint32_t cons_tail = r->cons.tail; >> + *old_head = atomic_load_acq_32(&r->prod.head); >> + const uint32_t cons_tail = >> atomic_load_acq_32(&r->cons.tail); >> >> @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s >> /* Restore n as it may change every loop */ >> n = max; >> >> - *old_head = r->cons.head; >> - const uint32_t prod_tail = r->prod.tail; >> + *old_head = atomic_load_acq_32(&r->cons.head); >> + const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail) >> /* The subtraction is done between two unsigned 32bits value >> * (the result is always modulo 32 bits even if we have >> * cons_head > prod_tail). So 'entries' is always between 0 >> * and size(ring)-1. */ >> >> The half barrier patch passed the fuctional test. 
>> [...]
>
> Regarding the performance test, it is better to use the ring perf test
> case on _isolated_ cores to measure the impact on the number of
> enqueue/dequeue operations.
>
> example:
> ./build/app/test -c 0xff -n 4
>>> ring_perf_autotest

It seems that on our arm64 server the ring_perf_autotest finishes in a
few seconds. Is anything wrong with the configuration or environment
setup?

root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
EAL: Detected 44 lcore(s)
EAL: Probing VFIO support...
APP: HPET is not enabled, using TSC as default timer
RTE>>per_lcore_autotest
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 2
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.02
MC empty dequeue: 0.04

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.31
SP/SC bulk enq/dequeue (size: 32): 0.05
MP/MC bulk enq/dequeue (size: 32): 0.09

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.39
SP/SC bulk enq/dequeue (size: 32): 0.04
MP/MC bulk enq/dequeue (size: 32): 0.12

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 0.37
MP/MC bulk enq/dequeue (size: 8): 0.92
SP/SC bulk enq/dequeue (size: 32): 0.12
MP/MC bulk enq/dequeue (size: 32): 0.26
Test OK
RTE>>

Cheers,
Jia

> By default, arm64+dpdk will be using the el0 counter to measure the
> cycles.
> [...]
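(For reference: atomic_load_acq_32() in the half-barrier patch above is a
FreeBSD-style acquire-load helper. A minimal sketch of what such a helper
maps to with the GCC builtins, assuming it is not already provided; on
arm64 this compiles to an "ldar" instruction, which keeps later memory
accesses after the load, while on x86 it is an ordinary load:)

#include <stdint.h>

/* sketch of a FreeBSD-style acquire load; assumed helper, not a DPDK API */
static inline uint32_t
atomic_load_acq_32(const volatile uint32_t *p)
{
	/* arm64: "ldar"; later loads/stores cannot be hoisted above this */
	return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}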
-----Original Message-----
> Date: Tue, 24 Oct 2017 10:04:26 +0800
> From: Jia He <hejianet@gmail.com>
> [...]
>
> Hi Jerin

Hi Jia,

>>> example:
>>> ./build/app/test -c 0xff -n 4
>>>>> ring_perf_autotest
> It seems that on our arm64 server the ring_perf_autotest finishes in a
> few seconds:

Yes. It just needs a few seconds.

> Is anything wrong with the configuration or environment setup?

By default, arm64+dpdk will be using the el0 counter to measure the
cycles. I think, in your SoC, it will be running at 50 MHz or 100 MHz.
So, you can follow the scheme below to get accurate cycle measurements:

See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
check: 44.2.2. High-resolution cycle counter

> root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
> [...]
> SP/SC single enq/dequeue: 0

If you follow the above link, the value '0' will be replaced with more
meaningful data.

> [...]
>
> Cheers,
> Jia
Hi Jerin

On 10/25/2017 9:26 PM, Jerin Jacob Wrote:
> -----Original Message-----
> [...]
>
> By default, arm64+dpdk will be using the el0 counter to measure the
> cycles. I think, in your SoC, it will be running at 50 MHz or 100 MHz.
> So, you can follow the scheme below to get accurate cycle measurements:
>
> See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
> check: 44.2.2. High-resolution cycle counter

Thank you for the suggestions.
But I tried the ko module you provided to enable the accurate cycle
measurement in user space, and the test results are as below:

root@nfv-demo01:~/dpdk/build/build/test/test# lsmod |grep pmu
pmu_el0_cycle_counter 262144 0

[old code, without any patches]
============================================
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 0
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.00
MC empty dequeue: 0.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00
Test OK

[with full rte_smp_rmb barrier patch]
======================================
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 0
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.00
MC empty dequeue: 0.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00
Test OK
RTE>>

No difference, all times are 0?
If I rmmod pmu_el0_cycle_counter and revise ./build/.config to comment
out the config line
#CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU=y
then the times are bigger than 0.

>> root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
>> [...]
>> SP/SC single enq/dequeue: 0
> If you follow the above link, the value '0' will be replaced with more
> meaningful data.
>
>> [...]
>>
>> Cheers,
>> Jia
>>> By default, arm64+dpdk will be using the el0 counter to measure the
>>> cycles.
>>> [...]
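(For context on CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU: on arm64, rte_rdtsc()
reads either the low-frequency generic timer or the PMU cycle counter.
A rough sketch of the two reads, paraphrasing the arm64 EAL rather than
quoting it:)

#include <stdint.h>

/* default source: the generic timer; on this SoC it runs at only
 * 20 MHz, which is why per-op averages round down to 0 */
static inline uint64_t
read_cntvct(void)
{
	uint64_t cycles;
	asm volatile("mrs %0, cntvct_el0" : "=r" (cycles));
	return cycles;
}

/* PMU cycle counter: runs at CPU clock, but EL0 access must first be
 * enabled from the kernel (what the pmu_el0_cycle_counter module does) */
static inline uint64_t
read_pmccntr(void)
{
	uint64_t cycles;
	asm volatile("mrs %0, pmccntr_el0" : "=r" (cycles));
	return cycles;
}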
Hi Jerin

What do you think the next step should be? Shall I implement the
load_acquire half barrier as per freebsd, or find some other performance
test case to compare the performance impact?
Thanks for any suggestions.

Cheers,
Jia

On 10/25/2017 9:26 PM, Jerin Jacob Wrote:
> [...]
-----Original Message-----
> Date: Tue, 31 Oct 2017 10:55:15 +0800
> From: Jia He <hejianet@gmail.com>
> [...]
>
> Hi Jerin

Hi Jia,

> What do you think the next step should be? Shall I implement the
> load_acquire half barrier as per freebsd,

I did a quick prototype using the C11 memory model (ACQUIRE/RELEASE)
semantics and tested it on two arm64 platforms in Cavium (Platform A: a
non-OOO arm64 machine, and Platform B: an OOO arm64 machine).

smp_rmb() performs better on Platform A;
acquire/release semantics perform better on Platform B.

Here is the patch:
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch

In terms of next steps:
- I am not sure about the cost associated with acquire/release semantics
on x86 or ppc. IMO, we need to have both options under conditional
compilation flags and let the target platform choose the best one.

Thoughts?

Here are the performance numbers:
- Both platforms are running at different frequencies, so the absolute
numbers do not matter; just check the relative numbers.

Platform A: Performance numbers:
================================
no patch (non-OOO arm64 machine)
-------------------------------

SP/SC single enq/dequeue: 40
MP/MC single enq/dequeue: 282
SP/SC burst enq/dequeue (size: 8): 11
MP/MC burst enq/dequeue (size: 8): 42
SP/SC burst enq/dequeue (size: 32): 8
MP/MC burst enq/dequeue (size: 32): 16

### Testing empty dequeue ###
SC empty dequeue: 8.01
MC empty dequeue: 11.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 11.30
MP/MC bulk enq/dequeue (size: 8): 42.85
SP/SC bulk enq/dequeue (size: 32): 8.25
MP/MC bulk enq/dequeue (size: 32): 16.46

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 20.62
MP/MC bulk enq/dequeue (size: 8): 56.30
SP/SC bulk enq/dequeue (size: 32): 10.94
MP/MC bulk enq/dequeue (size: 32): 18.66
Test OK

# smp_rmb() patch (non-OOO arm64 machine)
http://dpdk.org/dev/patchwork/patch/30029/
-----------------------------------------

SP/SC single enq/dequeue: 42
MP/MC single enq/dequeue: 291
SP/SC burst enq/dequeue (size: 8): 12
MP/MC burst enq/dequeue (size: 8): 44
SP/SC burst enq/dequeue (size: 32): 8
MP/MC burst enq/dequeue (size: 32): 16

### Testing empty dequeue ###
SC empty dequeue: 13.01
MC empty dequeue: 15.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 11.60
MP/MC bulk enq/dequeue (size: 8): 44.32
SP/SC bulk enq/dequeue (size: 32): 8.60
MP/MC bulk enq/dequeue (size: 32): 16.50

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 20.95
MP/MC bulk enq/dequeue (size: 8): 56.90
SP/SC bulk enq/dequeue (size: 32): 10.90
MP/MC bulk enq/dequeue (size: 32): 18.78
Test OK
RTE>>

# c11 memory model patch (non-OOO arm64 machine)
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
--------------------------------------------------------------------------------------------- ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 197 MP/MC single enq/dequeue: 328 SP/SC burst enq/dequeue (size: 8): 31 MP/MC burst enq/dequeue (size: 8): 50 SP/SC burst enq/dequeue (size: 32): 13 MP/MC burst enq/dequeue (size: 32): 18 ### Testing empty dequeue ### SC empty dequeue: 13.01 MC empty dequeue: 18.02 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 30.95 MP/MC bulk enq/dequeue (size: 8): 50.30 SP/SC bulk enq/dequeue (size: 32): 13.27 MP/MC bulk enq/dequeue (size: 32): 18.11 ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 43.38 MP/MC bulk enq/dequeue (size: 8): 64.42 SP/SC bulk enq/dequeue (size: 32): 16.71 MP/MC bulk enq/dequeue (size: 32): 22.21 Platform B: Performance numbers: ============================== #no patch(OOO arm64 machine) ---------------------------- ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 81 MP/MC single enq/dequeue: 207 SP/SC burst enq/dequeue (size: 8): 15 MP/MC burst enq/dequeue (size: 8): 31 SP/SC burst enq/dequeue (size: 32): 7 MP/MC burst enq/dequeue (size: 32): 11 ### Testing empty dequeue ### SC empty dequeue: 3.00 MC empty dequeue: 5.00 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 15.38 MP/MC bulk enq/dequeue (size: 8): 30.64 SP/SC bulk enq/dequeue (size: 32): 7.25 MP/MC bulk enq/dequeue (size: 32): 11.06 ### Testing using two hyperthreads ### SP/SC bulk enq/dequeue (size: 8): 31.51 MP/MC bulk enq/dequeue (size: 8): 49.38 SP/SC bulk enq/dequeue (size: 32): 14.32 MP/MC bulk enq/dequeue (size: 32): 15.89 ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 72.66 MP/MC bulk enq/dequeue (size: 8): 121.89 SP/SC bulk enq/dequeue (size: 32): 16.88 MP/MC bulk enq/dequeue (size: 32): 24.23 Test OK RTE>> # smp_rmb() patch((OOO arm64 machine) http://dpdk.org/dev/patchwork/patch/30029/ ------------------------------------------- ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 152 MP/MC single enq/dequeue: 265 SP/SC burst enq/dequeue (size: 8): 24 MP/MC burst enq/dequeue (size: 8): 39 SP/SC burst enq/dequeue (size: 32): 9 MP/MC burst enq/dequeue (size: 32): 13 ### Testing empty dequeue ### SC empty dequeue: 31.01 MC empty dequeue: 32.01 ### Testing using a single lcore ### SP/SC bulk enq/dequeue (size: 8): 24.26 MP/MC bulk enq/dequeue (size: 8): 39.52 SP/SC bulk enq/dequeue (size: 32): 9.47 MP/MC bulk enq/dequeue (size: 32): 13.31 ### Testing using two hyperthreads ### SP/SC bulk enq/dequeue (size: 8): 40.29 MP/MC bulk enq/dequeue (size: 8): 59.57 SP/SC bulk enq/dequeue (size: 32): 17.34 MP/MC bulk enq/dequeue (size: 32): 21.58 ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 79.05 MP/MC bulk enq/dequeue (size: 8): 153.46 SP/SC bulk enq/dequeue (size: 32): 26.41 MP/MC bulk enq/dequeue (size: 32): 38.37 Test OK RTE>> # c11 memory model patch((OOO arm64 machine) https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch ---------------------------------------------------------------------------------------------- ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 98 MP/MC single enq/dequeue: 130 SP/SC burst enq/dequeue (size: 8): 18 MP/MC burst enq/dequeue (size: 8): 22 SP/SC burst enq/dequeue (size: 32): 7 MP/MC burst enq/dequeue (size: 32): 9 ### Testing empty dequeue ### SC empty dequeue: 4.00 MC empty 
dequeue: 5.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.40
MP/MC bulk enq/dequeue (size: 8): 22.88
SP/SC bulk enq/dequeue (size: 32): 7.62
MP/MC bulk enq/dequeue (size: 32): 8.96

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 20.24
MP/MC bulk enq/dequeue (size: 8): 25.83
SP/SC bulk enq/dequeue (size: 32): 12.21
MP/MC bulk enq/dequeue (size: 32): 13.20

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 67.54
MP/MC bulk enq/dequeue (size: 8): 124.63
SP/SC bulk enq/dequeue (size: 32): 21.13
MP/MC bulk enq/dequeue (size: 32): 28.44
Test OK
RTE>>quit

> or find some other performance test case to compare the performance
> impact?

As far as I know, ring_perf_autotest is the better performance test. If
you have trouble using the "high-resolution cycle counter" on your
platform, then you can also use ring_perf_autotest to compare the
performance (as the relative numbers matter).

Jerin

> Thanks for any suggestions.
>
> Cheers,
> Jia
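(One possible shape for the "both options under conditional compilation
flags" idea above: a sketch only, with a made-up flag name, confining
the ordering choice to one small helper instead of scattering #ifdefs
through the hot path:)

#include <stdint.h>
#include <rte_atomic.h>
#include <rte_ring.h>

/* RTE_RING_USE_C11_MEM_MODEL is a hypothetical build flag here */
static __rte_always_inline void
ring_snapshot_cons(const struct rte_ring *r,
		uint32_t *old_head, uint32_t *prod_tail)
{
#ifdef RTE_RING_USE_C11_MEM_MODEL
	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
	*prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);
#else
	*old_head = r->cons.head;
	rte_smp_rmb();	/* keep the prod.tail load after cons.head */
	*prod_tail = r->prod.tail;
#endif
}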
Hi Jerin

Thanks for your suggestions. I will try to use a config macro to let it
be chosen by the user.

I need to point out one possible issue in your load_acq/store_rel patch
at https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch

@@ -516,8 +541,13 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
                /* Restore n as it may change every loop */
                n = max;

+#if 0
                *old_head = r->cons.head;
                const uint32_t prod_tail = r->prod.tail;
+#else
+               *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);   --[1]
+               const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);   --[2]
+#endif

Line [1]: __ATOMIC_RELAXED is not enough for this case (tested on our
ARM64 server).

Line [2]: __ATOMIC_ACQUIRE guarantees that the 2nd load will not be
reordered before the 1st load, but it does not guarantee that the 1st
load will not be reordered after the 2nd load. Please also refer to the
freebsd implementation you mentioned: it uses __ATOMIC_ACQUIRE at
line [1].

Should it be like this instead?

+#else
+               *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
+               const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);

Cheers,
Jia

On 10/31/2017 7:14 PM, Jerin Jacob Wrote:
> [...]
Hi Jerin

On 10/31/2017 7:14 PM, Jerin Jacob Wrote:
> [...]
> I did a quick prototype using the C11 memory model (ACQUIRE/RELEASE)
> semantics and tested it on two arm64 platforms in Cavium (Platform A: a
> non-OOO arm64 machine, and Platform B: an OOO arm64 machine).

Can you elaborate on your non-OOO arm64 machine? As far as I know, all
arm64 servers are strongly memory ordered. Am I missing anything?
-----Original Message-----
> Date: Wed, 1 Nov 2017 10:53:12 +0800
> From: Jia He <hejianet@gmail.com>
> [...]
>
> Hi Jerin
>
> Thanks for your suggestions. I will try to use a config macro to let it
> be chosen by the user.

It is better if we avoid #ifdefs in the common code. I think you can
follow the scheme in examples/l3fwd/l3fwd_em_hlm_neon.h, where the
common code has all the generic routines and the differences are moved
to separate files to reduce the #ifdef clutter.

> I need to point out one possible issue in your load_acq/store_rel patch
> at https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
>
> @@ -516,8 +541,13 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>                 /* Restore n as it may change every loop */
>                 n = max;
>
> +#if 0
>                 *old_head = r->cons.head;
>                 const uint32_t prod_tail = r->prod.tail;
> +#else
> +               *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);   --[1]
> +               const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);   --[2]
> +#endif
>
> Line [1]: __ATOMIC_RELAXED is not enough for this case (tested on our
> ARM64 server).
>
> Line [2]: __ATOMIC_ACQUIRE guarantees that the 2nd load will not be
> reordered before the 1st load, but it does not guarantee that the 1st
> load will not be reordered after the 2nd load.

For me it looks the same. "[1] cannot cross [2]" is the same as "[2]
cannot cross [1]", if [1] and [2] are back to back. No?

> Please also refer to the freebsd implementation you mentioned: it uses
> __ATOMIC_ACQUIRE at line [1].

If I understand it correctly, in freebsd they are planning to revisit
the usage of the first __ATOMIC_ACQUIRE in FreeBSD 12:
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170

> Should it be like this instead?
>
> +#else
> +               *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
> +               const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);

It would be nice to see how much overhead it gives, i.e. back-to-back
__ATOMIC_ACQUIRE.

> [...]
-----Original Message-----
> Date: Wed, 1 Nov 2017 12:48:31 +0800
> From: Jia He <hejianet@gmail.com>
> [...]
>
> Hi Jerin
>
> On 10/31/2017 7:14 PM, Jerin Jacob Wrote:
>> [...]
>> I did a quick prototype using the C11 memory model (ACQUIRE/RELEASE)
>> semantics and tested it on two arm64 platforms in Cavium (Platform A:
>> a non-OOO arm64 machine, and Platform B: an OOO arm64 machine).
> Can you elaborate on your non-OOO arm64 machine? As far as I know, all
> arm64 servers are strongly memory ordered. Am I missing anything?

It is implementation defined. Low-end arm64 boxes (non-server) may not
have complete OOO support.
Hi Jerin

On 11/2/2017 3:04 AM, Jerin Jacob Wrote:
> [...]
>> I need to point out one possible issue in your load_acq/store_rel patch
>> at https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
>>
>> +               *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);   --[1]
>> +               const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);   --[2]
>>
>> Line [2]: __ATOMIC_ACQUIRE guarantees that the 2nd load will not be
>> reordered before the 1st load, but it does not guarantee that the 1st
>> load will not be reordered after the 2nd load.
> For me it looks the same. "[1] cannot cross [2]" is the same as "[2]
> cannot cross [1]", if [1] and [2] are back to back. No?

No. IIUC, load_acquire(a) is equal to the pseudo code:
    load(a)
    one-direction-barrier()
and not:
    one-direction-barrier()
    load(a)
Thus, in the case below, load(a) and load(b) can still be reordered;
this does not violate the semantics of load_acquire:
    load(a)
    load_acquire(b)
IIUC, the original implementation at
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
is not optimal. I tested the following change and it works fine:
+               *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
                const uint32_t prod_tail = r->prod.tail;
i.e.
    load_acquire(a)
    load(b)

Cheers,
Jia
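(Restating the point above as a compilable sketch; illustrative only,
not code from either patch. Acquire is a one-way barrier attached to
its own load:)

#include <stdint.h>

static volatile uint32_t cons_head, prod_tail;

/* Not sufficient: the ACQUIRE on [2] only keeps accesses that come
 * after [2] from moving up; the RELAXED load [1] may still execute
 * after [2] on an out-of-order core. */
static uint32_t
snapshot_relaxed_then_acquire(uint32_t *old_head)
{
	*old_head = __atomic_load_n(&cons_head, __ATOMIC_RELAXED); /* [1] */
	return __atomic_load_n(&prod_tail, __ATOMIC_ACQUIRE);      /* [2] */
}

/* Jia's variant: the ACQUIRE on [1] keeps every later access, the
 * prod_tail load included, after it. */
static uint32_t
snapshot_acquire_then_plain(uint32_t *old_head)
{
	*old_head = __atomic_load_n(&cons_head, __ATOMIC_ACQUIRE); /* [1] */
	return prod_tail;                                          /* [2] */
}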
Hi, Jerin

Please see my performance test below.

On 11/2/2017 3:04 AM, Jerin Jacob Wrote:
[...]
>> Should it be like this instead?
>>
>> +#else
>> +               *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
>> +               const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);
> It would be nice to see how much overhead it gives, i.e. back-to-back
> __ATOMIC_ACQUIRE.

I can NOT test ring_perf_autotest on our server because of something
wrong with the PMU counter: all return values of rte_rdtsc() are 0, with
and without your provided ko module. I am still investigating the
reason.
I did test the difference with my debug patch earlier; the difference is
minor, less than +/-1%.
Hi Jerin

On 11/2/2017 4:57 PM, Jia He Wrote:
> Hi, Jerin
>
> Please see my performance test below.
> On 11/2/2017 3:04 AM, Jerin Jacob Wrote:
> [...]
>> It would be nice to see how much overhead it gives, i.e. back-to-back
>> __ATOMIC_ACQUIRE.
> I can NOT test ring_perf_autotest on our server because of something
> wrong with the PMU counter: all return values of rte_rdtsc() are 0,
> with and without your provided ko module. I am still investigating the
> reason.

Hi Jerin

As for the root cause of the rte_rdtsc issue, it might be because the
PMU counter frequency is too low on our arm64 server ("Amberwing" from
Qualcomm):

[586990.057779] arch_timer_get_cntfrq()=20000000

Only 20 MHz instead of 100/200 MHz, and CNTFRQ_EL0 is not even writable
from kernel space.

Maybe the code in ring_perf_autotest needs to be changed? e.g.

        printf("SC empty dequeue: %.2F\n",
                        (double)(sc_end-sc_start) / iterations);
        printf("MC empty dequeue: %.2F\n",
                        (double)(mc_end-mc_start) / iterations);

Otherwise the result is always 0 when the small time difference is
divided by the iteration count.
-----Original Message-----
> Date: Fri, 3 Nov 2017 10:55:40 +0800
> From: Jia He <hejianet@gmail.com>
> [...]
>
> Hi Jerin
>
> As for the root cause of the rte_rdtsc issue, it might be because the
> PMU counter frequency is too low on our arm64 server ("Amberwing" from
> Qualcomm):
>
> [586990.057779] arch_timer_get_cntfrq()=20000000
>
> Only 20 MHz instead of 100/200 MHz, and CNTFRQ_EL0 is not even writable
> from kernel space.

That may not be true; I guess Linux 'perf' writes those registers in
kernel space. Another option could be to write them from ATF/the secure
bootloader, if that is the case.

> Maybe the code in ring_perf_autotest needs to be changed?

Increase the "iterations" count so the cost is measurable at 20 MHz.

> e.g.
>
>         printf("SC empty dequeue: %.2F\n",
>                         (double)(sc_end-sc_start) / iterations);
>         printf("MC empty dequeue: %.2F\n",
>                         (double)(mc_end-mc_start) / iterations);
>
> Otherwise the result is always 0 when the small time difference is
> divided by the iteration count.
>
> --
> Cheers,
> Jia
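(A sketch of the "increase the iterations" suggestion; names and counts
are illustrative, the real loop lives in test/test/test_ring_perf.c:)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <rte_cycles.h>
#include <rte_ring.h>

/* With a 20 MHz counter, a cheap op costs far less than one tick, so
 * (end - start) / iterations rounds to 0.00 unless the loop is long
 * enough to accumulate whole ticks. */
static void
measure_empty_dequeue(struct rte_ring *r)
{
	const unsigned int iterations = 1u << 26; /* much larger than stock */
	void *burst[8];
	unsigned int i;
	uint64_t start, end;

	start = rte_rdtsc();
	for (i = 0; i < iterations; i++)
		rte_ring_sc_dequeue_bulk(r, burst, 8, NULL);
	end = rte_rdtsc();

	printf("SC empty dequeue: %.4f ticks/op (%" PRIu64 " ticks total)\n",
			(double)(end - start) / iterations, end - start);
}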
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..15c72e2 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
 		n = max;
 
 		*old_head = r->prod.head;
+
+		/* load of cons.tail can't be reordered before prod.head */
+		rte_smp_rmb();
+
 		const uint32_t cons_tail = r->cons.tail;
 		/*
 		 * The subtraction is done between two unsigned 32bits value
@@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		n = max;
 
 		*old_head = r->cons.head;
+
+		/* load of prod.tail can't be reordered before cons.head */
+		rte_smp_rmb();
+
 		const uint32_t prod_tail = r->prod.tail;
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
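(On the cost of this fix on other architectures: rte_smp_rmb() is only a
compiler barrier on x86, where loads are already ordered against other
loads, while arm64 emits a real barrier. Roughly, paraphrasing the
per-arch headers rather than quoting them:)

#if defined(RTE_ARCH_X86)
/* loads are not reordered with other loads on x86 */
#define rte_smp_rmb() rte_compiler_barrier()
#elif defined(RTE_ARCH_ARM64)
/* inner-shareable load barrier */
#define rte_smp_rmb() asm volatile("dmb ishld" : : : "memory")
#endif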