[dpdk-dev] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

Message ID 20171010095636.4507-1-hejianet@gmail.com (mailing list archive)
State Superseded, archived

Checks

Context               Check    Description
ci/checkpatch         warning  coding style issues
ci/Intel-compilation  fail     Compilation issues

Commit Message

Jia He Oct. 10, 2017, 9:56 a.m. UTC
  Before this patch:
In __rte_ring_move_cons_head()
...
        do {
                /* Restore n as it may change every loop */
                n = max;

                *old_head = r->cons.head;                //1st load
                const uint32_t prod_tail = r->prod.tail; //2nd load

On weak memory order architectures (powerpc, arm), the 2nd load might be
reordered before the 1st load, which makes *entries bigger than we wanted.
This nasty reordering messes enqueue/dequeue up.

cpu1(producer)          cpu2(consumer)          cpu3(consumer)
                        load r->prod.tail
in enqueue:
load r->cons.tail
load r->prod.head

store r->prod.tail

                                                load r->cons.head
                                                load r->prod.tail
                                                ...
                                                store r->cons.{head,tail}
                        load r->cons.head

THEN, r->cons.head will be bigger than prod_tail, which makes *entries very big.

After this patch, the old cons.head will be recalculated after a failure of
rte_atomic32_cmpset.

There is no such issue on x86 CPUs, because x86 has a strong memory ordering model.
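
A tiny standalone sketch (not part of the patch) of the unsigned wraparound
behind "make *entries very big": the head-move helpers compute the available
entries with unsigned 32-bit arithmetic, so once a freshly advanced cons.head
is paired with a stale prod_tail the subtraction wraps around.

#include <stdint.h>
#include <stdio.h>

/* Illustration only: roughly what __rte_ring_move_cons_head() computes,
 * *entries = prod_tail - cons_head (unsigned 32-bit arithmetic).
 * If prod.tail is observed "too early" relative to cons.head, the
 * consumer can see cons_head > prod_tail and the result wraps. */
int main(void)
{
	uint32_t prod_tail = 100; /* stale value of r->prod.tail */
	uint32_t cons_head = 103; /* newer value of r->cons.head */
	uint32_t entries = prod_tail - cons_head;

	printf("entries = %u\n", entries); /* prints 4294967293 */
	return 0;
}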

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: jia.he@hxt-semitech.com
Signed-off-by: jie2.liu@hxt-semitech.com
Signed-off-by: bing.zhao@hxt-semitech.com

---
 lib/librte_ring/rte_ring.h | 8 ++++++++
 1 file changed, 8 insertions(+)
  

Comments

Olivier Matz Oct. 12, 2017, 3:53 p.m. UTC | #1
Hi,

On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> Before this patch:
> In __rte_ring_move_cons_head()
> ...
>         do {
>                 /* Restore n as it may change every loop */
>                 n = max;
> 
>                 *old_head = r->cons.head;                //1st load
>                 const uint32_t prod_tail = r->prod.tail; //2nd load
> 
> In weak memory order architectures(powerpc,arm), the 2nd load might be
> reodered before the 1st load, that makes *entries is bigger than we wanted.
> This nasty reording messed enque/deque up.
> 
> cpu1(producer)          cpu2(consumer)          cpu3(consumer)
>                         load r->prod.tail
> in enqueue:
> load r->cons.tail
> load r->prod.head
> 
> store r->prod.tail
> 
>                                                 load r->cons.head
>                                                 load r->prod.tail
>                                                 ...
>                                                 store r->cons.{head,tail}
>                         load r->cons.head
> 
> THEN,r->cons.head will be bigger than prod_tail, then make *entries very big
> 
> After this patch, the old cons.head will be recaculated after failure of
> rte_atomic32_cmpset
> 
> There is no such issue in X86 cpu, because X86 is strong memory order model
> 
> Signed-off-by: Jia He <hejianet@gmail.com>
> Signed-off-by: jia.he@hxt-semitech.com
> Signed-off-by: jie2.liu@hxt-semitech.com
> Signed-off-by: bing.zhao@hxt-semitech.com
> 
> ---
>  lib/librte_ring/rte_ring.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index 5e9b3b7..15c72e2 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>  		n = max;
>  
>  		*old_head = r->prod.head;
> +
> +		/* load of prod.tail can't be reordered before cons.head */
> +		rte_smp_rmb();
> +
>  		const uint32_t cons_tail = r->cons.tail;
>  		/*
>  		 *  The subtraction is done between two unsigned 32bits value
> @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		n = max;
>  
>  		*old_head = r->cons.head;
> +
> +		/* load of prod.tail can't be reordered before cons.head */
> +		rte_smp_rmb();
> +
>  		const uint32_t prod_tail = r->prod.tail;
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> -- 
> 2.7.4
> 

The explanation convinces me.

However, since it's in a critical path, it would be good to have other
opinions. This patch reminds me of a discussion that was also related to
a memory barrier, but in another place:
http://dpdk.org/ml/archives/dev/2016-July/043765.html
It led to this patch: http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
which was finally reverted: http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3

Konstantin, Jerin, do you have any comment?
  
Stephen Hemminger Oct. 12, 2017, 4:15 p.m. UTC | #2
On Thu, 12 Oct 2017 17:53:51 +0200
Olivier MATZ <olivier.matz@6wind.com> wrote:

> Hi,
> 
> On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> > Before this patch:
> > In __rte_ring_move_cons_head()
> > ...
> >         do {
> >                 /* Restore n as it may change every loop */
> >                 n = max;
> > 
> >                 *old_head = r->cons.head;                //1st load
> >                 const uint32_t prod_tail = r->prod.tail; //2nd load
> > 
> > In weak memory order architectures(powerpc,arm), the 2nd load might be
> > reodered before the 1st load, that makes *entries is bigger than we wanted.
> > This nasty reording messed enque/deque up.
> > 
> > cpu1(producer)          cpu2(consumer)          cpu3(consumer)
> >                         load r->prod.tail
> > in enqueue:
> > load r->cons.tail
> > load r->prod.head
> > 
> > store r->prod.tail
> > 
> >                                                 load r->cons.head
> >                                                 load r->prod.tail
> >                                                 ...
> >                                                 store r->cons.{head,tail}
> >                         load r->cons.head
> > 
> > THEN,r->cons.head will be bigger than prod_tail, then make *entries very big
> > 
> > After this patch, the old cons.head will be recaculated after failure of
> > rte_atomic32_cmpset
> > 
> > There is no such issue in X86 cpu, because X86 is strong memory order model
> > 
> > Signed-off-by: Jia He <hejianet@gmail.com>
> > Signed-off-by: jia.he@hxt-semitech.com
> > Signed-off-by: jie2.liu@hxt-semitech.com
> > Signed-off-by: bing.zhao@hxt-semitech.com
> > 
> > ---
> >  lib/librte_ring/rte_ring.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > index 5e9b3b7..15c72e2 100644
> > --- a/lib/librte_ring/rte_ring.h
> > +++ b/lib/librte_ring/rte_ring.h
> > @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >  		n = max;
> >  
> >  		*old_head = r->prod.head;
> > +
> > +		/* load of prod.tail can't be reordered before cons.head */
> > +		rte_smp_rmb();
> > +
> >  		const uint32_t cons_tail = r->cons.tail;
> >  		/*
> >  		 *  The subtraction is done between two unsigned 32bits value
> > @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  		n = max;
> >  
> >  		*old_head = r->cons.head;
> > +
> > +		/* load of prod.tail can't be reordered before cons.head */
> > +		rte_smp_rmb();
> > +
> >  		const uint32_t prod_tail = r->prod.tail;
> >  		/* The subtraction is done between two unsigned 32bits value
> >  		 * (the result is always modulo 32 bits even if we have
> > -- 
> > 2.7.4
> >   
> 
> The explanation convinces me.
> 
> However, since it's in a critical path, it would be good to have other
> opinions. This patch reminds me this discussion, that was also related to
> memory barrier, but at another place:
> http://dpdk.org/ml/archives/dev/2016-July/043765.html
> Lead to that patch: http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
> But finally reverted: http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3
> 
> Konstatin, Jerin, do you have any comment?

On strongly ordered architectures like x86, rte_smp_rmb() turns into compiler_barrier().
Compiler barriers generate no instructions; they just tell the compiler not to
do data-flow optimization across them. Since the pointers are mostly volatile,
it shouldn't even change the generated code.
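
For reference, a rough sketch of what the barrier boils down to on the two
architectures discussed here (approximate; the macro names below are
illustrative, and the authoritative definitions live in the per-arch
rte_atomic headers):

/* x86: loads are not reordered with other loads, so a compiler-only
 * barrier is enough -- it emits no instruction at all. */
#define smp_rmb_x86()   asm volatile("" ::: "memory")

/* arm64: a hardware load/load barrier is needed. */
#define smp_rmb_arm64() asm volatile("dmb ishld" ::: "memory")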
  
Ananyev, Konstantin Oct. 12, 2017, 5:05 p.m. UTC | #3
Hi guys,

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Thursday, October 12, 2017 4:54 PM
> To: Jia He <hejianet@gmail.com>
> Cc: dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com; bing.zhao@hxt-semitech.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerin.jacob@caviumnetworks.com
> Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
> 
> Hi,
> 
> On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> > Before this patch:
> > In __rte_ring_move_cons_head()
> > ...
> >         do {
> >                 /* Restore n as it may change every loop */
> >                 n = max;
> >
> >                 *old_head = r->cons.head;                //1st load
> >                 const uint32_t prod_tail = r->prod.tail; //2nd load
> >
> > In weak memory order architectures(powerpc,arm), the 2nd load might be
> > reodered before the 1st load, that makes *entries is bigger than we wanted.
> > This nasty reording messed enque/deque up.
> >
> > cpu1(producer)          cpu2(consumer)          cpu3(consumer)
> >                         load r->prod.tail
> > in enqueue:
> > load r->cons.tail
> > load r->prod.head
> >
> > store r->prod.tail
> >
> >                                                 load r->cons.head
> >                                                 load r->prod.tail
> >                                                 ...
> >                                                 store r->cons.{head,tail}
> >                         load r->cons.head
> >
> > THEN,r->cons.head will be bigger than prod_tail, then make *entries very big
> >
> > After this patch, the old cons.head will be recaculated after failure of
> > rte_atomic32_cmpset
> >
> > There is no such issue in X86 cpu, because X86 is strong memory order model
> >
> > Signed-off-by: Jia He <hejianet@gmail.com>
> > Signed-off-by: jia.he@hxt-semitech.com
> > Signed-off-by: jie2.liu@hxt-semitech.com
> > Signed-off-by: bing.zhao@hxt-semitech.com
> >
> > ---
> >  lib/librte_ring/rte_ring.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > index 5e9b3b7..15c72e2 100644
> > --- a/lib/librte_ring/rte_ring.h
> > +++ b/lib/librte_ring/rte_ring.h
> > @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >  		n = max;
> >
> >  		*old_head = r->prod.head;
> > +
> > +		/* load of prod.tail can't be reordered before cons.head */
> > +		rte_smp_rmb();
> > +
> >  		const uint32_t cons_tail = r->cons.tail;
> >  		/*
> >  		 *  The subtraction is done between two unsigned 32bits value
> > @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  		n = max;
> >
> >  		*old_head = r->cons.head;
> > +
> > +		/* load of prod.tail can't be reordered before cons.head */
> > +		rte_smp_rmb();
> > +
> >  		const uint32_t prod_tail = r->prod.tail;
> >  		/* The subtraction is done between two unsigned 32bits value
> >  		 * (the result is always modulo 32 bits even if we have
> > --
> > 2.7.4
> >
> 
> The explanation convinces me.
> 
> However, since it's in a critical path, it would be good to have other
> opinions. This patch reminds me this discussion, that was also related to
> memory barrier, but at another place:
> http://dpdk.org/ml/archives/dev/2016-July/043765.html
> Lead to that patch: http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
> But finally reverted: http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3
> 
> Konstatin, Jerin, do you have any comment?

For IA, as rte_smp_rmb() is just a compiler_barrier, that patch shouldn't make any difference,
but I can't see how read reordering would screw things up here...
Probably just me, but the arm or ppc guys could explain what the problem would be
if, let's say, cons.tail is read before prod.head in __rte_ring_move_prod_head().
I wonder, is there a simple test-case to reproduce that problem (on arm or ppc)?
Probably a new test-case for the rte_ring autotest is needed, or is it possible to reproduce
it with an existing one?
Konstantin
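
A minimal reproduction along those lines might look like the sketch below
(hypothetical code, not an existing rte_ring autotest; run with several lcores
on a weakly ordered CPU so the ring wraps frequently):

#include <stdint.h>
#include <stdio.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_launch.h>
#include <rte_memory.h>
#include <rte_ring.h>
#include <rte_atomic.h>

#define ITERS (1 << 20)

static struct rte_ring *r;
static rte_atomic64_t consumed;

/* each worker lcore dequeues until all enqueued objects are accounted for */
static int
consumer(void *arg)
{
	void *obj;

	(void)arg;
	while (rte_atomic64_read(&consumed) < ITERS)
		if (rte_ring_dequeue(r, &obj) == 0)
			rte_atomic64_inc(&consumed);
	return 0;
}

int
main(int argc, char **argv)
{
	unsigned int lcore;
	uintptr_t i;

	if (rte_eal_init(argc, argv) < 0)
		return -1;

	rte_atomic64_init(&consumed);
	/* small SP/MC ring so head/tail wrap around often */
	r = rte_ring_create("repro", 64, SOCKET_ID_ANY, RING_F_SP_ENQ);
	if (r == NULL)
		return -1;

	RTE_LCORE_FOREACH_SLAVE(lcore)	/* every other lcore is a consumer */
		rte_eal_remote_launch(consumer, NULL, lcore);

	for (i = 1; i <= ITERS; i++)	/* single producer on the master lcore */
		while (rte_ring_enqueue(r, (void *)i) != 0)
			;

	rte_eal_mp_wait_lcore();
	/* with the reordering bug, consumers can compute bogus entry counts and
	 * dequeue stale slots; with the barrier the count ends exactly at ITERS */
	printf("consumed %lld of %d\n",
	       (long long)rte_atomic64_read(&consumed), ITERS);
	return 0;
}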
  
Jerin Jacob Oct. 12, 2017, 5:23 p.m. UTC | #4
-----Original Message-----
> Date: Thu, 12 Oct 2017 17:05:50 +0000
> From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
> To: Olivier MATZ <olivier.matz@6wind.com>, Jia He <hejianet@gmail.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>  <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>  <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>  <bing.zhao@hxt-semitech.com>, "jerin.jacob@caviumnetworks.com"
>  <jerin.jacob@caviumnetworks.com>
> Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when
>  doing enqueue/dequeue
> 
> Hi guys,
> 
> > -----Original Message-----
> > From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> > Sent: Thursday, October 12, 2017 4:54 PM
> > To: Jia He <hejianet@gmail.com>
> > Cc: dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com; bing.zhao@hxt-semitech.com; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; jerin.jacob@caviumnetworks.com
> > Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
> > 
> > Hi,
> > 
> > On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> > > Before this patch:
> > > In __rte_ring_move_cons_head()
> > > ...
> > >         do {
> > >                 /* Restore n as it may change every loop */
> > >                 n = max;
> > >
> > >                 *old_head = r->cons.head;                //1st load
> > >                 const uint32_t prod_tail = r->prod.tail; //2nd load
> > >
> > > In weak memory order architectures(powerpc,arm), the 2nd load might be
> > > reodered before the 1st load, that makes *entries is bigger than we wanted.
> > > This nasty reording messed enque/deque up.
> > >
> > > cpu1(producer)          cpu2(consumer)          cpu3(consumer)
> > >                         load r->prod.tail
> > > in enqueue:
> > > load r->cons.tail
> > > load r->prod.head
> > >
> > > store r->prod.tail
> > >
> > >                                                 load r->cons.head
> > >                                                 load r->prod.tail
> > >                                                 ...
> > >                                                 store r->cons.{head,tail}
> > >                         load r->cons.head
> > >
> > > THEN,r->cons.head will be bigger than prod_tail, then make *entries very big
> > >
> > > After this patch, the old cons.head will be recaculated after failure of
> > > rte_atomic32_cmpset
> > >
> > > There is no such issue in X86 cpu, because X86 is strong memory order model
> > >
> > > Signed-off-by: Jia He <hejianet@gmail.com>
> > > Signed-off-by: jia.he@hxt-semitech.com
> > > Signed-off-by: jie2.liu@hxt-semitech.com
> > > Signed-off-by: bing.zhao@hxt-semitech.com
> > >
> > > ---
> > >  lib/librte_ring/rte_ring.h | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > > index 5e9b3b7..15c72e2 100644
> > > --- a/lib/librte_ring/rte_ring.h
> > > +++ b/lib/librte_ring/rte_ring.h
> > > @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> > >  		n = max;
> > >
> > >  		*old_head = r->prod.head;
> > > +
> > > +		/* load of prod.tail can't be reordered before cons.head */
> > > +		rte_smp_rmb();
> > > +
> > >  		const uint32_t cons_tail = r->cons.tail;
> > >  		/*
> > >  		 *  The subtraction is done between two unsigned 32bits value
> > > @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> > >  		n = max;
> > >
> > >  		*old_head = r->cons.head;
> > > +
> > > +		/* load of prod.tail can't be reordered before cons.head */
> > > +		rte_smp_rmb();
> > > +
> > >  		const uint32_t prod_tail = r->prod.tail;
> > >  		/* The subtraction is done between two unsigned 32bits value
> > >  		 * (the result is always modulo 32 bits even if we have
> > > --
> > > 2.7.4
> > >
> > 
> > The explanation convinces me.
> > 
> > However, since it's in a critical path, it would be good to have other
> > opinions. This patch reminds me this discussion, that was also related to
> > memory barrier, but at another place:
> > http://dpdk.org/ml/archives/dev/2016-July/043765.html
> > Lead to that patch: http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
> > But finally reverted: http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3
> > 
> > Konstatin, Jerin, do you have any comment?
> 
> For IA, as rte_smp_rmb() is just a compiler_barrier, that patch shouldn't make any difference,
> but  I can't see how read reordering would screw things up here...
> Probably just me and arm or ppc guys could explain what will be the problem
> if let say cons.tail will be read before prod.head in __rte_ring_move_prod_head().
> I wonder Is there a simple test-case to reproduce that problem (on arm or ppc)?
> Probably new test-case for rte_ring autotest is needed, or is it possible to reproduce
> it with existing one?  

On the same lines,

Jia He, jie2.liu, bing.zhao,

Is this patch based on code review, or did you see this issue on any of the
arm/ppc targets? arm64 will have a performance impact with this change.



> Konstantin
  
Liu, Jie2 Oct. 13, 2017, 12:24 a.m. UTC | #5
Hi guys,
We found this issue when we ran mbuf_autotest. It failed on an aarch64 platform. I am not sure if it can be reproduced on other platforms.
Regards,
Jie Liu

-----Original Message-----
From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
Sent: 13 October 2017 1:06
To: Olivier MATZ <olivier.matz@6wind.com>; Jia He <hejianet@gmail.com>
Cc: dev@dpdk.org; He, Jia <jia.he@hxt-semitech.com>; Liu, Jie2 <jie2.liu@hxt-semitech.com>; Zhao, Bing <bing.zhao@hxt-semitech.com>; jerin.jacob@caviumnetworks.com
Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

Hi guys,

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Thursday, October 12, 2017 4:54 PM
> To: Jia He <hejianet@gmail.com>
> Cc: dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com;
> bing.zhao@hxt-semitech.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerin.jacob@caviumnetworks.com
> Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading
> when doing enqueue/dequeue
>
> Hi,
>
> On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> > Before this patch:
> > In __rte_ring_move_cons_head()
> > ...
> >         do {
> >                 /* Restore n as it may change every loop */
> >                 n = max;
> >
> >                 *old_head = r->cons.head;                //1st load
> >                 const uint32_t prod_tail = r->prod.tail; //2nd load
> >
> > In weak memory order architectures(powerpc,arm), the 2nd load might
> > be reodered before the 1st load, that makes *entries is bigger than we wanted.
> > This nasty reording messed enque/deque up.
> >
> > cpu1(producer)          cpu2(consumer)          cpu3(consumer)
> >                         load r->prod.tail
> > in enqueue:
> > load r->cons.tail
> > load r->prod.head
> >
> > store r->prod.tail
> >
> >                                                 load r->cons.head
> >                                                 load r->prod.tail
> >                                                 ...
> >                                                 store r->cons.{head,tail}
> >                         load r->cons.head
> >
> > THEN,r->cons.head will be bigger than prod_tail, then make *entries
> > very big
> >
> > After this patch, the old cons.head will be recaculated after
> > failure of rte_atomic32_cmpset
> >
> > There is no such issue in X86 cpu, because X86 is strong memory
> > order model
> >
> > Signed-off-by: Jia He <hejianet@gmail.com>
> > Signed-off-by: jia.he@hxt-semitech.com
> > Signed-off-by: jie2.liu@hxt-semitech.com
> > Signed-off-by: bing.zhao@hxt-semitech.com
> >
> > ---
> >  lib/librte_ring/rte_ring.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > index 5e9b3b7..15c72e2 100644
> > --- a/lib/librte_ring/rte_ring.h
> > +++ b/lib/librte_ring/rte_ring.h
> > @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >  		n = max;
> >
> >  		*old_head = r->prod.head;
> > +
> > +		/* load of prod.tail can't be reordered before cons.head */
> > +		rte_smp_rmb();
> > +
> >  		const uint32_t cons_tail = r->cons.tail;
> >  		/*
> >  		 *  The subtraction is done between two unsigned 32bits value
> > @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  		n = max;
> >
> >  		*old_head = r->cons.head;
> > +
> > +		/* load of prod.tail can't be reordered before cons.head */
> > +		rte_smp_rmb();
> > +
> >  		const uint32_t prod_tail = r->prod.tail;
> >  		/* The subtraction is done between two unsigned 32bits value
> >  		 * (the result is always modulo 32 bits even if we have
> > --
> > 2.7.4
> >
>
> The explanation convinces me.
>
> However, since it's in a critical path, it would be good to have other
> opinions. This patch reminds me this discussion, that was also related
> to memory barrier, but at another place:
> http://dpdk.org/ml/archives/dev/2016-July/043765.html
> Lead to that patch:
> http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
> But finally reverted:
> http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3
>
> Konstatin, Jerin, do you have any comment?


For IA, as rte_smp_rmb() is just a compiler_barrier, that patch shouldn't make any difference, but  I can't see how read reordering would screw things up here...
Probably just me and arm or ppc guys could explain what will be the problem if let say cons.tail will be read before prod.head in __rte_ring_move_prod_head().
I wonder Is there a simple test-case to reproduce that problem (on arm or ppc)?
Probably new test-case for rte_ring autotest is needed, or is it possible to reproduce it with existing one?
Konstantin



  
Jia He Oct. 13, 2017, 1:02 a.m. UTC | #6
Hi Jerin


On 10/13/2017 1:23 AM, Jerin Jacob Wrote:
> -----Original Message-----
>> Date: Thu, 12 Oct 2017 17:05:50 +0000
>> From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
>> To: Olivier MATZ <olivier.matz@6wind.com>, Jia He <hejianet@gmail.com>
>> CC: "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>>   <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>>   <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>>   <bing.zhao@hxt-semitech.com>, "jerin.jacob@caviumnetworks.com"
>>   <jerin.jacob@caviumnetworks.com>
>> Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when
>>   doing enqueue/dequeue
>>
>> Hi guys,
>>
>>> -----Original Message-----
>>> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
>>> Sent: Thursday, October 12, 2017 4:54 PM
>>> To: Jia He <hejianet@gmail.com>
>>> Cc: dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com; bing.zhao@hxt-semitech.com; Ananyev, Konstantin
>>> <konstantin.ananyev@intel.com>; jerin.jacob@caviumnetworks.com
>>> Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
>>>
>>> Hi,
>>>
>>> On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
>>>> Before this patch:
>>>> In __rte_ring_move_cons_head()
>>>> ...
>>>>          do {
>>>>                  /* Restore n as it may change every loop */
>>>>                  n = max;
>>>>
>>>>                  *old_head = r->cons.head;                //1st load
>>>>                  const uint32_t prod_tail = r->prod.tail; //2nd load
>>>>
>>>> In weak memory order architectures(powerpc,arm), the 2nd load might be
>>>> reodered before the 1st load, that makes *entries is bigger than we wanted.
>>>> This nasty reording messed enque/deque up.
>>>>
>>>> cpu1(producer)          cpu2(consumer)          cpu3(consumer)
>>>>                          load r->prod.tail
>>>> in enqueue:
>>>> load r->cons.tail
>>>> load r->prod.head
>>>>
>>>> store r->prod.tail
>>>>
>>>>                                                  load r->cons.head
>>>>                                                  load r->prod.tail
>>>>                                                  ...
>>>>                                                  store r->cons.{head,tail}
>>>>                          load r->cons.head
>>>>
>>>> THEN,r->cons.head will be bigger than prod_tail, then make *entries very big
>>>>
>>>> After this patch, the old cons.head will be recaculated after failure of
>>>> rte_atomic32_cmpset
>>>>
>>>> There is no such issue in X86 cpu, because X86 is strong memory order model
>>>>
>>>> Signed-off-by: Jia He <hejianet@gmail.com>
>>>> Signed-off-by: jia.he@hxt-semitech.com
>>>> Signed-off-by: jie2.liu@hxt-semitech.com
>>>> Signed-off-by: bing.zhao@hxt-semitech.com
>>>>
>>>> ---
>>>>   lib/librte_ring/rte_ring.h | 8 ++++++++
>>>>   1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
>>>> index 5e9b3b7..15c72e2 100644
>>>> --- a/lib/librte_ring/rte_ring.h
>>>> +++ b/lib/librte_ring/rte_ring.h
>>>> @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>>>>   		n = max;
>>>>
>>>>   		*old_head = r->prod.head;
>>>> +
>>>> +		/* load of prod.tail can't be reordered before cons.head */
>>>> +		rte_smp_rmb();
>>>> +
>>>>   		const uint32_t cons_tail = r->cons.tail;
>>>>   		/*
>>>>   		 *  The subtraction is done between two unsigned 32bits value
>>>> @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>>>>   		n = max;
>>>>
>>>>   		*old_head = r->cons.head;
>>>> +
>>>> +		/* load of prod.tail can't be reordered before cons.head */
>>>> +		rte_smp_rmb();
>>>> +
>>>>   		const uint32_t prod_tail = r->prod.tail;
>>>>   		/* The subtraction is done between two unsigned 32bits value
>>>>   		 * (the result is always modulo 32 bits even if we have
>>>> --
>>>> 2.7.4
>>>>
>>> The explanation convinces me.
>>>
>>> However, since it's in a critical path, it would be good to have other
>>> opinions. This patch reminds me this discussion, that was also related to
>>> memory barrier, but at another place:
>>> http://dpdk.org/ml/archives/dev/2016-July/043765.html
>>> Lead to that patch: http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
>>> But finally reverted: http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3
>>>
>>> Konstatin, Jerin, do you have any comment?
>> For IA, as rte_smp_rmb() is just a compiler_barrier, that patch shouldn't make any difference,
>> but  I can't see how read reordering would screw things up here...
>> Probably just me and arm or ppc guys could explain what will be the problem
>> if let say cons.tail will be read before prod.head in __rte_ring_move_prod_head().
>> I wonder Is there a simple test-case to reproduce that problem (on arm or ppc)?
>> Probably new test-case for rte_ring autotest is needed, or is it possible to reproduce
>> it with existing one?
> On the same lines,
>
> Jia He, jie2.liu, bing.zhao,
>
> Is this patch based on code review or do you saw this issue on any of the
> arm/ppc target? arm64 will have performance impact with this change.
Based on mbuf_autotest, rte_panic will be invoked within seconds.

PANIC in test_refcnt_iter():
(lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
1: [./test(rte_dump_stack+0x38) [0x58d868]]
Aborted (core dumped)

Cheers,
Jia
>
>
>> Konstantin
  
Jia He Oct. 13, 2017, 1:15 a.m. UTC | #7
On 10/13/2017 9:02 AM, Jia He Wrote:
> Hi Jerin
>
>
> On 10/13/2017 1:23 AM, Jerin Jacob Wrote:
>> -----Original Message-----
>>> Date: Thu, 12 Oct 2017 17:05:50 +0000
>>>
[...]
>> On the same lines,
>>
>> Jia He, jie2.liu, bing.zhao,
>>
>> Is this patch based on code review or do you saw this issue on any of 
>> the
>> arm/ppc target? arm64 will have performance impact with this change.
sorry, I missed one important piece of information:
our platform is an aarch64 server with 46 CPUs.
If we reduce the number of CPUs involved, the bug occurs less frequently.

Yes, the mb barrier impacts the performance, but correctness is more
important, isn't it ;-)
Maybe we can find another, more lightweight barrier here?

Cheers,
Jia
> Based on mbuf_autotest, the rte_panic will be invoked in seconds.
>
> PANIC in test_refcnt_iter():
> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
> 1: [./test(rte_dump_stack+0x38) [0x58d868]]
> Aborted (core dumped)
>
> Cheers,
> Jia
>>
>>
>>> Konstantin
>
  
Jia He Oct. 13, 2017, 1:16 a.m. UTC | #8
Hi


On 10/13/2017 9:02 AM, Jia He Wrote:
> Hi Jerin
>
>
> On 10/13/2017 1:23 AM, Jerin Jacob Wrote:
>> -----Original Message-----
>>> Date: Thu, 12 Oct 2017 17:05:50 +0000
>>>
[...]
>> On the same lines,
>>
>> Jia He, jie2.liu, bing.zhao,
>>
>> Is this patch based on code review or do you saw this issue on any of 
>> the
>> arm/ppc target? arm64 will have performance impact with this change.
sorry, I missed one important piece of information:
our platform is an aarch64 server with 46 CPUs.
If we reduce the number of CPUs involved, the bug occurs less frequently.

Yes, the mb barrier impacts the performance, but correctness is more
important, isn't it ;-)
Maybe we can find another, more lightweight barrier here?

Cheers,
Jia
> Based on mbuf_autotest, the rte_panic will be invoked in seconds.
>
> PANIC in test_refcnt_iter():
> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
> 1: [./test(rte_dump_stack+0x38) [0x58d868]]
> Aborted (core dumped)
>
> Cheers,
> Jia
>>
>>
>>> Konstantin
>
  
Jerin Jacob Oct. 13, 2017, 1:49 a.m. UTC | #9
-----Original Message-----
> Date: Fri, 13 Oct 2017 09:16:31 +0800
> From: Jia He <hejianet@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>
> Cc: Olivier MATZ <olivier.matz@6wind.com>, "dev@dpdk.org" <dev@dpdk.org>,
>  "jia.he@hxt-semitech.com" <jia.he@hxt-semitech.com>,
>  "jie2.liu@hxt-semitech.com" <jie2.liu@hxt-semitech.com>,
>  "bing.zhao@hxt-semitech.com" <bing.zhao@hxt-semitech.com>
> Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when
>  doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.3.0
> 
> Hi
> 
> 
> On 10/13/2017 9:02 AM, Jia He Wrote:
> > Hi Jerin
> > 
> > 
> > On 10/13/2017 1:23 AM, Jerin Jacob Wrote:
> > > -----Original Message-----
> > > > Date: Thu, 12 Oct 2017 17:05:50 +0000
> > > > 
> [...]
> > > On the same lines,
> > > 
> > > Jia He, jie2.liu, bing.zhao,
> > > 
> > > Is this patch based on code review or do you saw this issue on any
> > > of the
> > > arm/ppc target? arm64 will have performance impact with this change.
> sorry, miss one important information
> Our platform is an aarch64 server with 46 cpus.

Is this an OOO (out-of-order execution) aarch64 CPU implementation?

> If we reduced the involved cpu numbers, the bug occurred less frequently.
> 
> Yes, mb barrier impact the performance, but correctness is more important,
> isn't it ;-)

Yes.

> Maybe we can  find any other lightweight barrier here?

Yes. Regarding a lightweight barrier, arm64 has native support for acquire and release
semantics, which are exposed through gcc as architecture-agnostic
functions:
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
http://preshing.com/20130922/acquire-and-release-fences/

Good to know:
1) How much overhead does this patch add on your platform? Just relative
numbers are enough.
2) As a prototype, does changing to acquire and release semantics
reduce the overhead on your platform?

For reference, a FreeBSD-/DPDK-style ring implementation using acquire
and release semantics:
https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c

I will also spend some cycles on this.
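
As a rough prototype of that idea (a sketch only, with simplified types, not
the patch under review), the two loads in __rte_ring_move_cons_head() could be
expressed with those builtins so that only the ordering that is actually
needed is paid for:

#include <stdint.h>

/* Sketch of the consumer-side head move with the gcc __atomic builtins
 * (field and variable names mirror rte_ring's, but are simplified). */
struct headtail { volatile uint32_t head, tail; };
struct ring { struct headtail prod, cons; };

static inline uint32_t
cons_head_entries(struct ring *r, uint32_t *old_head)
{
	/* Acquire on cons.head forbids the later prod.tail load from being
	 * hoisted above it; acquire on prod.tail also makes the producer's
	 * writes to the ring entries visible before we read them. */
	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
	uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);

	return prod_tail - *old_head; /* entries available to dequeue */
}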


> 
> Cheers,
> Jia
> > Based on mbuf_autotest, the rte_panic will be invoked in seconds.
> > 
> > PANIC in test_refcnt_iter():
> > (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
> > 1: [./test(rte_dump_stack+0x38) [0x58d868]]
> > Aborted (core dumped)
> > 
> > Cheers,
> > Jia
> > > 
> > > 
> > > > Konstantin
> > 
>
  
Zhao, Bing Oct. 13, 2017, 2:12 a.m. UTC | #10
Hi Guys,
If needed, we can provide a simplified UT test case and ring patch to reproduce this. We can then capture information to prove that, in an SP+MC scenario, even when the head and tail indexes are smaller than the ring size, some lcore thread makes a wrong judgement and moves the consumer head (CH) past the producer tail (PT).

Thanks

BR. Bing

-----Original Message-----
From: Liu, Jie2
Sent: 13 October 2017 8:25
To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Olivier MATZ <olivier.matz@6wind.com>; dev@dpdk.org; jerin.jacob@caviumnetworks.com
Cc: He, Jia <jia.he@hxt-semitech.com>; Zhao, Bing <bing.zhao@hxt-semitech.com>; Jia He <hejianet@gmail.com>
Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

Hi guys,
We found this issue when we ran mbuf_autotest. It failed on an aarch64 platform. I am not sure if it can be reproduced on other platforms.
Regards,
Jie Liu

-----Original Message-----
From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
Sent: 13 October 2017 1:06
To: Olivier MATZ <olivier.matz@6wind.com>; Jia He <hejianet@gmail.com>
Cc: dev@dpdk.org; He, Jia <jia.he@hxt-semitech.com>; Liu, Jie2 <jie2.liu@hxt-semitech.com>; Zhao, Bing <bing.zhao@hxt-semitech.com>; jerin.jacob@caviumnetworks.com
Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

Hi guys,

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Thursday, October 12, 2017 4:54 PM
> To: Jia He <hejianet@gmail.com>
> Cc: dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com;
> bing.zhao@hxt-semitech.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerin.jacob@caviumnetworks.com
> Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading
> when doing enqueue/dequeue
>
> Hi,
>
> On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> > Before this patch:
> > In __rte_ring_move_cons_head()
> > ...
> >         do {
> >                 /* Restore n as it may change every loop */
> >                 n = max;
> >
> >                 *old_head = r->cons.head;                //1st load
> >                 const uint32_t prod_tail = r->prod.tail; //2nd load
> >
> > In weak memory order architectures(powerpc,arm), the 2nd load might
> > be reodered before the 1st load, that makes *entries is bigger than we wanted.
> > This nasty reording messed enque/deque up.
> >
> > cpu1(producer)          cpu2(consumer)          cpu3(consumer)
> >                         load r->prod.tail
> > in enqueue:
> > load r->cons.tail
> > load r->prod.head
> >
> > store r->prod.tail
> >
> >                                                 load r->cons.head
> >                                                 load r->prod.tail
> >                                                 ...
> >                                                 store r->cons.{head,tail}
> >                         load r->cons.head
> >
> > THEN,r->cons.head will be bigger than prod_tail, then make *entries
> > very big
> >
> > After this patch, the old cons.head will be recaculated after
> > failure of rte_atomic32_cmpset
> >
> > There is no such issue in X86 cpu, because X86 is strong memory
> > order model
> >
> > Signed-off-by: Jia He <hejianet@gmail.com>
> > Signed-off-by: jia.he@hxt-semitech.com
> > Signed-off-by: jie2.liu@hxt-semitech.com
> > Signed-off-by: bing.zhao@hxt-semitech.com
> >
> > ---
> >  lib/librte_ring/rte_ring.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > index 5e9b3b7..15c72e2 100644
> > --- a/lib/librte_ring/rte_ring.h
> > +++ b/lib/librte_ring/rte_ring.h
> > @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >  		n = max;
> >
> >  		*old_head = r->prod.head;
> > +
> > +		/* load of prod.tail can't be reordered before cons.head */
> > +		rte_smp_rmb();
> > +
> >  		const uint32_t cons_tail = r->cons.tail;
> >  		/*
> >  		 *  The subtraction is done between two unsigned 32bits value
> > @@ -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> >  		n = max;
> >
> >  		*old_head = r->cons.head;
> > +
> > +		/* load of prod.tail can't be reordered before cons.head */
> > +		rte_smp_rmb();
> > +
> >  		const uint32_t prod_tail = r->prod.tail;
> >  		/* The subtraction is done between two unsigned 32bits value
> >  		 * (the result is always modulo 32 bits even if we have
> > --
> > 2.7.4
> >
>
> The explanation convinces me.
>
> However, since it's in a critical path, it would be good to have other
> opinions. This patch reminds me this discussion, that was also related
> to memory barrier, but at another place:
> http://dpdk.org/ml/archives/dev/2016-July/043765.html
> Lead to that patch:
> http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
> But finally reverted:
> http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3
>
> Konstatin, Jerin, do you have any comment?


For IA, as rte_smp_rmb() is just a compiler_barrier, that patch shouldn't make any difference, but  I can't see how read reordering would screw things up here...
Probably just me and arm or ppc guys could explain what will be the problem if let say cons.tail will be read before prod.head in __rte_ring_move_prod_head().
I wonder Is there a simple test-case to reproduce that problem (on arm or ppc)?
Probably new test-case for rte_ring autotest is needed, or is it possible to reproduce it with existing one?
Konstantin



  
Jerin Jacob Oct. 13, 2017, 2:34 a.m. UTC | #11
-----Original Message-----
> Date: Fri, 13 Oct 2017 02:12:58 +0000
> From: "Zhao, Bing" <bing.zhao@hxt-semitech.com>
> To: "Liu, Jie2" <jie2.liu@hxt-semitech.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>, Olivier MATZ <olivier.matz@6wind.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "jerin.jacob@caviumnetworks.com"
>  <jerin.jacob@caviumnetworks.com>
> CC: "He, Jia" <jia.he@hxt-semitech.com>, Jia He <hejianet@gmail.com>
> Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when
>  doing enqueue/dequeue
> 
> Hi Guys,

Hi Bing,

> If needed, we can provide a simplified UT test case and ring patch to reproduce this. And then we can catch the information to prove that in a SP+MC scenario, when the tail and head pointers' number are smaller than the ring size, some lcore thread will have a fault judge and then make the CH over the PT.

Yes. Please provide the simplified UT test case and ring patch.

> 
> Thanks
> 
> BR. Bing
> 
> -----Original Message-----
> From: Liu, Jie2
> Sent: 13 October 2017 8:25
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Olivier MATZ <olivier.matz@6wind.com>; dev@dpdk.org; jerin.jacob@caviumnetworks.com
> Cc: He, Jia <jia.he@hxt-semitech.com>; Zhao, Bing <bing.zhao@hxt-semitech.com>; Jia He <hejianet@gmail.com>
> Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
> 
> Hi guys,
> We found this issue when we run mbuf_autotest. It failed on a aarch64 platform. I am not sure if it can be reproduced on other platforms.
> Regards,
> Jie Liu
> 
> -----Original Message-----
> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> Sent: 13 October 2017 1:06
> To: Olivier MATZ <olivier.matz@6wind.com>; Jia He <hejianet@gmail.com>
> Cc: dev@dpdk.org; He, Jia <jia.he@hxt-semitech.com>; Liu, Jie2 <jie2.liu@hxt-semitech.com>; Zhao, Bing <bing.zhao@hxt-semitech.com>; jerin.jacob@caviumnetworks.com
> Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
> 
> Hi guys,
> 
> > -----Original Message-----
> > From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> > Sent: Thursday, October 12, 2017 4:54 PM
> > To: Jia He <hejianet@gmail.com>
> > Cc: dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com;
> > bing.zhao@hxt-semitech.com; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; jerin.jacob@caviumnetworks.com
> > Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading
> > when doing enqueue/dequeue
> >
> > Hi,
> >
> > On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> > > Before this patch:
> > > In __rte_ring_move_cons_head()
> > > ...
> > >         do {
> > >                 /* Restore n as it may change every loop */
> > >                 n = max;
> > >
> > >                 *old_head = r->cons.head;                //1st load
> > >                 const uint32_t prod_tail = r->prod.tail; //2nd load
> > >
> > > In weak memory order architectures(powerpc,arm), the 2nd load might
> > > be reodered before the 1st load, that makes *entries is bigger than we wanted.
> > > This nasty reording messed enque/deque up.
> > >
> > > cpu1(producer)          cpu2(consumer)          cpu3(consumer)
> > >                         load r->prod.tail in enqueue:
> > > load r->cons.tail
> > > load r->prod.head
> > >
> > > store r->prod.tail
> > >
> > >                                                 load r->cons.head
> > >                                                 load r->prod.tail
> > >                                                 ...
> > >                                                 store r->cons.{head,tail}
> > >                         load r->cons.head
> > >
> > > THEN,r->cons.head will be bigger than prod_tail, then make *entries
> > > very big
> > >
> > > After this patch, the old cons.head will be recaculated after
> > > failure of rte_atomic32_cmpset
> > >
> > > There is no such issue in X86 cpu, because X86 is strong memory
> > > order model
> > >
> > > Signed-off-by: Jia He <hejianet@gmail.com>
> > > Signed-off-by: jia.he@hxt-semitech.com
> > > Signed-off-by: jie2.liu@hxt-semitech.com
> > > Signed-off-by: bing.zhao@hxt-semitech.com
> > >
> > > ---
> > >  lib/librte_ring/rte_ring.h | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > > index 5e9b3b7..15c72e2 100644
> > > --- a/lib/librte_ring/rte_ring.h
> > > +++ b/lib/librte_ring/rte_ring.h
> > > @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> > >  n = max;
> > >
> > >  *old_head = r->prod.head;
> > > +
> > > +/* load of prod.tail can't be reordered before cons.head */
> > > +rte_smp_rmb();
> > > +
> > >  const uint32_t cons_tail = r->cons.tail;
> > >  /*
> > >   *  The subtraction is done between two unsigned 32bits value @@
> > > -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> > >  n = max;
> > >
> > >  *old_head = r->cons.head;
> > > +
> > > +/* load of prod.tail can't be reordered before cons.head */
> > > +rte_smp_rmb();
> > > +
> > >  const uint32_t prod_tail = r->prod.tail;
> > >  /* The subtraction is done between two unsigned 32bits value
> > >   * (the result is always modulo 32 bits even if we have
> > > --
> > > 2.7.4
> > >
> >
> > The explanation convinces me.
> >
> > However, since it's in a critical path, it would be good to have other
> > opinions. This patch reminds me this discussion, that was also related
> > to memory barrier, but at another place:
> > http://dpdk.org/ml/archives/dev/2016-July/043765.html
> > Lead to that patch:
> > http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
> > But finally reverted:
> > http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3
> >
> > Konstatin, Jerin, do you have any comment?
> 
> For IA, as rte_smp_rmb() is just a compiler_barrier, that patch shouldn't make any difference, but  I can't see how read reordering would screw things up here...
> Probably just me and arm or ppc guys could explain what will be the problem if let say cons.tail will be read before prod.head in __rte_ring_move_prod_head().
> I wonder Is there a simple test-case to reproduce that problem (on arm or ppc)?
> Probably new test-case for rte_ring autotest is needed, or is it possible to reproduce it with existing one?
> Konstantin
> 
> 
> 
> 
>
  
Jianbo Liu Oct. 13, 2017, 7:33 a.m. UTC | #12
The 10/13/2017 07:19, Jerin Jacob wrote:
> -----Original Message-----
> > Date: Fri, 13 Oct 2017 09:16:31 +0800
> > From: Jia He <hejianet@gmail.com>
> > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>, "Ananyev, Konstantin"
> >  <konstantin.ananyev@intel.com>
> > Cc: Olivier MATZ <olivier.matz@6wind.com>, "dev@dpdk.org" <dev@dpdk.org>,
> >  "jia.he@hxt-semitech.com" <jia.he@hxt-semitech.com>,
> >  "jie2.liu@hxt-semitech.com" <jie2.liu@hxt-semitech.com>,
> >  "bing.zhao@hxt-semitech.com" <bing.zhao@hxt-semitech.com>
> > Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when
> >  doing enqueue/dequeue
> > User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
> >  Thunderbird/52.3.0
> >
> > Hi
> >
> >
> > On 10/13/2017 9:02 AM, Jia He Wrote:
> > > Hi Jerin
> > >
> > >
> > > On 10/13/2017 1:23 AM, Jerin Jacob Wrote:
> > > > -----Original Message-----
> > > > > Date: Thu, 12 Oct 2017 17:05:50 +0000
> > > > >
> > [...]
> > > > On the same lines,
> > > >
> > > > Jia He, jie2.liu, bing.zhao,
> > > >
> > > > Is this patch based on code review or do you saw this issue on any
> > > > of the
> > > > arm/ppc target? arm64 will have performance impact with this change.
> > sorry, miss one important information
> > Our platform is an aarch64 server with 46 cpus.
>
> Is this an OOO(Out of order execution) aarch64 CPU implementation?
>
> > If we reduced the involved cpu numbers, the bug occurred less frequently.
> >
> > Yes, mb barrier impact the performance, but correctness is more important,
> > isn't it ;-)
>
> Yes.
>
> > Maybe we can  find any other lightweight barrier here?
>
> Yes, Regarding the lightweight barrier, arm64 has native support for acquire and release
> semantics, which is exposed through gcc as architecture agnostic
> functions.
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> http://preshing.com/20130922/acquire-and-release-fences/
>
> Good to know,
> 1) How much overhead this patch in your platform? Just relative
> numbers are enough
> 2) As a prototype, Is Changing to acquire and release schematics
> reduces the overhead in your platform?
>

+1, can you try what ODP does in the link mentioned below?

> Reference FreeBSD ring/DPDK style ring implementation through acquire
> and release schematics
> https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
>
> I will also spend on cycles on this.
>
>
> >
> > Cheers,
> > Jia
> > > Based on mbuf_autotest, the rte_panic will be invoked in seconds.
> > >
> > > PANIC in test_refcnt_iter():
> > > (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
> > > 1: [./test(rte_dump_stack+0x38) [0x58d868]]
> > > Aborted (core dumped)
> > >
> > > Cheers,
> > > Jia
> > > >
> > > >
> > > > > Konstantin
> > >
> >

  
Jia He Oct. 13, 2017, 8:20 a.m. UTC | #13
Hi


On 10/13/2017 3:33 PM, Jianbo Liu Wrote:
> The 10/13/2017 07:19, Jerin Jacob wrote:
>> -----Original Message-----
>>> Date: Fri, 13 Oct 2017 09:16:31 +0800
>>> From: Jia He <hejianet@gmail.com>
>>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>, "Ananyev, Konstantin"
>>>   <konstantin.ananyev@intel.com>
>>> Cc: Olivier MATZ <olivier.matz@6wind.com>, "dev@dpdk.org" <dev@dpdk.org>,
>>>   "jia.he@hxt-semitech.com" <jia.he@hxt-semitech.com>,
>>>   "jie2.liu@hxt-semitech.com" <jie2.liu@hxt-semitech.com>,
>>>   "bing.zhao@hxt-semitech.com" <bing.zhao@hxt-semitech.com>
>>> Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading when
>>>   doing enqueue/dequeue
>>> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>>>   Thunderbird/52.3.0
>>>
>>> Hi
>>>
>>>
>>> On 10/13/2017 9:02 AM, Jia He Wrote:
>>>> Hi Jerin
>>>>
>>>>
>>>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote:
>>>>> -----Original Message-----
>>>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000
>>>>>>
>>> [...]
>>>>> On the same lines,
>>>>>
>>>>> Jia He, jie2.liu, bing.zhao,
>>>>>
>>>>> Is this patch based on code review or do you saw this issue on any
>>>>> of the
>>>>> arm/ppc target? arm64 will have performance impact with this change.
>>> sorry, miss one important information
>>> Our platform is an aarch64 server with 46 cpus.
>> Is this an OOO(Out of order execution) aarch64 CPU implementation?
>>
>>> If we reduced the involved cpu numbers, the bug occurred less frequently.
>>>
>>> Yes, mb barrier impact the performance, but correctness is more important,
>>> isn't it ;-)
>> Yes.
>>
>>> Maybe we can  find any other lightweight barrier here?
>> Yes, Regarding the lightweight barrier, arm64 has native support for acquire and release
>> semantics, which is exposed through gcc as architecture agnostic
>> functions.
>> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
>> http://preshing.com/20130922/acquire-and-release-fences/
>>
>> Good to know,
>> 1) How much overhead this patch in your platform? Just relative
>> numbers are enough
>> 2) As a prototype, Is Changing to acquire and release schematics
>> reduces the overhead in your platform?
>>
> +1, can you try what ODP does in the link mentioned below?
Sure, please see the result:
root@server:~/odp/test/linux-generic/ring# ./ring_main
HW time counter freq: 20000000 hz

_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishm.c:880:_odp_ishm_reserve():No huge pages, fall back to normal 
pages. check: /proc/sys/vm/nr_hugepages.
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
  PKTIO: initialized loop interface.
  PKTIO: initialized ipc interface.
  PKTIO: initialized socket mmap, use export 
ODP_PKTIO_DISABLE_SOCKET_MMAP=1 to disable.
  PKTIO: initialized socket mmsg,use export 
ODP_PKTIO_DISABLE_SOCKET_MMSG=1 to disable.
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
         ODP API version: 1.15.0
         ODP implementation name:    "odp-linux"
         ODP implementation version: "odp-linux" 1.15.0-0 (v1.15.0) 1.15.0.0


      CUnit - A unit testing framework for C - Version 2.1-3
      http://cunit.sourceforge.net/


Suite: ring basic
   Test: ring_test_basic_create 
...pktio/ring.c:177:_ring_create():Requested size is invalid, must be 
power of 2,and do not exceed the size limit 268435455
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate memory
passed
   Test: ring_test_basic_burst ...passed
   Test: ring_test_basic_bulk ...passed
   Test: ring_test_basic_watermark 
...passed_ishmphy.c:152:_odp_ishmphy_map():mmap failed:Cannot allocate 
memory

Suite: ring stress
   Test: ring_test_stress_1_1_producer_consumer ...passed
   Test: ring_test_stress_1_N_producer_consumer ...passed
   Test: ring_test_stress_N_1_producer_consumer ...passed
   Test: ring_test_stress_N_M_producer_consumer ...

<the test case has hung here for half an hour>

Cheers,
Jia

>> Reference FreeBSD ring/DPDK style ring implementation through acquire
>> and release schematics
>> https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
>>
>> I will also spend on cycles on this.
>>
>>
>>> Cheers,
>>> Jia
>>>> Based on mbuf_autotest, the rte_panic will be invoked in seconds.
>>>>
>>>> PANIC in test_refcnt_iter():
>>>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
>>>> 1: [./test(rte_dump_stack+0x38) [0x58d868]]
>>>> Aborted (core dumped)
>>>>
>>>> Cheers,
>>>> Jia
>>>>>
>>>>>> Konstantin
>
  
Kuusisaari, Juhamatti (Infinera - FI/Espoo) Oct. 16, 2017, 10:51 a.m. UTC | #14
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Liu, Jie2
> Sent: Friday, October 13, 2017 3:25 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Olivier MATZ
> <olivier.matz@6wind.com>; dev@dpdk.org;
> jerin.jacob@caviumnetworks.com
> Cc: He, Jia <jia.he@hxt-semitech.com>; Zhao, Bing <bing.zhao@hxt-
> semitech.com>; Jia He <hejianet@gmail.com>
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
> loading when doing enqueue/dequeue
> 
> Hi guys,
> We found this issue when we run mbuf_autotest. It failed on a aarch64
> platform. I am not sure if it can be reproduced on other platforms.
> Regards,
> Jie Liu
> 
> -----Original Message-----
> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> Sent: 2017年10月13日 1:06
> To: Olivier MATZ <olivier.matz@6wind.com>; Jia He <hejianet@gmail.com>
> Cc: dev@dpdk.org; He, Jia <jia.he@hxt-semitech.com>; Liu, Jie2
> <jie2.liu@hxt-semitech.com>; Zhao, Bing <bing.zhao@hxt-semitech.com>;
> jerin.jacob@caviumnetworks.com
> Subject: RE: [PATCH] ring: guarantee ordering of cons/prod loading when
> doing enqueue/dequeue
> 
> Hi guys,
> 
> > -----Original Message-----
> > From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> > Sent: Thursday, October 12, 2017 4:54 PM
> > To: Jia He <hejianet@gmail.com>
> > Cc: dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com;
> > bing.zhao@hxt-semitech.com; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; jerin.jacob@caviumnetworks.com
> > Subject: Re: [PATCH] ring: guarantee ordering of cons/prod loading
> > when doing enqueue/dequeue
> >
> > Hi,
> >
> > On Tue, Oct 10, 2017 at 05:56:36PM +0800, Jia He wrote:
> > > Before this patch:
> > > In __rte_ring_move_cons_head()
> > > ...
> > >         do {
> > >                 /* Restore n as it may change every loop */
> > >                 n = max;
> > >
> > >                 *old_head = r->cons.head;                //1st load
> > >                 const uint32_t prod_tail = r->prod.tail; //2nd load
> > >
> > > In weak memory order architectures(powerpc,arm), the 2nd load might
> > > be reodered before the 1st load, that makes *entries is bigger than we
> wanted.
> > > This nasty reording messed enque/deque up.
> > >
> > > cpu1(producer)          cpu2(consumer)          cpu3(consumer)
> > >                         load r->prod.tail
> > > in enqueue:
> > > load r->cons.tail
> > > load r->prod.head
> > >
> > > store r->prod.tail
> > >
> > >                                                 load r->cons.head
> > >                                                 load r->prod.tail
> > >                                                 ...
> > >                                                 store r->cons.{head,tail}
> > >                         load r->cons.head
> > >
> > > THEN,r->cons.head will be bigger than prod_tail, then make *entries
> > > very big
> > >
> > > After this patch, the old cons.head will be recaculated after
> > > failure of rte_atomic32_cmpset
> > >
> > > There is no such issue in X86 cpu, because X86 is strong memory
> > > order model
> > >
> > > Signed-off-by: Jia He <hejianet@gmail.com>
> > > Signed-off-by: jia.he@hxt-semitech.com
> > > Signed-off-by: jie2.liu@hxt-semitech.com
> > > Signed-off-by: bing.zhao@hxt-semitech.com
> > >
> > > ---
> > >  lib/librte_ring/rte_ring.h | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > > index 5e9b3b7..15c72e2 100644
> > > --- a/lib/librte_ring/rte_ring.h
> > > +++ b/lib/librte_ring/rte_ring.h
> > > @@ -409,6 +409,10 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> > > int is_sp,  n = max;
> > >
> > >  *old_head = r->prod.head;
> > > +
> > > +/* load of prod.tail can't be reordered before cons.head */
> > > +rte_smp_rmb();
> > > +
> > >  const uint32_t cons_tail = r->cons.tail;
> > >  /*
> > >   *  The subtraction is done between two unsigned 32bits value @@
> > > -517,6 +521,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int
> > > is_sc,  n = max;
> > >
> > >  *old_head = r->cons.head;
> > > +
> > > +/* load of prod.tail can't be reordered before cons.head */
> > > +rte_smp_rmb();
> > > +
> > >  const uint32_t prod_tail = r->prod.tail;
> > >  /* The subtraction is done between two unsigned 32bits value
> > >   * (the result is always modulo 32 bits even if we have
> > > --
> > > 2.7.4
> > >
> >
> > The explanation convinces me.
> >
> > However, since it's in a critical path, it would be good to have other
> > opinions. This patch reminds me this discussion, that was also related
> > to memory barrier, but at another place:
> > http://dpdk.org/ml/archives/dev/2016-July/043765.html
> > Lead to that patch:
> > http://dpdk.org/browse/dpdk/commit/?id=ecc7d10e448e
> > But finally reverted:
> > http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3
> >
> > Konstatin, Jerin, do you have any comment?
> 
> For IA, as rte_smp_rmb() is just a compiler barrier, that patch shouldn't
> make any difference, but I can't see how read reordering would screw things
> up here...
> Probably it's just me, and the arm or ppc guys could explain what the problem
> will be if, let's say, cons.tail is read before prod.head in
> __rte_ring_move_prod_head().
> I wonder: is there a simple test case to reproduce that problem (on arm or
> ppc)?
> Probably a new test case for the rte_ring autotest is needed, or is it possible
> to reproduce it with an existing one?
> Konstantin


Hi,

I think this is a real problem here. We have fixed it so that we read both head and tail atomically in a single 64-bit load, as otherwise you may hit exactly the problem you are seeing. That works only on 64-bit, of course.

Thanks for pointing out the revert, I was unaware of it:
http://dpdk.org/browse/dpdk/commit/?id=c3acd92746c3

Nevertheless, the original fix was correct at a higher level; both PPC and ARM just happened to use a strong enough default read barrier, which already guaranteed sufficient ordering. As we are optimizing for cycles here, I am of course fine with the revert, as long as we all remember what was going on there.

BR,
--
 Juhamatti
  
Ananyev, Konstantin Oct. 19, 2017, 10:02 a.m. UTC | #15
Hi Jia,

> 

> Hi

> 

> 

> On 10/13/2017 9:02 AM, Jia He Wrote:

> > Hi Jerin

> >

> >

> > On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

> >> -----Original Message-----

> >>> Date: Thu, 12 Oct 2017 17:05:50 +0000

> >>>

> [...]

> >> On the same lines,

> >>

> >> Jia He, jie2.liu, bing.zhao,

> >>

> >> Is this patch based on code review or do you saw this issue on any of

> >> the

> >> arm/ppc target? arm64 will have performance impact with this change.

> sorry, miss one important information

> Our platform is an aarch64 server with 46 cpus.

> If we reduced the involved cpu numbers, the bug occurred less frequently.

> 

> Yes, mb barrier impact the performance, but correctness is more

> important, isn't it ;-)

> Maybe we can  find any other lightweight barrier here?

> 

> Cheers,

> Jia

> > Based on mbuf_autotest, the rte_panic will be invoked in seconds.

> >

> > PANIC in test_refcnt_iter():

> > (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free

> > 1: [./test(rte_dump_stack+0x38) [0x58d868]]

> > Aborted (core dumped)

> >


So is it only reproducible with mbuf refcnt test?
Could it be reproduced with some 'pure' ring test
(no mempools/mbufs refcnt, etc.)?
The reason I am asking - in that test we also have mbuf refcnt updates
(that is what the test was created for), and we do some optimizations there too
to avoid excessive atomic updates.
BTW, if the problem is not reproducible without mbuf refcnt,
can I suggest extending the test with:
  - a check that the enqueue() operation was successful
  - a walk through the pool that checks/prints the refcnt of each mbuf.
Hopefully that would give us some extra information about what is going wrong here.
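
For illustration, the two checks could look roughly like the sketch below
(it reuses the refcnt_mbuf_ring/refcnt_pool names from test_mbuf.c; the
dump_refcnt() helper is made up here, not existing test code):

/* 1) don't let a failed enqueue go unnoticed */
if (rte_ring_enqueue(refcnt_mbuf_ring, m) != 0)
        printf("enqueue failed on lcore %u\n", lcore);

/* 2) after the iteration, dump the refcnt of every mbuf in the pool */
static void
dump_refcnt(struct rte_mempool *mp, void *opaque, void *obj,
            unsigned int obj_idx)
{
        RTE_SET_USED(mp);
        RTE_SET_USED(opaque);
        printf("mbuf %u: refcnt=%u\n", obj_idx,
               rte_mbuf_refcnt_read((struct rte_mbuf *)obj));
}

rte_mempool_obj_iter(refcnt_pool, dump_refcnt, NULL);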
Konstantin
  
Bing Zhao Oct. 19, 2017, 11:18 a.m. UTC | #16
Hi,

On 2017/10/19 18:02, Ananyev, Konstantin wrote:
> 
> Hi Jia,
> 
>>
>> Hi
>>
>>
>> On 10/13/2017 9:02 AM, Jia He Wrote:
>>> Hi Jerin
>>>
>>>
>>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote:
>>>> -----Original Message-----
>>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000
>>>>>
>> [...]
>>>> On the same lines,
>>>>
>>>> Jia He, jie2.liu, bing.zhao,
>>>>
>>>> Is this patch based on code review or do you saw this issue on any of
>>>> the
>>>> arm/ppc target? arm64 will have performance impact with this change.
>> sorry, miss one important information
>> Our platform is an aarch64 server with 46 cpus.
>> If we reduced the involved cpu numbers, the bug occurred less frequently.
>>
>> Yes, mb barrier impact the performance, but correctness is more
>> important, isn't it ;-)
>> Maybe we can  find any other lightweight barrier here?
>>
>> Cheers,
>> Jia
>>> Based on mbuf_autotest, the rte_panic will be invoked in seconds.
>>>
>>> PANIC in test_refcnt_iter():
>>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
>>> 1: [./test(rte_dump_stack+0x38) [0x58d868]]
>>> Aborted (core dumped)
>>>
> 
> So is it only reproducible with mbuf refcnt test?
> Could it be reproduced with some 'pure' ring test
> (no mempools/mbufs refcnt, etc.)?
> The reason I am asking - in that test we also have mbuf refcnt updates
> (that's what for that test was created) and we are doing some optimizations here too
> to avoid excessive atomic updates.
> BTW, if the problem is not reproducible without mbuf refcnt,
> can I suggest to extend the test  with:
>    - add a check that enqueue() operation was successful
>    - walk through the pool and check/printf refcnt of each mbuf.
> Hopefully that would give us some extra information what is going wrong here.
> Konstantin
>    
> 
Currently, the issue has only been found in this case on the ARM
platform; we are not sure whether it also shows up on x86_64. In another
mail of this thread, we made a simple test based on this case, captured
some information and pasted it there (I pasted the patch there :-)). And
it seems that Juhamatti & Jacob found some related revert activity several
months ago.

BR. Bing
  
Ananyev, Konstantin Oct. 19, 2017, 2:15 p.m. UTC | #17
> 

> Hi,

> 

> On 2017/10/19 18:02, Ananyev, Konstantin wrote:

> >

> > Hi Jia,

> >

> >>

> >> Hi

> >>

> >>

> >> On 10/13/2017 9:02 AM, Jia He Wrote:

> >>> Hi Jerin

> >>>

> >>>

> >>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

> >>>> -----Original Message-----

> >>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000

> >>>>>

> >> [...]

> >>>> On the same lines,

> >>>>

> >>>> Jia He, jie2.liu, bing.zhao,

> >>>>

> >>>> Is this patch based on code review or do you saw this issue on any of

> >>>> the

> >>>> arm/ppc target? arm64 will have performance impact with this change.

> >> sorry, miss one important information

> >> Our platform is an aarch64 server with 46 cpus.

> >> If we reduced the involved cpu numbers, the bug occurred less frequently.

> >>

> >> Yes, mb barrier impact the performance, but correctness is more

> >> important, isn't it ;-)

> >> Maybe we can  find any other lightweight barrier here?

> >>

> >> Cheers,

> >> Jia

> >>> Based on mbuf_autotest, the rte_panic will be invoked in seconds.

> >>>

> >>> PANIC in test_refcnt_iter():

> >>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free

> >>> 1: [./test(rte_dump_stack+0x38) [0x58d868]]

> >>> Aborted (core dumped)

> >>>

> >

> > So is it only reproducible with mbuf refcnt test?

> > Could it be reproduced with some 'pure' ring test

> > (no mempools/mbufs refcnt, etc.)?

> > The reason I am asking - in that test we also have mbuf refcnt updates

> > (that's what for that test was created) and we are doing some optimizations here too

> > to avoid excessive atomic updates.

> > BTW, if the problem is not reproducible without mbuf refcnt,

> > can I suggest to extend the test  with:

> >    - add a check that enqueue() operation was successful

> >    - walk through the pool and check/printf refcnt of each mbuf.

> > Hopefully that would give us some extra information what is going wrong here.

> > Konstantin

> >

> >

> Currently, the issue is only found in this case here on the ARM

> platform, not sure how it is going with the X86_64 platform


I understand that it is only reproducible on arm so far.
What I am asking - with dpdk, is there any other way to reproduce it (on arm)
other than running mbuf_autotest?
Something really simple that does not use mbuf/mempool etc.?
Just do dequeue/enqueue from multiple threads and check data integrity at the end?
If not - what makes you think that the problem is precisely in the rte_ring code?
Why not in, let's say, rte_mbuf?

>. In another

> mail of this thread, we've made a simple test based on this and captured

> some information and I pasted there.(I pasted the patch there :-))


Are you talking about that one:
http://dpdk.org/dev/patchwork/patch/30405/
?
It still uses test/test/test_mbuf.c..., 
but anyway I don't really understand how mbuf_autotest is supposed 
to work with these changes:
@@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
rte_ring_enqueue(refcnt_mbuf_ring, m);
                         }
                 }
-               rte_pktmbuf_free(m);
+               // rte_pktmbuf_free(m);
         }
@@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
         while (!rte_ring_empty(refcnt_mbuf_ring))
                 ;

+       if (NULL != m) {
+               if (1 != rte_mbuf_refcnt_read(m))
+                       printf("m ref is %u\n", rte_mbuf_refcnt_read(m));
+               rte_pktmbuf_free(m);
+       }
+
         /* check that all mbufs are back into mempool by now */
         for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) {
                 if ((i = rte_mempool_avail_count(refcnt_pool)) == n) {

That means all your mbufs (except the last one) will still be allocated.
So the test would fail - as it should, I think.

> And

> it seems that Juhamatti & Jacod found some reverting action several

> months ago.


Didn't get that one either.
Konstantin
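
Such a 'pure' ring check could be as simple as the sketch below (every name
in it - g_ring, TOKENS_PER_LCORE, producer(), consumer() - is invented for
the example; ring creation, atomic init, lcore launching and the final
comparison of the two sums are left out):

#include <rte_ring.h>
#include <rte_atomic.h>
#include <rte_pause.h>
#include <rte_common.h>

#define TOKENS_PER_LCORE 100000

static struct rte_ring *g_ring;        /* MP/MC ring under test */
static rte_atomic64_t g_enq_sum;       /* sum of all enqueued tokens */
static rte_atomic64_t g_deq_sum;       /* sum of all dequeued tokens */
static rte_atomic32_t g_prod_done;     /* number of finished producers */
static unsigned int g_nb_producers;

static int
producer(void *arg)
{
        uintptr_t i;

        RTE_SET_USED(arg);
        for (i = 1; i <= TOKENS_PER_LCORE; i++) {
                while (rte_ring_enqueue(g_ring, (void *)i) != 0)
                        rte_pause();
                rte_atomic64_add(&g_enq_sum, (int64_t)i);
        }
        rte_atomic32_inc(&g_prod_done);
        return 0;
}

static int
consumer(void *arg)
{
        void *obj;

        RTE_SET_USED(arg);
        for (;;) {
                if (rte_ring_dequeue(g_ring, &obj) == 0) {
                        rte_atomic64_add(&g_deq_sum, (int64_t)(uintptr_t)obj);
                } else if (rte_atomic32_read(&g_prod_done) ==
                                (int32_t)g_nb_producers &&
                           rte_ring_empty(g_ring)) {
                        break;  /* producers finished and ring drained */
                }
        }
        return 0;
}

If the two sums ever differ, the ring itself (and not mbuf/mempool) is at
fault.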
  
Ananyev, Konstantin Oct. 19, 2017, 8:02 p.m. UTC | #18
> -----Original Message-----

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin

> Sent: Thursday, October 19, 2017 3:15 PM

> To: Zhao, Bing <ilovethull@163.com>; Jia He <hejianet@gmail.com>; Jerin Jacob <jerin.jacob@caviumnetworks.com>

> Cc: Olivier MATZ <olivier.matz@6wind.com>; dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com; bing.zhao@hxt-

> semitech.com

> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

> 

> >

> > Hi,

> >

> > On 2017/10/19 18:02, Ananyev, Konstantin wrote:

> > >

> > > Hi Jia,

> > >

> > >>

> > >> Hi

> > >>

> > >>

> > >> On 10/13/2017 9:02 AM, Jia He Wrote:

> > >>> Hi Jerin

> > >>>

> > >>>

> > >>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

> > >>>> -----Original Message-----

> > >>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000

> > >>>>>

> > >> [...]

> > >>>> On the same lines,

> > >>>>

> > >>>> Jia He, jie2.liu, bing.zhao,

> > >>>>

> > >>>> Is this patch based on code review or do you saw this issue on any of

> > >>>> the

> > >>>> arm/ppc target? arm64 will have performance impact with this change.

> > >> sorry, miss one important information

> > >> Our platform is an aarch64 server with 46 cpus.

> > >> If we reduced the involved cpu numbers, the bug occurred less frequently.

> > >>

> > >> Yes, mb barrier impact the performance, but correctness is more

> > >> important, isn't it ;-)

> > >> Maybe we can  find any other lightweight barrier here?

> > >>

> > >> Cheers,

> > >> Jia

> > >>> Based on mbuf_autotest, the rte_panic will be invoked in seconds.

> > >>>

> > >>> PANIC in test_refcnt_iter():

> > >>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free

> > >>> 1: [./test(rte_dump_stack+0x38) [0x58d868]]

> > >>> Aborted (core dumped)

> > >>>

> > >

> > > So is it only reproducible with mbuf refcnt test?

> > > Could it be reproduced with some 'pure' ring test

> > > (no mempools/mbufs refcnt, etc.)?

> > > The reason I am asking - in that test we also have mbuf refcnt updates

> > > (that's what for that test was created) and we are doing some optimizations here too

> > > to avoid excessive atomic updates.

> > > BTW, if the problem is not reproducible without mbuf refcnt,

> > > can I suggest to extend the test  with:

> > >    - add a check that enqueue() operation was successful

> > >    - walk through the pool and check/printf refcnt of each mbuf.

> > > Hopefully that would give us some extra information what is going wrong here.

> > > Konstantin

> > >

> > >

> > Currently, the issue is only found in this case here on the ARM

> > platform, not sure how it is going with the X86_64 platform

> 

> I understand that it is only reproducible on arm so far.

> What I am asking - with dpdk is there any other way to reproduce it (on arm)

> except then running mbuf_autotest?

> Something really simple that not using mbuf/mempool etc?

> Just do dequeue/enqueue from multiple threads and check data integrity at the end?

> If not  - what makes you think that the problem is precisely in rte_ring code?

> Why not in rte_mbuf let say?


Actually I reread your original mail and finally got your point.
If I understand you correctly, the problem with read reordering here is that
after we read prod.tail but before we read cons.head
both cons.head and prod.tail might be updated,
but the prod.tail change might go unnoticed by us.
As an example:
time 0 (cons.head == 0, prod.tail == 0):
prod_tail = r->prod.tail; /* due read reordering */
/* prod_tail == 0 */

 time 1 (cons.head ==5, prod.tail == 5):
*old_head = r->cons.head;
/* cons.head == 5 */
*entries = (prod_tail - *old_head);
/* *entries == (0 - 5) == 0xFFFFFFFB */

And we will move our cons.head forward, even though there are no filled entries in the ring.
Is that right?

As a side note, I wonder why we have here:
*entries = (prod_tail - *old_head);
instead of:
*entries = r->size + prod_tail - *old_head;
?

Konstantin
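
To make the arithmetic above concrete, a tiny standalone demo (plain C, not
DPDK code) of what the unsigned subtraction yields for that pair of
snapshots:

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
        uint32_t prod_tail = 0; /* stale value, read before the update */
        uint32_t old_head = 5;  /* fresh value, read after the update */
        uint32_t entries = prod_tail - old_head;

        /* prints: entries = 0xfffffffb (4294967291) */
        printf("entries = 0x%" PRIx32 " (%" PRIu32 ")\n", entries, entries);
        return 0;
}

So the *entries check passes with a huge bogus value and cons.head moves
forward over an empty ring.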

> 

> >. In another

> > mail of this thread, we've made a simple test based on this and captured

> > some information and I pasted there.(I pasted the patch there :-))

> 

> Are you talking about that one:

> http://dpdk.org/dev/patchwork/patch/30405/

> ?

> It still uses test/test/test_mbuf.c...,

> but anyway I don't really understand how mbuf_autotest supposed

> to work with these changes:

> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,

> rte_ring_enqueue(refcnt_mbuf_ring, m);

>                          }

>                  }

> -               rte_pktmbuf_free(m);

> +               // rte_pktmbuf_free(m);

>          }

> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,

>          while (!rte_ring_empty(refcnt_mbuf_ring))

>                  ;

> 

> +       if (NULL != m) {

> +               if (1 != rte_mbuf_refcnt_read(m))

> +                       printf("m ref is %u\n", rte_mbuf_refcnt_read(m));

> +               rte_pktmbuf_free(m);

> +       }

> +

>          /* check that all mbufs are back into mempool by now */

>          for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) {

>                  if ((i = rte_mempool_avail_count(refcnt_pool)) == n) {

> 

> That means all your mbufs (except the last one) will still be allocated.

> So the test would fail - as it should, I think.

> 

> > And

> > it seems that Juhamatti & Jacod found some reverting action several

> > months ago.

> 

> Didn't get that one either.

> Konstantin
  
Jia He Oct. 20, 2017, 1:57 a.m. UTC | #19
On 10/20/2017 4:02 AM, Ananyev, Konstantin Wrote:
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin
>> Sent: Thursday, October 19, 2017 3:15 PM
>> To: Zhao, Bing <ilovethull@163.com>; Jia He <hejianet@gmail.com>; Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> Cc: Olivier MATZ <olivier.matz@6wind.com>; dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com; bing.zhao@hxt-
>> semitech.com
>> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
>>
>>> Hi,
>>>
>>> On 2017/10/19 18:02, Ananyev, Konstantin wrote:
>>>> Hi Jia,
>>>>
>>>>> Hi
>>>>>
>>>>>
>>>>> On 10/13/2017 9:02 AM, Jia He Wrote:
>>>>>> Hi Jerin
>>>>>>
>>>>>>
>>>>>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote:
>>>>>>> -----Original Message-----
>>>>>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000
>>>>>>>>
>>>>> [...]
>>>>>>> On the same lines,
>>>>>>>
>>>>>>> Jia He, jie2.liu, bing.zhao,
>>>>>>>
>>>>>>> Is this patch based on code review or do you saw this issue on any of
>>>>>>> the
>>>>>>> arm/ppc target? arm64 will have performance impact with this change.
>>>>> sorry, miss one important information
>>>>> Our platform is an aarch64 server with 46 cpus.
>>>>> If we reduced the involved cpu numbers, the bug occurred less frequently.
>>>>>
>>>>> Yes, mb barrier impact the performance, but correctness is more
>>>>> important, isn't it ;-)
>>>>> Maybe we can  find any other lightweight barrier here?
>>>>>
>>>>> Cheers,
>>>>> Jia
>>>>>> Based on mbuf_autotest, the rte_panic will be invoked in seconds.
>>>>>>
>>>>>> PANIC in test_refcnt_iter():
>>>>>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
>>>>>> 1: [./test(rte_dump_stack+0x38) [0x58d868]]
>>>>>> Aborted (core dumped)
>>>>>>
>>>> So is it only reproducible with mbuf refcnt test?
>>>> Could it be reproduced with some 'pure' ring test
>>>> (no mempools/mbufs refcnt, etc.)?
>>>> The reason I am asking - in that test we also have mbuf refcnt updates
>>>> (that's what for that test was created) and we are doing some optimizations here too
>>>> to avoid excessive atomic updates.
>>>> BTW, if the problem is not reproducible without mbuf refcnt,
>>>> can I suggest to extend the test  with:
>>>>     - add a check that enqueue() operation was successful
>>>>     - walk through the pool and check/printf refcnt of each mbuf.
>>>> Hopefully that would give us some extra information what is going wrong here.
>>>> Konstantin
>>>>
>>>>
>>> Currently, the issue is only found in this case here on the ARM
>>> platform, not sure how it is going with the X86_64 platform
>> I understand that it is only reproducible on arm so far.
>> What I am asking - with dpdk is there any other way to reproduce it (on arm)
>> except then running mbuf_autotest?
>> Something really simple that not using mbuf/mempool etc?
>> Just do dequeue/enqueue from multiple threads and check data integrity at the end?
>> If not  - what makes you think that the problem is precisely in rte_ring code?
>> Why not in rte_mbuf let say?
> Actually I reread your original mail and finally get your point.
> If I understand you correctly the problem with read reordering here is that
> after we read prot.tail but before we read cons.head
> both cons.head and prod.tail might be updated,
> but for us prod.tail change might be unnoticed.
> As an example:
> time 0 (cons.head == 0, prod.tail == 0):
> prod_tail = r->prod.tail; /* due read reordering */
> /* prod_tail == 0 */
>
>   time 1 (cons.head ==5, prod.tail == 5):
> *old_head = r->cons.head;
> /* cons.head == 5 */
> *entries = (prod_tail - *old_head);
> /* *entries == (0 - 5) == 0xFFFFFFFB */
>
> And we will move our cons.head forward, even though there are no filled entries in the ring.
> Is that right?
Yes
> As I side notice, I wonder why we have here:
> *entries = (prod_tail - *old_head);
> instead of:
> *entries = r->size + prod_tail - *old_head;
> ?
Yes, I agree with you on this code line.
But reordering will still mess things up even after this change (I have 
tested it; the result is the same as before).
I thought the *entries value is a gate that prevents the consumer from moving 
forward faster than the producer.
In some cases it is correct for prod_tail to be smaller than 
*old_head, due to the circular queue.
In other cases, it is incorrect because of read/read reordering.

AFAICT, the root cause here is that the producer tail and consumer head are 
dependent on each other.
Thus a memory barrier is necessary.

Cheers,
Jia

>
> Konstantin
>
>>> . In another
>>> mail of this thread, we've made a simple test based on this and captured
>>> some information and I pasted there.(I pasted the patch there :-))
>> Are you talking about that one:
>> http://dpdk.org/dev/patchwork/patch/30405/
>> ?
>> It still uses test/test/test_mbuf.c...,
>> but anyway I don't really understand how mbuf_autotest supposed
>> to work with these changes:
>> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
>> rte_ring_enqueue(refcnt_mbuf_ring, m);
>>                           }
>>                   }
>> -               rte_pktmbuf_free(m);
>> +               // rte_pktmbuf_free(m);
>>           }
>> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
>>           while (!rte_ring_empty(refcnt_mbuf_ring))
>>                   ;
>>
>> +       if (NULL != m) {
>> +               if (1 != rte_mbuf_refcnt_read(m))
>> +                       printf("m ref is %u\n", rte_mbuf_refcnt_read(m));
>> +               rte_pktmbuf_free(m);
>> +       }
>> +
>>           /* check that all mbufs are back into mempool by now */
>>           for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) {
>>                   if ((i = rte_mempool_avail_count(refcnt_pool)) == n) {
>>
>> That means all your mbufs (except the last one) will still be allocated.
>> So the test would fail - as it should, I think.
>>
>>> And
>>> it seems that Juhamatti & Jacod found some reverting action several
>>> months ago.
>> Didn't get that one either.
>> Konstantin
  
Jerin Jacob Oct. 20, 2017, 5:43 a.m. UTC | #20
-----Original Message-----
> Date: Fri, 20 Oct 2017 09:57:58 +0800
> From: Jia He <hejianet@gmail.com>
> To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>  <ilovethull@163.com>, Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: Olivier MATZ <olivier.matz@6wind.com>, "dev@dpdk.org" <dev@dpdk.org>,
>  "jia.he@hxt-semitech.com" <jia.he@hxt-semitech.com>,
>  "jie2.liu@hxt-semitech.com" <jie2.liu@hxt-semitech.com>,
>  "bing.zhao@hxt-semitech.com" <bing.zhao@hxt-semitech.com>, "Richardson,
>  Bruce" <bruce.richardson@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.4.0
> 
> 
> 
> On 10/20/2017 4:02 AM, Ananyev, Konstantin Wrote:
> > 
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin
> > > Sent: Thursday, October 19, 2017 3:15 PM
> > > To: Zhao, Bing <ilovethull@163.com>; Jia He <hejianet@gmail.com>; Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > Cc: Olivier MATZ <olivier.matz@6wind.com>; dev@dpdk.org; jia.he@hxt-semitech.com; jie2.liu@hxt-semitech.com; bing.zhao@hxt-
> > > semitech.com
> > > Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
> > > 
> > > > Hi,
> > > > 
> > > > On 2017/10/19 18:02, Ananyev, Konstantin wrote:
> > > > > Hi Jia,
> > > > > 
> > > > > > Hi
> > > > > > 
> > > > > > 
> > > > > > On 10/13/2017 9:02 AM, Jia He Wrote:
> > > > > > > Hi Jerin
> > > > > > > 
> > > > > > > 
> > > > > > > On 10/13/2017 1:23 AM, Jerin Jacob Wrote:
> > > > > > > > -----Original Message-----
> > > > > > > > > Date: Thu, 12 Oct 2017 17:05:50 +0000
> > > > > > > > > 
> > > > > > [...]
> > > > > > > > On the same lines,
> > > > > > > > 
> > > > > > > > Jia He, jie2.liu, bing.zhao,
> > > > > > > > 
> > > > > > > > Is this patch based on code review or do you saw this issue on any of
> > > > > > > > the
> > > > > > > > arm/ppc target? arm64 will have performance impact with this change.
> > > > > > sorry, miss one important information
> > > > > > Our platform is an aarch64 server with 46 cpus.
> > > > > > If we reduced the involved cpu numbers, the bug occurred less frequently.
> > > > > > 
> > > > > > Yes, mb barrier impact the performance, but correctness is more
> > > > > > important, isn't it ;-)
> > > > > > Maybe we can  find any other lightweight barrier here?
> > > > > > 
> > > > > > Cheers,
> > > > > > Jia
> > > > > > > Based on mbuf_autotest, the rte_panic will be invoked in seconds.
> > > > > > > 
> > > > > > > PANIC in test_refcnt_iter():
> > > > > > > (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free
> > > > > > > 1: [./test(rte_dump_stack+0x38) [0x58d868]]
> > > > > > > Aborted (core dumped)
> > > > > > > 
> > > > > So is it only reproducible with mbuf refcnt test?
> > > > > Could it be reproduced with some 'pure' ring test
> > > > > (no mempools/mbufs refcnt, etc.)?
> > > > > The reason I am asking - in that test we also have mbuf refcnt updates
> > > > > (that's what for that test was created) and we are doing some optimizations here too
> > > > > to avoid excessive atomic updates.
> > > > > BTW, if the problem is not reproducible without mbuf refcnt,
> > > > > can I suggest to extend the test  with:
> > > > >     - add a check that enqueue() operation was successful
> > > > >     - walk through the pool and check/printf refcnt of each mbuf.
> > > > > Hopefully that would give us some extra information what is going wrong here.
> > > > > Konstantin
> > > > > 
> > > > > 
> > > > Currently, the issue is only found in this case here on the ARM
> > > > platform, not sure how it is going with the X86_64 platform
> > > I understand that it is only reproducible on arm so far.
> > > What I am asking - with dpdk is there any other way to reproduce it (on arm)
> > > except then running mbuf_autotest?
> > > Something really simple that not using mbuf/mempool etc?
> > > Just do dequeue/enqueue from multiple threads and check data integrity at the end?
> > > If not  - what makes you think that the problem is precisely in rte_ring code?
> > > Why not in rte_mbuf let say?
> > Actually I reread your original mail and finally get your point.
> > If I understand you correctly the problem with read reordering here is that
> > after we read prot.tail but before we read cons.head
> > both cons.head and prod.tail might be updated,
> > but for us prod.tail change might be unnoticed.
> > As an example:
> > time 0 (cons.head == 0, prod.tail == 0):
> > prod_tail = r->prod.tail; /* due read reordering */
> > /* prod_tail == 0 */
> > 
> >   time 1 (cons.head ==5, prod.tail == 5):
> > *old_head = r->cons.head;
> > /* cons.head == 5 */
> > *entries = (prod_tail - *old_head);
> > /* *entries == (0 - 5) == 0xFFFFFFFB */
> > 
> > And we will move our cons.head forward, even though there are no filled entries in the ring.
> > Is that right?
> Yes
> > As I side notice, I wonder why we have here:
> > *entries = (prod_tail - *old_head);
> > instead of:
> > *entries = r->size + prod_tail - *old_head;
> > ?
> Yes, I agree with you at this code line.
> But reordering will still mess up things even after this change(I have
> tested, still the same as before)
> I thought the *entries is a door to prevent consumer from moving forward too
> fast than the producer.
> But in some cases, it is correct that prod_tail is smaller than *old_head
> due to  the cirular queue.
> In other cases, it is incorrect because of read/read reordering.
> 
> AFAICT, the root cause here is the producer tail and cosumer head are
> dependant on each other.
> Thus a memory barrier is neccessary.

Yes. The barrier is necessary.
In fact, upstream FreeBSD fixed this issue for arm64. The DPDK ring
implementation is derived from FreeBSD's buf_ring.h.
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166

I think the only outstanding issue is how to reduce the performance
impact for arm64. I believe using acquire/release semantics instead
of rte_smp_rmb() will reduce the performance overhead, like the similar ring implementations below:
freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166 
odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c

Jia,
1) Can you verify that the use of acquire/release semantics fixes the problem on your
platform? e.g. the use of atomic_load_acq* in the reference code.
2) If so, what is the overhead between acquire/release and a plain rte_smp_rmb()
barrier? Based on that we need to decide which path to take.

Note:
This issue won't show up on every arm64 implementation; it shows up on arm64
implementations with OOO (out-of-order) execution.
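
For reference, a standalone sketch of what the acquire-load variant could
look like on top of the GCC/Clang __atomic builtins (atomic_load_acq_32()
is a FreeBSD kernel primitive, not a DPDK API; this is only an
illustration, not a patch against rte_ring.h):

#include <stdint.h>

struct headtail {
        volatile uint32_t head;
        volatile uint32_t tail;
};

static inline uint32_t
load_acq_u32(const volatile uint32_t *p)
{
        /* on arm64 this becomes an LDAR: later loads cannot be
         * reordered before it */
        return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

/* mirrors the two loads at the top of __rte_ring_move_cons_head() */
static inline uint32_t
ring_used_entries(const struct headtail *cons, const struct headtail *prod)
{
        uint32_t old_head = load_acq_u32(&cons->head);
        /* the load of prod->tail can no longer be hoisted above the
         * load of cons->head */
        uint32_t prod_tail = load_acq_u32(&prod->tail);

        return prod_tail - old_head;
}

Strictly, only the first load needs acquire semantics for the ordering
discussed in this thread; keeping it on both loads mirrors the FreeBSD
reference code.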


> 
> Cheers,
> Jia
> 
> > 
> > Konstantin
> > 
> > > > . In another
> > > > mail of this thread, we've made a simple test based on this and captured
> > > > some information and I pasted there.(I pasted the patch there :-))
> > > Are you talking about that one:
> > > http://dpdk.org/dev/patchwork/patch/30405/
> > > ?
> > > It still uses test/test/test_mbuf.c...,
> > > but anyway I don't really understand how mbuf_autotest supposed
> > > to work with these changes:
> > > @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
> > > rte_ring_enqueue(refcnt_mbuf_ring, m);
> > >                           }
> > >                   }
> > > -               rte_pktmbuf_free(m);
> > > +               // rte_pktmbuf_free(m);
> > >           }
> > > @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
> > >           while (!rte_ring_empty(refcnt_mbuf_ring))
> > >                   ;
> > > 
> > > +       if (NULL != m) {
> > > +               if (1 != rte_mbuf_refcnt_read(m))
> > > +                       printf("m ref is %u\n", rte_mbuf_refcnt_read(m));
> > > +               rte_pktmbuf_free(m);
> > > +       }
> > > +
> > >           /* check that all mbufs are back into mempool by now */
> > >           for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) {
> > >                   if ((i = rte_mempool_avail_count(refcnt_pool)) == n) {
> > > 
> > > That means all your mbufs (except the last one) will still be allocated.
> > > So the test would fail - as it should, I think.
> > > 
> > > > And
> > > > it seems that Juhamatti & Jacod found some reverting action several
> > > > months ago.
> > > Didn't get that one either.
> > > Konstantin
>
  
Ananyev, Konstantin Oct. 20, 2017, 7:03 a.m. UTC | #21
> >>

> >>> Hi,

> >>>

> >>> On 2017/10/19 18:02, Ananyev, Konstantin wrote:

> >>>> Hi Jia,

> >>>>

> >>>>> Hi

> >>>>>

> >>>>>

> >>>>> On 10/13/2017 9:02 AM, Jia He Wrote:

> >>>>>> Hi Jerin

> >>>>>>

> >>>>>>

> >>>>>> On 10/13/2017 1:23 AM, Jerin Jacob Wrote:

> >>>>>>> -----Original Message-----

> >>>>>>>> Date: Thu, 12 Oct 2017 17:05:50 +0000

> >>>>>>>>

> >>>>> [...]

> >>>>>>> On the same lines,

> >>>>>>>

> >>>>>>> Jia He, jie2.liu, bing.zhao,

> >>>>>>>

> >>>>>>> Is this patch based on code review or do you saw this issue on any of

> >>>>>>> the

> >>>>>>> arm/ppc target? arm64 will have performance impact with this change.

> >>>>> sorry, miss one important information

> >>>>> Our platform is an aarch64 server with 46 cpus.

> >>>>> If we reduced the involved cpu numbers, the bug occurred less frequently.

> >>>>>

> >>>>> Yes, mb barrier impact the performance, but correctness is more

> >>>>> important, isn't it ;-)

> >>>>> Maybe we can  find any other lightweight barrier here?

> >>>>>

> >>>>> Cheers,

> >>>>> Jia

> >>>>>> Based on mbuf_autotest, the rte_panic will be invoked in seconds.

> >>>>>>

> >>>>>> PANIC in test_refcnt_iter():

> >>>>>> (lcore=0, iter=0): after 10s only 61 of 64 mbufs left free

> >>>>>> 1: [./test(rte_dump_stack+0x38) [0x58d868]]

> >>>>>> Aborted (core dumped)

> >>>>>>

> >>>> So is it only reproducible with mbuf refcnt test?

> >>>> Could it be reproduced with some 'pure' ring test

> >>>> (no mempools/mbufs refcnt, etc.)?

> >>>> The reason I am asking - in that test we also have mbuf refcnt updates

> >>>> (that's what for that test was created) and we are doing some optimizations here too

> >>>> to avoid excessive atomic updates.

> >>>> BTW, if the problem is not reproducible without mbuf refcnt,

> >>>> can I suggest to extend the test  with:

> >>>>     - add a check that enqueue() operation was successful

> >>>>     - walk through the pool and check/printf refcnt of each mbuf.

> >>>> Hopefully that would give us some extra information what is going wrong here.

> >>>> Konstantin

> >>>>

> >>>>

> >>> Currently, the issue is only found in this case here on the ARM

> >>> platform, not sure how it is going with the X86_64 platform

> >> I understand that it is only reproducible on arm so far.

> >> What I am asking - with dpdk is there any other way to reproduce it (on arm)

> >> except then running mbuf_autotest?

> >> Something really simple that not using mbuf/mempool etc?

> >> Just do dequeue/enqueue from multiple threads and check data integrity at the end?

> >> If not  - what makes you think that the problem is precisely in rte_ring code?

> >> Why not in rte_mbuf let say?

> > Actually I reread your original mail and finally get your point.

> > If I understand you correctly the problem with read reordering here is that

> > after we read prot.tail but before we read cons.head

> > both cons.head and prod.tail might be updated,

> > but for us prod.tail change might be unnoticed.

> > As an example:

> > time 0 (cons.head == 0, prod.tail == 0):

> > prod_tail = r->prod.tail; /* due read reordering */

> > /* prod_tail == 0 */

> >

> >   time 1 (cons.head ==5, prod.tail == 5):

> > *old_head = r->cons.head;

> > /* cons.head == 5 */

> > *entries = (prod_tail - *old_head);

> > /* *entries == (0 - 5) == 0xFFFFFFFB */

> >

> > And we will move our cons.head forward, even though there are no filled entries in the ring.

> > Is that right?

> Yes


Ok, thanks for your patience - it took me a while to understand what
you guys are talking about here.

> > As I side notice, I wonder why we have here:

> > *entries = (prod_tail - *old_head);

> > instead of:

> > *entries = r->size + prod_tail - *old_head;

> > ?

> Yes, I agree with you at this code line.

> But reordering will still mess up things even after this change(I have

> tested, still the same as before)


Yes, I am convinced now that rmb is necessary here :)
As I said this is just a different thing I noticed while looking at the code.
We should probably discuss it in a different thread.
Konstantin 

> I thought the *entries is a door to prevent consumer from moving forward

> too fast than the producer.

> But in some cases, it is correct that prod_tail is smaller than

> *old_head due to  the cirular queue.

> In other cases, it is incorrect because of read/read reordering.

> 

> AFAICT, the root cause here is the producer tail and cosumer head are

> dependant on each other.

> Thus a memory barrier is neccessary.

> 

> Cheers,

> Jia

> 

> >

> > Konstantin

> >

> >>> . In another

> >>> mail of this thread, we've made a simple test based on this and captured

> >>> some information and I pasted there.(I pasted the patch there :-))

> >> Are you talking about that one:

> >> http://dpdk.org/dev/patchwork/patch/30405/

> >> ?

> >> It still uses test/test/test_mbuf.c...,

> >> but anyway I don't really understand how mbuf_autotest supposed

> >> to work with these changes:

> >> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,

> >> rte_ring_enqueue(refcnt_mbuf_ring, m);

> >>                           }

> >>                   }

> >> -               rte_pktmbuf_free(m);

> >> +               // rte_pktmbuf_free(m);

> >>           }

> >> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,

> >>           while (!rte_ring_empty(refcnt_mbuf_ring))

> >>                   ;

> >>

> >> +       if (NULL != m) {

> >> +               if (1 != rte_mbuf_refcnt_read(m))

> >> +                       printf("m ref is %u\n", rte_mbuf_refcnt_read(m));

> >> +               rte_pktmbuf_free(m);

> >> +       }

> >> +

> >>           /* check that all mbufs are back into mempool by now */

> >>           for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) {

> >>                   if ((i = rte_mempool_avail_count(refcnt_pool)) == n) {

> >>

> >> That means all your mbufs (except the last one) will still be allocated.

> >> So the test would fail - as it should, I think.

> >>

> >>> And

> >>> it seems that Juhamatti & Jacod found some reverting action several

> >>> months ago.

> >> Didn't get that one either.

> >> Konstantin
  
Jia He Oct. 23, 2017, 8:49 a.m. UTC | #22
Hi Jerin


On 10/20/2017 1:43 PM, Jerin Jacob Wrote:
> -----Original Message-----
>>
[...]
>> dependant on each other.
>> Thus a memory barrier is neccessary.
> Yes. The barrier is necessary.
> In fact, upstream freebsd fixed this issue for arm64. DPDK ring
> implementation is derived from freebsd's buf_ring.h.
> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
>
> I think, the only outstanding issue is, how to reduce the performance
> impact for arm64. I believe using accurate/release semantics instead
> of rte_smp_rmb() will reduce the performance overhead like similar ring implementations below,
> freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
>
> Jia,
> 1) Can you verify the use of accurate/release semantics fixes the problem in your
> platform? like use of atomic_load_acq* in the reference code.
> 2) If so, What is the overhead between accurate/release and plane smp_smb()
> barriers. Based on that we need decide what path to take.
I've tested 3 cases. The new 3rd case uses the load_acquire
barrier (half barrier) you mentioned at the link above.
The patch looks like this:
@@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
                 /* Reset n to the initial burst count */
                 n = max;

-               *old_head = r->prod.head;
-               const uint32_t cons_tail = r->cons.tail;
+               *old_head = atomic_load_acq_32(&r->prod.head);
+               const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);

@@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
                 /* Restore n as it may change every loop */
                 n = max;

-               *old_head = r->cons.head;
-               const uint32_t prod_tail = r->prod.tail;
+               *old_head = atomic_load_acq_32(&r->cons.head);
+               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);
                 /* The subtraction is done between two unsigned 32bits value
                  * (the result is always modulo 32 bits even if we have
                  * cons_head > prod_tail). So 'entries' is always between 0
                  * and size(ring)-1. */

The half barrier patch passed the functional test.

As for the performance comparison on *arm64* (the debug patch is at
http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see the
test results below:

[case 1] old code, no barrier
============================================
  Performance counter stats for './test --no-huge -l 1-10':

      689275.001200      task-clock (msec)         #    9.771 CPUs utilized
               6223      context-switches          #    0.009 K/sec
                 10      cpu-migrations            #    0.000 K/sec
                653      page-faults               #    0.001 K/sec
      1721190914583      cycles                    #    2.497 GHz
      3363238266430      instructions              #    1.95  insn per cycle
    <not supported> branches
           27804740      branch-misses             #    0.00% of all branches

       70.540618825 seconds time elapsed

[case 2] full barrier with rte_smp_rmb()
============================================
  Performance counter stats for './test --no-huge -l 1-10':

      582557.895850      task-clock (msec)         #    9.752 CPUs utilized
               5242      context-switches          #    0.009 K/sec
                 10      cpu-migrations            #    0.000 K/sec
                665      page-faults               #    0.001 K/sec
      1454360730055      cycles                    #    2.497 GHz
       587197839907      instructions              #    0.40  insn per cycle
    <not supported> branches
           27799687      branch-misses             #    0.00% of all branches

       59.735582356 seconds time elapsed

[case 3] half barrier with load_acquire
============================================
  Performance counter stats for './test --no-huge -l 1-10':

      660758.877050      task-clock (msec)         #    9.764 CPUs utilized
               5982      context-switches          #    0.009 K/sec
                 11      cpu-migrations            #    0.000 K/sec
                657      page-faults               #    0.001 K/sec
      1649875318044      cycles                    #    2.497 GHz
        591583257765      instructions              #    0.36  insn per cycle
    <not supported> branches
           27994903      branch-misses             #    0.00% of all branches

       67.672855107 seconds time elapsed

Please see the context-switches in the perf results.
The test results sorted by elapsed time are:
full barrier < half barrier < no barrier

AFAICT, in this case the cpu reordering adds the possibility of
context switching and
increases the running time.

Any ideas?

Cheers,
Jia

>
> Note:
> This issue wont come in all the arm64 implementation. it comes on arm64
> implementation with OOO(out of order) implementations.
>
>
>> Cheers,
>> Jia
>>
>>> Konstantin
>>>
>>>>> . In another
>>>>> mail of this thread, we've made a simple test based on this and captured
>>>>> some information and I pasted there.(I pasted the patch there :-))
>>>> Are you talking about that one:
>>>> http://dpdk.org/dev/patchwork/patch/30405/
>>>> ?
>>>> It still uses test/test/test_mbuf.c...,
>>>> but anyway I don't really understand how mbuf_autotest supposed
>>>> to work with these changes:
>>>> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
>>>> rte_ring_enqueue(refcnt_mbuf_ring, m);
>>>>                            }
>>>>                    }
>>>> -               rte_pktmbuf_free(m);
>>>> +               // rte_pktmbuf_free(m);
>>>>            }
>>>> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
>>>>            while (!rte_ring_empty(refcnt_mbuf_ring))
>>>>                    ;
>>>>
>>>> +       if (NULL != m) {
>>>> +               if (1 != rte_mbuf_refcnt_read(m))
>>>> +                       printf("m ref is %u\n", rte_mbuf_refcnt_read(m));
>>>> +               rte_pktmbuf_free(m);
>>>> +       }
>>>> +
>>>>            /* check that all mbufs are back into mempool by now */
>>>>            for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) {
>>>>                    if ((i = rte_mempool_avail_count(refcnt_pool)) == n) {
>>>>
>>>> That means all your mbufs (except the last one) will still be allocated.
>>>> So the test would fail - as it should, I think.
>>>>
>>>>> And
>>>>> it seems that Juhamatti & Jacod found some reverting action several
>>>>> months ago.
>>>> Didn't get that one either.
>>>> Konstantin
  
Kuusisaari, Juhamatti (Infinera - FI/Espoo) Oct. 23, 2017, 9:05 a.m. UTC | #23
Hi,

> On 10/20/2017 1:43 PM, Jerin Jacob Wrote:

> > -----Original Message-----

> >>

> [...]

> >> dependant on each other.

> >> Thus a memory barrier is neccessary.

> > Yes. The barrier is necessary.

> > In fact, upstream freebsd fixed this issue for arm64. DPDK ring

> > implementation is derived from freebsd's buf_ring.h.

> >

> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166

> >

> > I think, the only outstanding issue is, how to reduce the performance

> > impact for arm64. I believe using accurate/release semantics instead

> > of rte_smp_rmb() will reduce the performance overhead like similar

> > ring implementations below,

> > freebsd:

> >

> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166

> > odp:

> > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio

> > /ring.c

> >

> > Jia,

> > 1) Can you verify the use of accurate/release semantics fixes the

> > problem in your platform? like use of atomic_load_acq* in the reference

> code.

> > 2) If so, What is the overhead between accurate/release and plane

> > smp_smb() barriers. Based on that we need decide what path to take.

> I've tested 3 cases.  The new 3rd case is to use the load_acquire barrier (half

> barrier) you mentioned at above link.

> The patch seems like:

> @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int

> is_sp,

>                  /* Reset n to the initial burst count */

>                  n = max;

> 

> -               *old_head = r->prod.head;

> -               const uint32_t cons_tail = r->cons.tail;

> +               *old_head = atomic_load_acq_32(&r->prod.head);

> +               const uint32_t cons_tail =

> atomic_load_acq_32(&r->cons.tail);

> 

> @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int

> is_s

>                  /* Restore n as it may change every loop */

>                  n = max;

> 

> -               *old_head = r->cons.head;

> -               const uint32_t prod_tail = r->prod.tail;

> +               *old_head = atomic_load_acq_32(&r->cons.head);

> +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail)

>                  /* The subtraction is done between two unsigned 32bits

> value

>                   * (the result is always modulo 32 bits even if we have

>                   * cons_head > prod_tail). So 'entries' is always between 0

>                   * and size(ring)-1. */

>

> The half barrier patch passed the fuctional test.

> 

> As for the performance comparision on *arm64*(the debug patch is at

> http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see

> the test results

> below:

> 

> [case 1] old codes, no barrier

> ============================================

>   Performance counter stats for './test --no-huge -l 1-10':

> 

>       689275.001200      task-clock (msec)         #    9.771 CPUs utilized

>                6223      context-switches          #    0.009 K/sec

>                  10      cpu-migrations            #    0.000 K/sec

>                 653      page-faults               #    0.001 K/sec

>       1721190914583      cycles                    #    2.497 GHz

>       3363238266430      instructions              #    1.95  insn per

> cycle

>     <not supported> branches

>            27804740      branch-misses             #    0.00% of all

> branches

> 

>        70.540618825 seconds time elapsed

> 

> [case 2] full barrier with rte_smp_rmb()

> ============================================

>   Performance counter stats for './test --no-huge -l 1-10':

> 

>       582557.895850      task-clock (msec)         #    9.752 CPUs utilized

>                5242      context-switches          #    0.009 K/sec

>                  10      cpu-migrations            #    0.000 K/sec

>                 665      page-faults               #    0.001 K/sec

>       1454360730055      cycles                    #    2.497 GHz

>        587197839907      instructions              #    0.40  insn per

> cycle

>     <not supported> branches

>            27799687      branch-misses             #    0.00% of all

> branches

> 

>        59.735582356 seconds time elapse

> 

> [case 1] half barrier with load_acquire

> ============================================

>   Performance counter stats for './test --no-huge -l 1-10':

> 

>       660758.877050      task-clock (msec)         #    9.764 CPUs utilized

>                5982      context-switches          #    0.009 K/sec

>                  11      cpu-migrations            #    0.000 K/sec

>                 657      page-faults               #    0.001 K/sec

>       1649875318044      cycles                    #    2.497 GHz

>        591583257765      instructions              #    0.36  insn per

> cycle

>     <not supported> branches

>            27994903      branch-misses             #    0.00% of all

> branches

> 

>        67.672855107 seconds time elapsed

> 

> Please see the context-switches in the perf results

> test result  sorted by time is:

> full barrier < half barrier < no barrier

> 

> AFAICT, in this case ,the cpu reordering will add the possibility for

> context switching and

> increase the running time.

> 

> Any ideas?


You could also try a case where you rearrange the rte_ring structure so that prod.head and cons.tail are parts of a union with a 64-bit variable and then read that 64-bit variable with one atomic read. I do not think the half barrier is even needed with this approach, as long as you can really read the 64-bit variable fully at once. This should speed things up. 
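
To make the idea concrete, a rough sketch (illustration only - this is not
how librte_ring is laid out today, and the anonymous struct in the union
needs C11 or a GNU extension):

#include <stdint.h>

union ring_snapshot {
        uint64_t raw;
        struct {
                uint32_t cons_head;
                uint32_t prod_tail;
        };
};

static inline union ring_snapshot
read_ring_snapshot(const volatile uint64_t *p)
{
        union ring_snapshot s;

        /* one 64-bit load: both indexes come from the same instant,
         * so no head/tail read-ordering issue can arise between them */
        s.raw = __atomic_load_n(p, __ATOMIC_RELAXED);
        return s;
}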

Cheers,
--
 Juhamatti
 
> Cheers,

> Jia

> 

> >

> > Note:

> > This issue wont come in all the arm64 implementation. it comes on arm64

> > implementation with OOO(out of order) implementations.

> >

> >

> >> Cheers,

> >> Jia

> >>

> >>> Konstantin

> >>>

> >>>>> . In another

> >>>>> mail of this thread, we've made a simple test based on this and

> captured

> >>>>> some information and I pasted there.(I pasted the patch there :-))

> >>>> Are you talking about that one:

> >>>> http://dpdk.org/dev/patchwork/patch/30405/

> >>>> ?

> >>>> It still uses test/test/test_mbuf.c...,

> >>>> but anyway I don't really understand how mbuf_autotest supposed

> >>>> to work with these changes:

> >>>> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int

> iter,

> >>>> rte_ring_enqueue(refcnt_mbuf_ring, m);

> >>>>                            }

> >>>>                    }

> >>>> -               rte_pktmbuf_free(m);

> >>>> +               // rte_pktmbuf_free(m);

> >>>>            }

> >>>> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned

> int iter,

> >>>>            while (!rte_ring_empty(refcnt_mbuf_ring))

> >>>>                    ;

> >>>>

> >>>> +       if (NULL != m) {

> >>>> +               if (1 != rte_mbuf_refcnt_read(m))

> >>>> +                       printf("m ref is %u\n", rte_mbuf_refcnt_read(m));

> >>>> +               rte_pktmbuf_free(m);

> >>>> +       }

> >>>> +

> >>>>            /* check that all mbufs are back into mempool by now */

> >>>>            for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) {

> >>>>                    if ((i = rte_mempool_avail_count(refcnt_pool)) == n) {

> >>>>

> >>>> That means all your mbufs (except the last one) will still be allocated.

> >>>> So the test would fail - as it should, I think.

> >>>>

> >>>>> And

> >>>>> it seems that Juhamatti & Jacod found some reverting action several

> >>>>> months ago.

> >>>> Didn't get that one either.

> >>>> Konstantin

> 

> --

> Cheers,

> Jia
  
Bruce Richardson Oct. 23, 2017, 9:10 a.m. UTC | #24
On Mon, Oct 23, 2017 at 09:05:24AM +0000, Kuusisaari, Juhamatti wrote:
> 
> Hi,
> 
> > On 10/20/2017 1:43 PM, Jerin Jacob Wrote:
> > > -----Original Message-----
> > >>
> > [...]
> > >> dependant on each other.  Thus a memory barrier is neccessary.
> > > Yes. The barrier is necessary.  In fact, upstream freebsd fixed
> > > this issue for arm64. DPDK ring implementation is derived from
> > > freebsd's buf_ring.h.
> > >
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > >
> > > I think, the only outstanding issue is, how to reduce the
> > > performance impact for arm64. I believe using accurate/release
> > > semantics instead of rte_smp_rmb() will reduce the performance
> > > overhead like similar ring implementations below, freebsd:
> > >
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > > odp:
> > > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio
> > > /ring.c
> > >
> > > Jia, 1) Can you verify the use of accurate/release semantics fixes
> > > the problem in your platform? like use of atomic_load_acq* in the
> > > reference
> > code.
> > > 2) If so, What is the overhead between accurate/release and plane
> > > smp_smb() barriers. Based on that we need decide what path to
> > > take.
> > I've tested 3 cases.  The new 3rd case is to use the load_acquire
> > barrier (half barrier) you mentioned at above link.  The patch seems
> > like: @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring
> > *r, int is_sp,                 /* Reset n to the initial burst count
> > */                 n = max;
> > 
> > -               *old_head = r->prod.head; -               const
> > uint32_t cons_tail = r->cons.tail; +               *old_head =
> > atomic_load_acq_32(&r->prod.head); +               const uint32_t
> > cons_tail = atomic_load_acq_32(&r->cons.tail);
> > 
> > @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r,
> > int is_s                 /* Restore n as it may change every loop */
> >                 n = max;
> > 
> > -               *old_head = r->cons.head; -               const
> > uint32_t prod_tail = r->prod.tail; +               *old_head =
> > atomic_load_acq_32(&r->cons.head); +               const uint32_t
> > prod_tail = atomic_load_acq_32(&r->prod.tail)                 /* The
> > subtraction is done between two unsigned 32bits value
> >                  * (the result is always modulo 32 bits even if we
> > have                  * cons_head > prod_tail). So 'entries' is
> > always between 0                  * and size(ring)-1. */
> >
> > The half barrier patch passed the fuctional test.
> > 
> > As for the performance comparison on *arm64* (the debug patch is at
> > http://dpdk.org/ml/archives/dev/2017-October/079012.html), please
> > see the test results below:
> >
> > [case 1] old codes, no barrier
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      689275.001200      task-clock (msec)         #    9.771 CPUs utilized
> >               6223      context-switches          #    0.009 K/sec
> >                 10      cpu-migrations            #    0.000 K/sec
> >                653      page-faults               #    0.001 K/sec
> >      1721190914583      cycles                    #    2.497 GHz
> >      3363238266430      instructions              #    1.95  insn per cycle
> >    <not supported> branches
> >           27804740      branch-misses             #    0.00% of all branches
> >
> >       70.540618825 seconds time elapsed
> >
> > [case 2] full barrier with rte_smp_rmb()
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      582557.895850      task-clock (msec)         #    9.752 CPUs utilized
> >               5242      context-switches          #    0.009 K/sec
> >                 10      cpu-migrations            #    0.000 K/sec
> >                665      page-faults               #    0.001 K/sec
> >      1454360730055      cycles                    #    2.497 GHz
> >       587197839907      instructions              #    0.40  insn per cycle
> >    <not supported> branches
> >           27799687      branch-misses             #    0.00% of all branches
> >
> >       59.735582356 seconds time elapsed
> >
> > [case 3] half barrier with load_acquire
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      660758.877050      task-clock (msec)         #    9.764 CPUs utilized
> >               5982      context-switches          #    0.009 K/sec
> >                 11      cpu-migrations            #    0.000 K/sec
> >                657      page-faults               #    0.001 K/sec
> >      1649875318044      cycles                    #    2.497 GHz
> >       591583257765      instructions              #    0.36  insn per cycle
> >    <not supported> branches
> >           27994903      branch-misses             #    0.00% of all branches
> >
> >       67.672855107 seconds time elapsed
> >
> > Please see the context-switches in the perf results.
> > The test results sorted by time are:
> > full barrier < half barrier < no barrier
> >
> > AFAICT, in this case, the CPU reordering adds the possibility of
> > context switching and increases the running time.
> > 
> > Any ideas?
> 
> You could also try a case where you rearrange the rte_ring structure
> so that prod.head and cons.tail are parts of an union of a 64-bit
> variable and then read this 64-bit variable with one atomic read. I do
> not think that half barrier is even needed here with this approach, as
> long as you can really read the 64-bit variable fully at once. This
> should speed up. 
> 
> Cheers, --
Yes, that would work, but unfortunately it means mixing producer and
consumer data onto the same cacheline, which may have other negative
effects on performance (depending on arch/platform) due to cores
invalidating each other's cachelines.

/Bruce
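
For reference, a minimal sketch of the combined 64-bit read idea discussed
above; the union layout and names below are hypothetical (not the actual
struct rte_ring), and, as noted, it places producer and consumer indices on
the same cacheline:

    #include <stdint.h>

    /* Hypothetical layout: the two indices read by the producer are packed
     * into one 64-bit word so that a single atomic load returns a
     * consistent pair. */
    union ring_prod_view {
            uint64_t raw;
            struct {
                    uint32_t prod_head;     /* producer head index */
                    uint32_t cons_tail;     /* consumer tail index */
            };
    };

    static inline void
    ring_prod_view_load(const union ring_prod_view *v,
                        uint32_t *prod_head, uint32_t *cons_tail)
    {
            union ring_prod_view snap;

            /* one 64-bit atomic load: head and tail are sampled together */
            snap.raw = __atomic_load_n(&v->raw, __ATOMIC_RELAXED);
            *prod_head = snap.prod_head;
            *cons_tail = snap.cons_tail;
    }

A single 64-bit load yields a consistent head/tail pair, so no barrier is
needed between the two reads, at the cost of the cacheline sharing described
above.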
  
Jerin Jacob Oct. 23, 2017, 10:06 a.m. UTC | #25
-----Original Message-----
> Date: Mon, 23 Oct 2017 16:49:01 +0800
> From: Jia He <hejianet@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>  <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>  <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>  <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>  <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>  <bruce.richardson@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.4.0
> 
> Hi Jerin
> 
> 
> On 10/20/2017 1:43 PM, Jerin Jacob Wrote:
> > -----Original Message-----
> > > 
> [...]
> > > dependent on each other.
> > > Thus a memory barrier is necessary.
> > Yes. The barrier is necessary.
> > In fact, upstream freebsd fixed this issue for arm64. DPDK ring
> > implementation is derived from freebsd's buf_ring.h.
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > 
> > I think, the only outstanding issue is, how to reduce the performance
> > impact for arm64. I believe using acquire/release semantics instead
> > of rte_smp_rmb() will reduce the performance overhead like similar ring implementations below,
> > freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
> > 
> > Jia,
> > 1) Can you verify the use of acquire/release semantics fixes the problem in your
> > platform? like use of atomic_load_acq* in the reference code.
> > 2) If so, what is the overhead between acquire/release and plain smp_rmb()
> > barriers. Based on that we need to decide what path to take.
> I've tested 3 cases.  The new 3rd case is to use the load_acquire barrier
> (half barrier) you mentioned
> at above link.
> The patch seems like:
> @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>                 /* Reset n to the initial burst count */
>                 n = max;
> 
> -               *old_head = r->prod.head;
> -               const uint32_t cons_tail = r->cons.tail;
> +               *old_head = atomic_load_acq_32(&r->prod.head);
> +               const uint32_t cons_tail =
> atomic_load_acq_32(&r->cons.tail);
> 
> @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s
>                 /* Restore n as it may change every loop */
>                 n = max;
> 
> -               *old_head = r->cons.head;
> -               const uint32_t prod_tail = r->prod.tail;
> +               *old_head = atomic_load_acq_32(&r->cons.head);
> +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail)
>                 /* The subtraction is done between two unsigned 32bits value
>                  * (the result is always modulo 32 bits even if we have
>                  * cons_head > prod_tail). So 'entries' is always between 0
>                  * and size(ring)-1. */
> 
> The half barrier patch passed the functional test.
> 
> As for the performance comparison on *arm64* (the debug patch is at
> http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see the
> test results
> below:
> 
> [case 1] old codes, no barrier
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
> 
>      689275.001200      task-clock (msec)         #    9.771 CPUs utilized
>               6223      context-switches          #    0.009 K/sec
>                 10      cpu-migrations            #    0.000 K/sec
>                653      page-faults               #    0.001 K/sec
>      1721190914583      cycles                    #    2.497 GHz
>      3363238266430      instructions              #    1.95  insn per cycle
>    <not supported> branches
>           27804740      branch-misses             #    0.00% of all branches
> 
>       70.540618825 seconds time elapsed
> 
> [case 2] full barrier with rte_smp_rmb()
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
> 
>      582557.895850      task-clock (msec)         #    9.752 CPUs utilized
>               5242      context-switches          #    0.009 K/sec
>                 10      cpu-migrations            #    0.000 K/sec
>                665      page-faults               #    0.001 K/sec
>      1454360730055      cycles                    #    2.497 GHz
>       587197839907      instructions              #    0.40  insn per cycle
>    <not supported> branches
>           27799687      branch-misses             #    0.00% of all branches
> 
>       59.735582356 seconds time elapsed
> 
> [case 3] half barrier with load_acquire
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
> 
>      660758.877050      task-clock (msec)         #    9.764 CPUs utilized
>               5982      context-switches          #    0.009 K/sec
>                 11      cpu-migrations            #    0.000 K/sec
>                657      page-faults               #    0.001 K/sec
>      1649875318044      cycles                    #    2.497 GHz
>       591583257765      instructions              #    0.36  insn per cycle
>    <not supported> branches
>           27994903      branch-misses             #    0.00% of all branches
> 
>       67.672855107 seconds time elapsed
> 
> Please see the context-switches in the perf results.
> The test results sorted by time are:
> full barrier < half barrier < no barrier
> 
> AFAICT, in this case, the CPU reordering adds the possibility of
> context switching and increases the running time.

> Any ideas?

Regarding the performance test, it is better to use the ring perf test case
on _isolated_ cores to measure the impact on the number of enqueue/dequeue operations.

example:
./build/app/test -c 0xff -n 4
>>ring_perf_autotest

By default, arm64+dpdk will be using the el0 counter to measure the cycles. I
think, in your SoC, it will be running at 50 MHz or 100 MHz. So, you can
follow the scheme below to get an accurate cycle measurement:

See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
check: 44.2.2. High-resolution cycle counter
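
For illustration, a minimal sketch of how the relative cycle cost of
enqueue/dequeue can be compared once the counter is set up. It assumes the
DPDK 17.x bulk ring API and rte_rdtsc(); the iteration count and burst size
are arbitrary, and only the relative numbers between barrier variants matter:

    #include <stdio.h>
    #include <rte_cycles.h>
    #include <rte_ring.h>

    #define ITERS 1000000
    #define BULK  32

    /* Average cycles for one bulk enqueue+dequeue pair; the ring is assumed
     * to be created elsewhere and large enough, and return values are
     * ignored for brevity. */
    static void
    measure_ring_cycles(struct rte_ring *r, void **objs)
    {
            uint64_t start, end;
            unsigned int i;

            start = rte_rdtsc();
            for (i = 0; i < ITERS; i++) {
                    rte_ring_enqueue_bulk(r, objs, BULK, NULL);
                    rte_ring_dequeue_bulk(r, objs, BULK, NULL);
            }
            end = rte_rdtsc();

            printf("cycles per bulk enq+deq of %d objs: %.2f\n", BULK,
                   (double)(end - start) / (double)ITERS);
    }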
  
Jia He Oct. 24, 2017, 2:04 a.m. UTC | #26
Hi Jerin


On 10/23/2017 6:06 PM, Jerin Jacob Wrote:
> -----Original Message-----
>> Date: Mon, 23 Oct 2017 16:49:01 +0800
>> From: Jia He <hejianet@gmail.com>
>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>>   <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>>   "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>>   <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>>   <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>>   <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>>   <bruce.richardson@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>>   loading when doing enqueue/dequeue
>> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>>   Thunderbird/52.4.0
>>
>> Hi Jerin
>>
>>
>> On 10/20/2017 1:43 PM, Jerin Jacob Wrote:
>>> -----Original Message-----
>> [...]
>>>> dependant on each other.
>>>> Thus a memory barrier is neccessary.
>>> Yes. The barrier is necessary.
>>> In fact, upstream freebsd fixed this issue for arm64. DPDK ring
>>> implementation is derived from freebsd's buf_ring.h.
>>> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
>>>
>>> I think, the only outstanding issue is, how to reduce the performance
>>> impact for arm64. I believe using accurate/release semantics instead
>>> of rte_smp_rmb() will reduce the performance overhead like similar ring implementations below,
>>> freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
>>> odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
>>>
>>> Jia,
>>> 1) Can you verify the use of accurate/release semantics fixes the problem in your
>>> platform? like use of atomic_load_acq* in the reference code.
>>> 2) If so, What is the overhead between accurate/release and plane smp_smb()
>>> barriers. Based on that we need decide what path to take.
>> I've tested 3 cases.  The new 3rd case is to use the load_acquire barrier
>> (half barrier) you mentioned
>> at above link.
>> The patch seems like:
>> @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>>                  /* Reset n to the initial burst count */
>>                  n = max;
>>
>> -               *old_head = r->prod.head;
>> -               const uint32_t cons_tail = r->cons.tail;
>> +               *old_head = atomic_load_acq_32(&r->prod.head);
>> +               const uint32_t cons_tail =
>> atomic_load_acq_32(&r->cons.tail);
>>
>> @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s
>>                  /* Restore n as it may change every loop */
>>                  n = max;
>>
>> -               *old_head = r->cons.head;
>> -               const uint32_t prod_tail = r->prod.tail;
>> +               *old_head = atomic_load_acq_32(&r->cons.head);
>> +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail)
>>                  /* The subtraction is done between two unsigned 32bits value
>>                   * (the result is always modulo 32 bits even if we have
>>                   * cons_head > prod_tail). So 'entries' is always between 0
>>                   * and size(ring)-1. */
>>
>> The half barrier patch passed the fuctional test.
>>
>> As for the performance comparision on *arm64*(the debug patch is at
>> http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see the
>> test results
>> below:
>>
>> [case 1] old codes, no barrier
>> ============================================
>>   Performance counter stats for './test --no-huge -l 1-10':
>>
>>       689275.001200      task-clock (msec)         #    9.771 CPUs utilized
>>                6223      context-switches          #    0.009 K/sec
>>                  10      cpu-migrations            #    0.000 K/sec
>>                 653      page-faults               #    0.001 K/sec
>>       1721190914583      cycles                    #    2.497 GHz
>>       3363238266430      instructions              #    1.95  insn per cycle
>>     <not supported> branches
>>            27804740      branch-misses             #    0.00% of all branches
>>
>>        70.540618825 seconds time elapsed
>>
>> [case 2] full barrier with rte_smp_rmb()
>> ============================================
>>   Performance counter stats for './test --no-huge -l 1-10':
>>
>>       582557.895850      task-clock (msec)         #    9.752 CPUs utilized
>>                5242      context-switches          #    0.009 K/sec
>>                  10      cpu-migrations            #    0.000 K/sec
>>                 665      page-faults               #    0.001 K/sec
>>       1454360730055      cycles                    #    2.497 GHz
>>        587197839907      instructions              #    0.40  insn per cycle
>>     <not supported> branches
>>            27799687      branch-misses             #    0.00% of all branches
>>
>>        59.735582356 seconds time elapse
>>
>> [case 1] half barrier with load_acquire
>> ============================================
>>   Performance counter stats for './test --no-huge -l 1-10':
>>
>>       660758.877050      task-clock (msec)         #    9.764 CPUs utilized
>>                5982      context-switches          #    0.009 K/sec
>>                  11      cpu-migrations            #    0.000 K/sec
>>                 657      page-faults               #    0.001 K/sec
>>       1649875318044      cycles                    #    2.497 GHz
>>        591583257765      instructions              #    0.36  insn per cycle
>>     <not supported> branches
>>            27994903      branch-misses             #    0.00% of all branches
>>
>>        67.672855107 seconds time elapsed
>>
>> Please see the context-switches in the perf results
>> test result  sorted by time is:
>> full barrier < half barrier < no barrier
>>
>> AFAICT, in this case ,the cpu reordering will add the possibility for
>> context switching and
>> increase the running time.
>> Any ideas?
> Regarding performance test, it better to use ring perf test case
> on _isolated_ cores to measure impact on number of enqueue/dequeue operations.
>
> example:
> ./build/app/test -c 0xff -n 4
>>> ring_perf_autotest
It seems that on our arm64 server, the ring_perf_autotest finishes in a
few seconds.
Anything wrong with the configuration or environment setup?

root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
EAL: Detected 44 lcore(s)
EAL: Probing VFIO support...
APP: HPET is not enabled, using TSC as default timer
RTE>>per_lcore_autotest
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 2
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.02
MC empty dequeue: 0.04

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.31
SP/SC bulk enq/dequeue (size: 32): 0.05
MP/MC bulk enq/dequeue (size: 32): 0.09

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.12
MP/MC bulk enq/dequeue (size: 8): 0.39
SP/SC bulk enq/dequeue (size: 32): 0.04
MP/MC bulk enq/dequeue (size: 32): 0.12

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 0.37
MP/MC bulk enq/dequeue (size: 8): 0.92
SP/SC bulk enq/dequeue (size: 32): 0.12
MP/MC bulk enq/dequeue (size: 32): 0.26
Test OK
RTE>>

Cheers,
Jia
> By default, arm64+dpdk will be using el0 counter to measure the cycles. I
> think, in your SoC, it will be running at 50MHz or 100MHz.So, You can
> follow the below scheme to get accurate cycle measurement scheme:
>
> See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
> check: 44.2.2. High-resolution cycle counter
  
Jerin Jacob Oct. 25, 2017, 1:26 p.m. UTC | #27
-----Original Message-----
> Date: Tue, 24 Oct 2017 10:04:26 +0800
> From: Jia He <hejianet@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>  <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>  <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>  <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>  <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>  <bruce.richardson@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.4.0
> 
> Hi Jerin

Hi Jia,


> > example:
> > ./build/app/test -c 0xff -n 4
> > > > ring_perf_autotest
> Seem in our arm64 server, the ring_perf_autotest will be finished in a few
> seconds:

Yes. It just need a few seconds.

> Anything wrong about configuration or environment setup?

By default, arm64+dpdk will be using the el0 counter to measure the cycles. I
think, in your SoC, it will be running at 50 MHz or 100 MHz. So, you can
follow the scheme below to get an accurate cycle measurement:

See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
check: 44.2.2. High-resolution cycle counter

> 
> root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
> EAL: Detected 44 lcore(s)
> EAL: Probing VFIO support...
> APP: HPET is not enabled, using TSC as default timer
> RTE>>per_lcore_autotest
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 0
> MP/MC single enq/dequeue: 2
> SP/SC burst enq/dequeue (size: 8): 0

If you follow the above link, the value '0' will be replaced with more meaningful data.

> MP/MC burst enq/dequeue (size: 8): 0
> SP/SC burst enq/dequeue (size: 32): 0
> MP/MC burst enq/dequeue (size: 32): 0
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 0.02
> MC empty dequeue: 0.04
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 0.12
> MP/MC bulk enq/dequeue (size: 8): 0.31
> SP/SC bulk enq/dequeue (size: 32): 0.05
> MP/MC bulk enq/dequeue (size: 32): 0.09
> 
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 0.12
> MP/MC bulk enq/dequeue (size: 8): 0.39
> SP/SC bulk enq/dequeue (size: 32): 0.04
> MP/MC bulk enq/dequeue (size: 32): 0.12
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 0.37
> MP/MC bulk enq/dequeue (size: 8): 0.92
> SP/SC bulk enq/dequeue (size: 32): 0.12
> MP/MC bulk enq/dequeue (size: 32): 0.26
> Test OK
> RTE>>
> 
> Cheers,
> Jia
> > By default, arm64+dpdk will be using el0 counter to measure the cycles. I
> > think, in your SoC, it will be running at 50MHz or 100MHz.So, You can
> > follow the below scheme to get accurate cycle measurement scheme:
> > 
> > See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
> > check: 44.2.2. High-resolution cycle counter
>
  
Jia He Oct. 26, 2017, 2:27 a.m. UTC | #28
Hi Jerin


On 10/25/2017 9:26 PM, Jerin Jacob Wrote:
> -----Original Message-----
>> Date: Tue, 24 Oct 2017 10:04:26 +0800
>> From: Jia He <hejianet@gmail.com>
>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>>   <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>>   "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>>   <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>>   <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>>   <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>>   <bruce.richardson@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>>   loading when doing enqueue/dequeue
>> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>>   Thunderbird/52.4.0
>>
>> Hi Jerin
> Hi Jia,
>
>
>>> example:
>>> ./build/app/test -c 0xff -n 4
>>>>> ring_perf_autotest
>> Seem in our arm64 server, the ring_perf_autotest will be finished in a few
>> seconds:
> Yes. It just need a few seconds.
>
>> Anything wrong about configuration or environment setup?
> By default, arm64+dpdk will be using el0 counter to measure the cycles. I
> think, in your SoC, it will be running at 50MHz or 100MHz.So, You can
> follow the below scheme to get accurate cycle measurement scheme:
>
> See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
> check: 44.2.2. High-resolution cycle counter
Thank you for the suggestions.
But when I tried the ko module you provided to enable accurate cycle
measurement in user space, the
test result was as below:

root@nfv-demo01:~/dpdk/build/build/test/test# lsmod |grep pmu
pmu_el0_cycle_counter   262144  0
[old codes, without any patches]
============================================
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 0
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.00
MC empty dequeue: 0.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00
Test OK

[with full rte_smp_rmb barrier patch]
======================================
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 0
MP/MC single enq/dequeue: 0
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 0
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.00
MC empty dequeue: 0.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 0.00
MP/MC bulk enq/dequeue (size: 8): 0.00
SP/SC bulk enq/dequeue (size: 32): 0.00
MP/MC bulk enq/dequeue (size: 32): 0.00
Test OK
RTE>>

No difference, all the times are 0?

If I rmmod pmu_el0_cycle_counter and revise ./build/.config to
comment out the config line
#CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU=y

then the times are bigger than 0.

>> root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
>> EAL: Detected 44 lcore(s)
>> EAL: Probing VFIO support...
>> APP: HPET is not enabled, using TSC as default timer
>> RTE>>per_lcore_autotest
>> RTE>>ring_perf_autotest
>> ### Testing single element and burst enq/deq ###
>> SP/SC single enq/dequeue: 0
>> MP/MC single enq/dequeue: 2
>> SP/SC burst enq/dequeue (size: 8): 0
> If you follow the above link, The value '0' will be replaced with more meaning full data.
>
>> MP/MC burst enq/dequeue (size: 8): 0
>> SP/SC burst enq/dequeue (size: 32): 0
>> MP/MC burst enq/dequeue (size: 32): 0
>>
>> ### Testing empty dequeue ###
>> SC empty dequeue: 0.02
>> MC empty dequeue: 0.04
>>
>> ### Testing using a single lcore ###
>> SP/SC bulk enq/dequeue (size: 8): 0.12
>> MP/MC bulk enq/dequeue (size: 8): 0.31
>> SP/SC bulk enq/dequeue (size: 32): 0.05
>> MP/MC bulk enq/dequeue (size: 32): 0.09
>>
>> ### Testing using two hyperthreads ###
>> SP/SC bulk enq/dequeue (size: 8): 0.12
>> MP/MC bulk enq/dequeue (size: 8): 0.39
>> SP/SC bulk enq/dequeue (size: 32): 0.04
>> MP/MC bulk enq/dequeue (size: 32): 0.12
>>
>> ### Testing using two physical cores ###
>> SP/SC bulk enq/dequeue (size: 8): 0.37
>> MP/MC bulk enq/dequeue (size: 8): 0.92
>> SP/SC bulk enq/dequeue (size: 32): 0.12
>> MP/MC bulk enq/dequeue (size: 32): 0.26
>> Test OK
>> RTE>>
>>
>> Cheers,
>> Jia
>>> By default, arm64+dpdk will be using el0 counter to measure the cycles. I
>>> think, in your SoC, it will be running at 50MHz or 100MHz.So, You can
>>> follow the below scheme to get accurate cycle measurement scheme:
>>>
>>> See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
>>> check: 44.2.2. High-resolution cycle counter
  
Jia He Oct. 31, 2017, 2:55 a.m. UTC | #29
Hi Jerin

Do you think, as a next step, I need to implement the load_acquire
half barrier as per freebsd,

or find any other performance test case to compare the performance impact?
Thanks for any suggestions.

Cheers,
Jia

On 10/25/2017 9:26 PM, Jerin Jacob Wrote:
> -----Original Message-----
>> Date: Tue, 24 Oct 2017 10:04:26 +0800
>> From: Jia He <hejianet@gmail.com>
>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>>   <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>>   "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>>   <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>>   <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>>   <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>>   <bruce.richardson@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>>   loading when doing enqueue/dequeue
>> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>>   Thunderbird/52.4.0
>>
>> Hi Jerin
> Hi Jia,
>
>
>>> example:
>>> ./build/app/test -c 0xff -n 4
>>>>> ring_perf_autotest
>> Seem in our arm64 server, the ring_perf_autotest will be finished in a few
>> seconds:
> Yes. It just need a few seconds.
>
>> Anything wrong about configuration or environment setup?
> By default, arm64+dpdk will be using el0 counter to measure the cycles. I
> think, in your SoC, it will be running at 50MHz or 100MHz.So, You can
> follow the below scheme to get accurate cycle measurement scheme:
>
> See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
> check: 44.2.2. High-resolution cycle counter
>
>> root@ubuntu:/home/hj/dpdk/build/build/test/test# ./test -c 0xff -n 4
>> EAL: Detected 44 lcore(s)
>> EAL: Probing VFIO support...
>> APP: HPET is not enabled, using TSC as default timer
>> RTE>>per_lcore_autotest
>> RTE>>ring_perf_autotest
>> ### Testing single element and burst enq/deq ###
>> SP/SC single enq/dequeue: 0
>> MP/MC single enq/dequeue: 2
>> SP/SC burst enq/dequeue (size: 8): 0
> If you follow the above link, The value '0' will be replaced with more meaning full data.
>
>> MP/MC burst enq/dequeue (size: 8): 0
>> SP/SC burst enq/dequeue (size: 32): 0
>> MP/MC burst enq/dequeue (size: 32): 0
>>
>> ### Testing empty dequeue ###
>> SC empty dequeue: 0.02
>> MC empty dequeue: 0.04
>>
>> ### Testing using a single lcore ###
>> SP/SC bulk enq/dequeue (size: 8): 0.12
>> MP/MC bulk enq/dequeue (size: 8): 0.31
>> SP/SC bulk enq/dequeue (size: 32): 0.05
>> MP/MC bulk enq/dequeue (size: 32): 0.09
>>
>> ### Testing using two hyperthreads ###
>> SP/SC bulk enq/dequeue (size: 8): 0.12
>> MP/MC bulk enq/dequeue (size: 8): 0.39
>> SP/SC bulk enq/dequeue (size: 32): 0.04
>> MP/MC bulk enq/dequeue (size: 32): 0.12
>>
>> ### Testing using two physical cores ###
>> SP/SC bulk enq/dequeue (size: 8): 0.37
>> MP/MC bulk enq/dequeue (size: 8): 0.92
>> SP/SC bulk enq/dequeue (size: 32): 0.12
>> MP/MC bulk enq/dequeue (size: 32): 0.26
>> Test OK
>> RTE>>
>>
>> Cheers,
>> Jia
>>> By default, arm64+dpdk will be using el0 counter to measure the cycles. I
>>> think, in your SoC, it will be running at 50MHz or 100MHz.So, You can
>>> follow the below scheme to get accurate cycle measurement scheme:
>>>
>>> See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
>>> check: 44.2.2. High-resolution cycle counter
  
Jerin Jacob Oct. 31, 2017, 11:14 a.m. UTC | #30
-----Original Message-----
> Date: Tue, 31 Oct 2017 10:55:15 +0800
> From: Jia He <hejianet@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>  <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>  <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>  <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>  <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>  <bruce.richardson@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.4.0
> 
> Hi Jerin

Hi Jia,

> 
> Do you think  next step whether I need to implement the load_acquire half
> barrier as per freebsd

I did a quick prototype using C11 memory model (ACQUIRE/RELEASE) semantics
and tested on two arm64 platforms in Cavium (Platform A: non-OOO arm64 machine
and Platform B: OOO arm64 machine).

smp_rmb() performs better on Platform A;
acquire/release semantics perform better on Platform B.

Here is the patch:
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch

In terms of the next step:
- I am not sure of the cost associated with acquire/release semantics on x86 or ppc.
IMO, we need to have both options under conditional compilation
flags and let the target platform choose the best one.

Thoughts?
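
A rough sketch of that conditional-compilation idea, purely illustrative: the
flag name RTE_RING_USE_C11_MEM_MODEL is hypothetical, and whether the first
load needs ACQUIRE or can stay RELAXED is discussed further below in this
thread:

    #include <rte_atomic.h>
    #include <rte_ring.h>

    static inline void
    ring_load_cons_head_prod_tail(const struct rte_ring *r,
                                  uint32_t *old_head, uint32_t *prod_tail)
    {
    #ifdef RTE_RING_USE_C11_MEM_MODEL
            /* C11 path: acquire loads order the two reads without a full barrier */
            *old_head  = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
            *prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);
    #else
            /* barrier path: plain loads separated by a read barrier */
            *old_head = r->cons.head;
            rte_smp_rmb();
            *prod_tail = r->prod.tail;
    #endif
    }

The common __rte_ring_move_*_head() code would call one such helper, so only
the helper differs per platform.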

Here are the performance numbers:
- Both platforms are running at different frequencies, so the absolute numbers do
  not matter; just check the relative numbers.

Platform A: Performance numbers:
================================
no patch(Non arm64 OOO machine)
-------------------------------

SP/SC single enq/dequeue: 40
MP/MC single enq/dequeue: 282
SP/SC burst enq/dequeue (size: 8): 11
MP/MC burst enq/dequeue (size: 8): 42
SP/SC burst enq/dequeue (size: 32): 8
MP/MC burst enq/dequeue (size: 32): 16

### Testing empty dequeue ###
SC empty dequeue: 8.01
MC empty dequeue: 11.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 11.30
MP/MC bulk enq/dequeue (size: 8): 42.85
SP/SC bulk enq/dequeue (size: 32): 8.25
MP/MC bulk enq/dequeue (size: 32): 16.46

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 20.62
MP/MC bulk enq/dequeue (size: 8): 56.30
SP/SC bulk enq/dequeue (size: 32): 10.94
MP/MC bulk enq/dequeue (size: 32): 18.66
Test OK

# smp_rmb() patch((Non OOO arm64 machine)
http://dpdk.org/dev/patchwork/patch/30029/
-----------------------------------------

SP/SC single enq/dequeue: 42
MP/MC single enq/dequeue: 291
SP/SC burst enq/dequeue (size: 8): 12
MP/MC burst enq/dequeue (size: 8): 44
SP/SC burst enq/dequeue (size: 32): 8
MP/MC burst enq/dequeue (size: 32): 16

### Testing empty dequeue ###
SC empty dequeue: 13.01
MC empty dequeue: 15.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 11.60
MP/MC bulk enq/dequeue (size: 8): 44.32
SP/SC bulk enq/dequeue (size: 32): 8.60
MP/MC bulk enq/dequeue (size: 32): 16.50

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 20.95
MP/MC bulk enq/dequeue (size: 8): 56.90
SP/SC bulk enq/dequeue (size: 32): 10.90
MP/MC bulk enq/dequeue (size: 32): 18.78
Test OK
RTE>>

# c11 memory model patch((Non OOO arm64 machine)
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
---------------------------------------------------------------------------------------------
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 197
MP/MC single enq/dequeue: 328
SP/SC burst enq/dequeue (size: 8): 31
MP/MC burst enq/dequeue (size: 8): 50
SP/SC burst enq/dequeue (size: 32): 13
MP/MC burst enq/dequeue (size: 32): 18

### Testing empty dequeue ###
SC empty dequeue: 13.01
MC empty dequeue: 18.02

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 30.95
MP/MC bulk enq/dequeue (size: 8): 50.30
SP/SC bulk enq/dequeue (size: 32): 13.27
MP/MC bulk enq/dequeue (size: 32): 18.11

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 43.38
MP/MC bulk enq/dequeue (size: 8): 64.42
SP/SC bulk enq/dequeue (size: 32): 16.71
MP/MC bulk enq/dequeue (size: 32): 22.21


Platform B: Performance numbers:
==============================
#no patch(OOO arm64 machine)
----------------------------

### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 81
MP/MC single enq/dequeue: 207
SP/SC burst enq/dequeue (size: 8): 15
MP/MC burst enq/dequeue (size: 8): 31
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 11

### Testing empty dequeue ###
SC empty dequeue: 3.00
MC empty dequeue: 5.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 15.38
MP/MC bulk enq/dequeue (size: 8): 30.64
SP/SC bulk enq/dequeue (size: 32): 7.25
MP/MC bulk enq/dequeue (size: 32): 11.06

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 31.51
MP/MC bulk enq/dequeue (size: 8): 49.38
SP/SC bulk enq/dequeue (size: 32): 14.32
MP/MC bulk enq/dequeue (size: 32): 15.89

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 72.66
MP/MC bulk enq/dequeue (size: 8): 121.89
SP/SC bulk enq/dequeue (size: 32): 16.88
MP/MC bulk enq/dequeue (size: 32): 24.23
Test OK
RTE>>


# smp_rmb() patch((OOO arm64 machine)
http://dpdk.org/dev/patchwork/patch/30029/
-------------------------------------------

### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 152
MP/MC single enq/dequeue: 265
SP/SC burst enq/dequeue (size: 8): 24
MP/MC burst enq/dequeue (size: 8): 39
SP/SC burst enq/dequeue (size: 32): 9
MP/MC burst enq/dequeue (size: 32): 13

### Testing empty dequeue ###
SC empty dequeue: 31.01
MC empty dequeue: 32.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 24.26
MP/MC bulk enq/dequeue (size: 8): 39.52
SP/SC bulk enq/dequeue (size: 32): 9.47
MP/MC bulk enq/dequeue (size: 32): 13.31

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 40.29
MP/MC bulk enq/dequeue (size: 8): 59.57
SP/SC bulk enq/dequeue (size: 32): 17.34
MP/MC bulk enq/dequeue (size: 32): 21.58

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 79.05
MP/MC bulk enq/dequeue (size: 8): 153.46
SP/SC bulk enq/dequeue (size: 32): 26.41
MP/MC bulk enq/dequeue (size: 32): 38.37
Test OK
RTE>>


# c11 memory model patch((OOO arm64 machine)
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
----------------------------------------------------------------------------------------------
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 98
MP/MC single enq/dequeue: 130
SP/SC burst enq/dequeue (size: 8): 18
MP/MC burst enq/dequeue (size: 8): 22
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 9

### Testing empty dequeue ###
SC empty dequeue: 4.00
MC empty dequeue: 5.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.40
MP/MC bulk enq/dequeue (size: 8): 22.88
SP/SC bulk enq/dequeue (size: 32): 7.62
MP/MC bulk enq/dequeue (size: 32): 8.96

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 20.24
MP/MC bulk enq/dequeue (size: 8): 25.83
SP/SC bulk enq/dequeue (size: 32): 12.21
MP/MC bulk enq/dequeue (size: 32): 13.20

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 67.54
MP/MC bulk enq/dequeue (size: 8): 124.63
SP/SC bulk enq/dequeue (size: 32): 21.13
MP/MC bulk enq/dequeue (size: 32): 28.44
Test OK
RTE>>quit


> or find any other performance test case to compare the performance impact?

As far as I know, ring_perf_autotest is the best performance test.
If you have trouble using the "High-resolution cycle counter" on your platform, you
can still use ring_perf_autotest to compare the performance (as the relative
numbers are what matter).

Jerin

> Thanks for any suggestions.
> 
> Cheers,
> Jia
  
Jia He Nov. 1, 2017, 2:53 a.m. UTC | #31
Hi Jerin

Thanks for your suggestions. I will try to use a config macro to let it be
chosen by the user.

I need to point out one possible issue in your load_acq/store_rel patch

at 
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch

@@ -516,8 +541,13 @@ __rte_ring_move_cons_head(struct rte_ring *r, int 
is_sc,
          /* Restore n as it may change every loop */
          n = max;

+#if 0
          *old_head = r->cons.head;
          const uint32_t prod_tail = r->prod.tail;
+#else
+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);      --[1]
+        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);   --[2]
+#endif

At line [1], __ATOMIC_RELAXED is not enough for this case (tested on our
ARM64 server).

At line [2], __ATOMIC_ACQUIRE guarantees the 2nd load will not be reordered
before the 1st load, but it does not guarantee the 1st load will not be
reordered after the 2nd load. Please also refer to the freebsd implementation
you mentioned; they use __ATOMIC_ACQUIRE at line [1].

Should it be like instead?

+#else
+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
+        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, 
__ATOMIC_ACQUIRE);


Cheers,

Jia



On 10/31/2017 7:14 PM, Jerin Jacob Wrote:
> -----Original Message-----
>> Date: Tue, 31 Oct 2017 10:55:15 +0800
>> From: Jia He <hejianet@gmail.com>
>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>>   <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>>   "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>>   <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>>   <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>>   <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>>   <bruce.richardson@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>>   loading when doing enqueue/dequeue
>> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>>   Thunderbird/52.4.0
>>
>> Hi Jerin
> Hi Jia,
>
>> Do you think  next step whether I need to implement the load_acquire half
>> barrier as per freebsd
> I did a quick prototype using C11 memory model(ACQUIRE/RELEASE) schematics
> and tested on two arm64 platform in Cavium(Platform A: Non arm64 OOO machine)
> and Platform B: arm64 OOO machine)
>
> smp_rmb() performs better in Platform A:
> acquire/release semantics perform better in platform B:
>
> Here is the patch:
> https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
>
> In terms of next step:
> - I am not sure the cost associated with acquire/release semantics on x86 or ppc.
> IMO, We need to have both options under conditional compilation
> flags and let the target platform choose the best one.
>
> Thoughts?
>
> Here is the performance numbers:
> - Both platforms are running at different frequency, So absolute numbers does not
>    matter, Just check the relative numbers.
>
> Platform A: Performance numbers:
> ================================
> no patch(Non arm64 OOO machine)
> -------------------------------
>
> SP/SC single enq/dequeue: 40
> MP/MC single enq/dequeue: 282
> SP/SC burst enq/dequeue (size: 8): 11
> MP/MC burst enq/dequeue (size: 8): 42
> SP/SC burst enq/dequeue (size: 32): 8
> MP/MC burst enq/dequeue (size: 32): 16
>
> ### Testing empty dequeue ###
> SC empty dequeue: 8.01
> MC empty dequeue: 11.01
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 11.30
> MP/MC bulk enq/dequeue (size: 8): 42.85
> SP/SC bulk enq/dequeue (size: 32): 8.25
> MP/MC bulk enq/dequeue (size: 32): 16.46
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 20.62
> MP/MC bulk enq/dequeue (size: 8): 56.30
> SP/SC bulk enq/dequeue (size: 32): 10.94
> MP/MC bulk enq/dequeue (size: 32): 18.66
> Test OK
>
> # smp_rmb() patch((Non OOO arm64 machine)
> http://dpdk.org/dev/patchwork/patch/30029/
> -----------------------------------------
>
> SP/SC single enq/dequeue: 42
> MP/MC single enq/dequeue: 291
> SP/SC burst enq/dequeue (size: 8): 12
> MP/MC burst enq/dequeue (size: 8): 44
> SP/SC burst enq/dequeue (size: 32): 8
> MP/MC burst enq/dequeue (size: 32): 16
>
> ### Testing empty dequeue ###
> SC empty dequeue: 13.01
> MC empty dequeue: 15.01
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 11.60
> MP/MC bulk enq/dequeue (size: 8): 44.32
> SP/SC bulk enq/dequeue (size: 32): 8.60
> MP/MC bulk enq/dequeue (size: 32): 16.50
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 20.95
> MP/MC bulk enq/dequeue (size: 8): 56.90
> SP/SC bulk enq/dequeue (size: 32): 10.90
> MP/MC bulk enq/dequeue (size: 32): 18.78
> Test OK
> RTE>>
>
> # c11 memory model patch((Non OOO arm64 machine)
> https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
> ---------------------------------------------------------------------------------------------
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 197
> MP/MC single enq/dequeue: 328
> SP/SC burst enq/dequeue (size: 8): 31
> MP/MC burst enq/dequeue (size: 8): 50
> SP/SC burst enq/dequeue (size: 32): 13
> MP/MC burst enq/dequeue (size: 32): 18
>
> ### Testing empty dequeue ###
> SC empty dequeue: 13.01
> MC empty dequeue: 18.02
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 30.95
> MP/MC bulk enq/dequeue (size: 8): 50.30
> SP/SC bulk enq/dequeue (size: 32): 13.27
> MP/MC bulk enq/dequeue (size: 32): 18.11
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 43.38
> MP/MC bulk enq/dequeue (size: 8): 64.42
> SP/SC bulk enq/dequeue (size: 32): 16.71
> MP/MC bulk enq/dequeue (size: 32): 22.21
>
>
> Platform B: Performance numbers:
> ==============================
> #no patch(OOO arm64 machine)
> ----------------------------
>
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 81
> MP/MC single enq/dequeue: 207
> SP/SC burst enq/dequeue (size: 8): 15
> MP/MC burst enq/dequeue (size: 8): 31
> SP/SC burst enq/dequeue (size: 32): 7
> MP/MC burst enq/dequeue (size: 32): 11
>
> ### Testing empty dequeue ###
> SC empty dequeue: 3.00
> MC empty dequeue: 5.00
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 15.38
> MP/MC bulk enq/dequeue (size: 8): 30.64
> SP/SC bulk enq/dequeue (size: 32): 7.25
> MP/MC bulk enq/dequeue (size: 32): 11.06
>
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 31.51
> MP/MC bulk enq/dequeue (size: 8): 49.38
> SP/SC bulk enq/dequeue (size: 32): 14.32
> MP/MC bulk enq/dequeue (size: 32): 15.89
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 72.66
> MP/MC bulk enq/dequeue (size: 8): 121.89
> SP/SC bulk enq/dequeue (size: 32): 16.88
> MP/MC bulk enq/dequeue (size: 32): 24.23
> Test OK
> RTE>>
>
>
> # smp_rmb() patch((OOO arm64 machine)
> http://dpdk.org/dev/patchwork/patch/30029/
> -------------------------------------------
>
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 152
> MP/MC single enq/dequeue: 265
> SP/SC burst enq/dequeue (size: 8): 24
> MP/MC burst enq/dequeue (size: 8): 39
> SP/SC burst enq/dequeue (size: 32): 9
> MP/MC burst enq/dequeue (size: 32): 13
>
> ### Testing empty dequeue ###
> SC empty dequeue: 31.01
> MC empty dequeue: 32.01
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 24.26
> MP/MC bulk enq/dequeue (size: 8): 39.52
> SP/SC bulk enq/dequeue (size: 32): 9.47
> MP/MC bulk enq/dequeue (size: 32): 13.31
>
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 40.29
> MP/MC bulk enq/dequeue (size: 8): 59.57
> SP/SC bulk enq/dequeue (size: 32): 17.34
> MP/MC bulk enq/dequeue (size: 32): 21.58
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 79.05
> MP/MC bulk enq/dequeue (size: 8): 153.46
> SP/SC bulk enq/dequeue (size: 32): 26.41
> MP/MC bulk enq/dequeue (size: 32): 38.37
> Test OK
> RTE>>
>
>
> # c11 memory model patch((OOO arm64 machine)
> https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
> ----------------------------------------------------------------------------------------------
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 98
> MP/MC single enq/dequeue: 130
> SP/SC burst enq/dequeue (size: 8): 18
> MP/MC burst enq/dequeue (size: 8): 22
> SP/SC burst enq/dequeue (size: 32): 7
> MP/MC burst enq/dequeue (size: 32): 9
>
> ### Testing empty dequeue ###
> SC empty dequeue: 4.00
> MC empty dequeue: 5.00
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 17.40
> MP/MC bulk enq/dequeue (size: 8): 22.88
> SP/SC bulk enq/dequeue (size: 32): 7.62
> MP/MC bulk enq/dequeue (size: 32): 8.96
>
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 20.24
> MP/MC bulk enq/dequeue (size: 8): 25.83
> SP/SC bulk enq/dequeue (size: 32): 12.21
> MP/MC bulk enq/dequeue (size: 32): 13.20
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 67.54
> MP/MC bulk enq/dequeue (size: 8): 124.63
> SP/SC bulk enq/dequeue (size: 32): 21.13
> MP/MC bulk enq/dequeue (size: 32): 28.44
> Test OK
> RTE>>quit
>
>
>> or find any other performance test case to compare the performance impact?
> As far as I know, ring_perf_autotest is the better performance test.
> If you have trouble in using "High-resolution cycle counter" in your platform then also
> you can use ring_perf_auto test to compare the performance(as relative
> number matters)
>
> Jerin
>
>> Thanks for any suggestions.
>>
>> Cheers,
>> Jia
  
Jia He Nov. 1, 2017, 4:48 a.m. UTC | #32
Hi Jerin


On 10/31/2017 7:14 PM, Jerin Jacob Wrote:
> -----Original Message-----
>> Date: Tue, 31 Oct 2017 10:55:15 +0800
>> From: Jia He <hejianet@gmail.com>
>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>>   <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>>   "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>>   <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>>   <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>>   <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>>   <bruce.richardson@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>>   loading when doing enqueue/dequeue
>> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>>   Thunderbird/52.4.0
>>
>> Hi Jerin
> Hi Jia,
>
>> Do you think  next step whether I need to implement the load_acquire half
>> barrier as per freebsd
> I did a quick prototype using C11 memory model(ACQUIRE/RELEASE) schematics
> and tested on two arm64 platform in Cavium(Platform A: Non arm64 OOO machine)
> and Platform B: arm64 OOO machine)
Can you elaborate a bit on your non-OOO arm64 machine? As far as I know,
all arm64 servers use strong
memory order. Have I missed anything?
  
Jerin Jacob Nov. 1, 2017, 7:04 p.m. UTC | #33
Date: Thu, 2 Nov 2017 00:27:46 +0530
From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
To: Jia He <hejianet@gmail.com>
Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
	"Zhao, Bing" <ilovethull@163.com>,
	Olivier MATZ <olivier.matz@6wind.com>,
	"dev@dpdk.org" <dev@dpdk.org>,
	"jia.he@hxt-semitech.com" <jia.he@hxt-semitech.com>,
	"jie2.liu@hxt-semitech.com" <jie2.liu@hxt-semitech.com>,
	"bing.zhao@hxt-semitech.com" <bing.zhao@hxt-semitech.com>,
	"Richardson, Bruce" <bruce.richardson@intel.com>,
	jianbo.liu@arm.com, hemant.agrawal@nxp.com
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading
 when doing enqueue/dequeue
Message-ID: <20171101185723.GA18759@jerin>
References: <2601191342CEEE43887BDE71AB9772585FAAB570@IRSMSX103.ger.corp.intel.com>
 <3e580cd7-2854-d855-be9c-7c4ce06e3ed5@gmail.com>
 <20171020054319.GA4249@jerin>
 <ab7154a2-a9f8-f12e-b6a0-2805c2065e2e@gmail.com>
 <20171023100617.GA17957@jerin>
 <b0e6eae2-e61b-1946-ac44-d781225778e5@gmail.com>
 <20171025132642.GA13977@jerin>
 <b54ce545-b75d-9813-209f-4125bd76c7db@gmail.com>
 <20171031111433.GA21742@jerin>
 <69adfb00-4582-b362-0540-d1d9d6bcf6aa@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <69adfb00-4582-b362-0540-d1d9d6bcf6aa@gmail.com>
User-Agent: Mutt/1.9.1 (2017-09-22)

-----Original Message-----
> Date: Wed, 1 Nov 2017 10:53:12 +0800
> From: Jia He <hejianet@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>  <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>  <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>  <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>  <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>  <bruce.richardson@intel.com>, jianbo.liu@arm.com, hemant.agrawal@nxp.com
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.4.0
> 
> Hi Jerin
> 
> Thanks for your suggestions. I will try to use config macro to let it be
> chosen by user.

It is better if we avoid #ifdefs in the common code. I think you can
follow a scheme like examples/l3fwd/l3fwd_em_hlm_neon.h, where
the common code has all the generic routines and the differences are
moved to separate files to reduce #ifdef clutter.
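
A possible layout following that suggestion; the header names and the helper
below are hypothetical, shown only to illustrate the split (rte_ring.h would
include exactly one backend, so the enqueue/dequeue fast path itself stays
free of #ifdefs):

    #include <stdint.h>
    #include <rte_atomic.h>

    #ifdef RTE_USE_C11_MEM_MODEL            /* contents of "rte_ring_c11_mem.h" */
    static inline uint32_t
    __ring_load_acquire_32(const volatile uint32_t *p)
    {
            return __atomic_load_n(p, __ATOMIC_ACQUIRE);
    }
    #else                                   /* contents of "rte_ring_generic.h" */
    static inline uint32_t
    __ring_load_acquire_32(const volatile uint32_t *p)
    {
            uint32_t v = *p;

            rte_smp_rmb();  /* order this load before subsequent loads */
            return v;
    }
    #endif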

> 
> I need to point out one possible issue in your load_acq/store_rel patch
> 
> at https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
> 
> @@ -516,8 +541,13 @@ __rte_ring_move_cons_head(struct rte_ring *r, int
> is_sc,
>          /* Restore n as it may change every loop */
>          n = max;
> 
> +#if 0
>          *old_head = r->cons.head;
>          const uint32_t prod_tail = r->prod.tail;
> +#else
> +        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);      
>                --[1]
> +        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> __ATOMIC_ACQUIRE);   --[2]
> +#endif
> 
> line [1] __ATOMIC_RELAXED is not enough for this case(tested in our ARM64
> server).
> 
> line [2] __ATOMIC_ACQUIRE guarantee the 2nd load will not be reorded before
> the 1st load, but will not
> 
> guarantee the 1st load will not be reordered after the 2nd load. Please also

To me it looks the same. [1] cannot cross [2] is the same as [2] cannot cross
[1], if [1] and [2] are back to back. No?

> refer to your mentioned freebsd implementation. They use __ATOMIC_ACQUIRE at
> line [1].

If I understand it correctly, in freebsd they are planning to revisit
the usage of the first __ATOMIC_ACQUIRE in FreeBSD 12:
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170

> 
> Should it be like instead?
> 
> +#else
> +        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
> +        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> __ATOMIC_ACQUIRE);

It would be nice to see how much overhead it gives, i.e. back-to-back
__ATOMIC_ACQUIRE.


> 
> 
> Cheers,
> 
> Jia
> 
> 
> 
> On 10/31/2017 7:14 PM, Jerin Jacob Wrote:
> > -----Original Message-----
> > > Date: Tue, 31 Oct 2017 10:55:15 +0800
> > > From: Jia He <hejianet@gmail.com>
> > > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
> > >   <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
> > >   "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
> > >   <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
> > >   <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
> > >   <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
> > >   <bruce.richardson@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
> > >   loading when doing enqueue/dequeue
> > > User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
> > >   Thunderbird/52.4.0
> > > 
> > > Hi Jerin
> > Hi Jia,
> > 
> > > Do you think  next step whether I need to implement the load_acquire half
> > > barrier as per freebsd
> > I did a quick prototype using C11 memory model(ACQUIRE/RELEASE) schematics
> > and tested on two arm64 platform in Cavium(Platform A: Non arm64 OOO machine)
> > and Platform B: arm64 OOO machine)
> > 
> > smp_rmb() performs better in Platform A:
> > acquire/release semantics perform better in platform B:
> > 
> > Here is the patch:
> > https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
> > 
> > In terms of next step:
> > - I am not sure the cost associated with acquire/release semantics on x86 or ppc.
> > IMO, We need to have both options under conditional compilation
> > flags and let the target platform choose the best one.
> > 
> > Thoughts?
> > 
> > Here is the performance numbers:
> > - Both platforms are running at different frequency, So absolute numbers does not
> >    matter, Just check the relative numbers.
> > 
> > Platform A: Performance numbers:
> > ================================
> > no patch(Non arm64 OOO machine)
> > -------------------------------
> > 
> > SP/SC single enq/dequeue: 40
> > MP/MC single enq/dequeue: 282
> > SP/SC burst enq/dequeue (size: 8): 11
> > MP/MC burst enq/dequeue (size: 8): 42
> > SP/SC burst enq/dequeue (size: 32): 8
> > MP/MC burst enq/dequeue (size: 32): 16
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 8.01
> > MC empty dequeue: 11.01
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 11.30
> > MP/MC bulk enq/dequeue (size: 8): 42.85
> > SP/SC bulk enq/dequeue (size: 32): 8.25
> > MP/MC bulk enq/dequeue (size: 32): 16.46
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 20.62
> > MP/MC bulk enq/dequeue (size: 8): 56.30
> > SP/SC bulk enq/dequeue (size: 32): 10.94
> > MP/MC bulk enq/dequeue (size: 32): 18.66
> > Test OK
> > 
> > # smp_rmb() patch((Non OOO arm64 machine)
> > http://dpdk.org/dev/patchwork/patch/30029/
> > -----------------------------------------
> > 
> > SP/SC single enq/dequeue: 42
> > MP/MC single enq/dequeue: 291
> > SP/SC burst enq/dequeue (size: 8): 12
> > MP/MC burst enq/dequeue (size: 8): 44
> > SP/SC burst enq/dequeue (size: 32): 8
> > MP/MC burst enq/dequeue (size: 32): 16
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 13.01
> > MC empty dequeue: 15.01
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 11.60
> > MP/MC bulk enq/dequeue (size: 8): 44.32
> > SP/SC bulk enq/dequeue (size: 32): 8.60
> > MP/MC bulk enq/dequeue (size: 32): 16.50
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 20.95
> > MP/MC bulk enq/dequeue (size: 8): 56.90
> > SP/SC bulk enq/dequeue (size: 32): 10.90
> > MP/MC bulk enq/dequeue (size: 32): 18.78
> > Test OK
> > RTE>>
> > 
> > # c11 memory model patch((Non OOO arm64 machine)
> > https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
> > ---------------------------------------------------------------------------------------------
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 197
> > MP/MC single enq/dequeue: 328
> > SP/SC burst enq/dequeue (size: 8): 31
> > MP/MC burst enq/dequeue (size: 8): 50
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 18
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 13.01
> > MC empty dequeue: 18.02
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 30.95
> > MP/MC bulk enq/dequeue (size: 8): 50.30
> > SP/SC bulk enq/dequeue (size: 32): 13.27
> > MP/MC bulk enq/dequeue (size: 32): 18.11
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 43.38
> > MP/MC bulk enq/dequeue (size: 8): 64.42
> > SP/SC bulk enq/dequeue (size: 32): 16.71
> > MP/MC bulk enq/dequeue (size: 32): 22.21
> > 
> > 
> > Platform B: Performance numbers:
> > ================================
> > # no patch (OOO arm64 machine)
> > ----------------------------
> > 
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 81
> > MP/MC single enq/dequeue: 207
> > SP/SC burst enq/dequeue (size: 8): 15
> > MP/MC burst enq/dequeue (size: 8): 31
> > SP/SC burst enq/dequeue (size: 32): 7
> > MP/MC burst enq/dequeue (size: 32): 11
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 3.00
> > MC empty dequeue: 5.00
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 15.38
> > MP/MC bulk enq/dequeue (size: 8): 30.64
> > SP/SC bulk enq/dequeue (size: 32): 7.25
> > MP/MC bulk enq/dequeue (size: 32): 11.06
> > 
> > ### Testing using two hyperthreads ###
> > SP/SC bulk enq/dequeue (size: 8): 31.51
> > MP/MC bulk enq/dequeue (size: 8): 49.38
> > SP/SC bulk enq/dequeue (size: 32): 14.32
> > MP/MC bulk enq/dequeue (size: 32): 15.89
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 72.66
> > MP/MC bulk enq/dequeue (size: 8): 121.89
> > SP/SC bulk enq/dequeue (size: 32): 16.88
> > MP/MC bulk enq/dequeue (size: 32): 24.23
> > Test OK
> > RTE>>
> > 
> > 
> > # smp_rmb() patch (OOO arm64 machine)
> > http://dpdk.org/dev/patchwork/patch/30029/
> > -------------------------------------------
> > 
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 152
> > MP/MC single enq/dequeue: 265
> > SP/SC burst enq/dequeue (size: 8): 24
> > MP/MC burst enq/dequeue (size: 8): 39
> > SP/SC burst enq/dequeue (size: 32): 9
> > MP/MC burst enq/dequeue (size: 32): 13
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 31.01
> > MC empty dequeue: 32.01
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 24.26
> > MP/MC bulk enq/dequeue (size: 8): 39.52
> > SP/SC bulk enq/dequeue (size: 32): 9.47
> > MP/MC bulk enq/dequeue (size: 32): 13.31
> > 
> > ### Testing using two hyperthreads ###
> > SP/SC bulk enq/dequeue (size: 8): 40.29
> > MP/MC bulk enq/dequeue (size: 8): 59.57
> > SP/SC bulk enq/dequeue (size: 32): 17.34
> > MP/MC bulk enq/dequeue (size: 32): 21.58
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 79.05
> > MP/MC bulk enq/dequeue (size: 8): 153.46
> > SP/SC bulk enq/dequeue (size: 32): 26.41
> > MP/MC bulk enq/dequeue (size: 32): 38.37
> > Test OK
> > RTE>>
> > 
> > 
> > # c11 memory model patch (OOO arm64 machine)
> > https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
> > ----------------------------------------------------------------------------------------------
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 98
> > MP/MC single enq/dequeue: 130
> > SP/SC burst enq/dequeue (size: 8): 18
> > MP/MC burst enq/dequeue (size: 8): 22
> > SP/SC burst enq/dequeue (size: 32): 7
> > MP/MC burst enq/dequeue (size: 32): 9
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 4.00
> > MC empty dequeue: 5.00
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 17.40
> > MP/MC bulk enq/dequeue (size: 8): 22.88
> > SP/SC bulk enq/dequeue (size: 32): 7.62
> > MP/MC bulk enq/dequeue (size: 32): 8.96
> > 
> > ### Testing using two hyperthreads ###
> > SP/SC bulk enq/dequeue (size: 8): 20.24
> > MP/MC bulk enq/dequeue (size: 8): 25.83
> > SP/SC bulk enq/dequeue (size: 32): 12.21
> > MP/MC bulk enq/dequeue (size: 32): 13.20
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 67.54
> > MP/MC bulk enq/dequeue (size: 8): 124.63
> > SP/SC bulk enq/dequeue (size: 32): 21.13
> > MP/MC bulk enq/dequeue (size: 32): 28.44
> > Test OK
> > RTE>>quit
> > 
> > 
> > > or find any other performance test case to compare the performance impact?
> > As far as I know, ring_perf_autotest is the best performance test.
> > If you have trouble using the "High-resolution cycle counter" on your platform,
> > you can still use ring_perf_autotest to compare the performance (only the
> > relative numbers matter).
> > 
> > Jerin
> > 
> > > Thanks for any suggestions.
> > > 
> > > Cheers,
> > > Jia
>
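To make the "both options under conditional compilation flags" idea in the quoted report concrete, a build-time selection could look roughly like the sketch below. This is a hypothetical sketch only: the flag RTE_RING_USE_C11_MEM_MODEL and the helper __rte_ring_load_heads() are invented names, not part of DPDK or of the patches discussed here. Which __ATOMIC_* orderings the two loads actually need is exactly what the rest of this thread debates, so the C11 branch simply shows the conservative back-to-back __ATOMIC_ACQUIRE variant, while the fallback branch mirrors the rte_smp_rmb() patch shown at the bottom of this page.

#include <stdint.h>
#include <rte_atomic.h>   /* rte_smp_rmb() */
#include <rte_ring.h>     /* struct rte_ring */

#ifdef RTE_RING_USE_C11_MEM_MODEL
/* C11 flavour: the acquire load keeps the prod.tail read from being
 * hoisted above the cons.head read (conservative orderings shown).
 */
static inline uint32_t
__rte_ring_load_heads(const struct rte_ring *r, uint32_t *old_head)
{
	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
	return __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);
}
#else
/* Barrier flavour: an explicit read barrier between two plain loads. */
static inline uint32_t
__rte_ring_load_heads(const struct rte_ring *r, uint32_t *old_head)
{
	*old_head = r->cons.head;
	rte_smp_rmb();  /* prod.tail must not be read before cons.head */
	return r->prod.tail;
}
#endif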
  
Jerin Jacob Nov. 1, 2017, 7:10 p.m. UTC | #34
-----Original Message-----
> Date: Wed, 1 Nov 2017 12:48:31 +0800
> From: Jia He <hejianet@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>  <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>  <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>  <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>  <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>  <bruce.richardson@intel.com>, jianbo.liu@arm.com, hemant.agrawal@nxp.com
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.4.0
> 
> Hi Jerin
> 
> 
> On 10/31/2017 7:14 PM, Jerin Jacob Wrote:
> > -----Original Message-----
> > > Date: Tue, 31 Oct 2017 10:55:15 +0800
> > > From: Jia He <hejianet@gmail.com>
> > > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
> > >   <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
> > >   "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
> > >   <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
> > >   <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
> > >   <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
> > >   <bruce.richardson@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
> > >   loading when doing enqueue/dequeue
> > > User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
> > >   Thunderbird/52.4.0
> > > 
> > > Hi Jerin
> > Hi Jia,
> > 
> > > Do you think, as the next step, I need to implement the load_acquire half
> > > barrier as per FreeBSD?
> > I did a quick prototype using C11 memory model (ACQUIRE/RELEASE) semantics
> > and tested it on two arm64 platforms in Cavium (Platform A: non-OOO arm64 machine
> > and Platform B: OOO arm64 machine).
> Can you elaborate on your non-OOO arm64 machine? As far as I know, all
> arm64 servers have strong memory order. Have I missed anything?

It is implementation defined. Low-end arm64 boxes (non-server) may not have
complete OOO support.
  
Jia He Nov. 2, 2017, 1:09 a.m. UTC | #35
Hi Jerin


On 11/2/2017 3:04 AM, Jerin Jacob Wrote:
> Date: Thu, 2 Nov 2017 00:27:46 +0530
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> To: Jia He <hejianet@gmail.com>
> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
> 	"Zhao, Bing" <ilovethull@163.com>,
> 	Olivier MATZ <olivier.matz@6wind.com>,
> 	"dev@dpdk.org" <dev@dpdk.org>,
> 	"jia.he@hxt-semitech.com" <jia.he@hxt-semitech.com>,
> 	"jie2.liu@hxt-semitech.com" <jie2.liu@hxt-semitech.com>,
> 	"bing.zhao@hxt-semitech.com" <bing.zhao@hxt-semitech.com>,
> 	"Richardson, Bruce" <bruce.richardson@intel.com>,
> 	jianbo.liu@arm.com, hemant.agrawal@nxp.com
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading
>   when doing enqueue/dequeue
> Message-ID: <20171101185723.GA18759@jerin>
> References: <2601191342CEEE43887BDE71AB9772585FAAB570@IRSMSX103.ger.corp.intel.com>
>   <3e580cd7-2854-d855-be9c-7c4ce06e3ed5@gmail.com>
>   <20171020054319.GA4249@jerin>
>   <ab7154a2-a9f8-f12e-b6a0-2805c2065e2e@gmail.com>
>   <20171023100617.GA17957@jerin>
>   <b0e6eae2-e61b-1946-ac44-d781225778e5@gmail.com>
>   <20171025132642.GA13977@jerin>
>   <b54ce545-b75d-9813-209f-4125bd76c7db@gmail.com>
>   <20171031111433.GA21742@jerin>
>   <69adfb00-4582-b362-0540-d1d9d6bcf6aa@gmail.com>
> MIME-Version: 1.0
> Content-Type: text/plain; charset=iso-8859-1
> Content-Disposition: inline
> Content-Transfer-Encoding: 8bit
> In-Reply-To: <69adfb00-4582-b362-0540-d1d9d6bcf6aa@gmail.com>
> User-Agent: Mutt/1.9.1 (2017-09-22)
>
> -----Original Message-----
>> Date: Wed, 1 Nov 2017 10:53:12 +0800
>> From: Jia He <hejianet@gmail.com>
>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>>   <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>>   "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>>   <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>>   <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>>   <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>>   <bruce.richardson@intel.com>, jianbo.liu@arm.com, hemant.agrawal@nxp.com
>> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>>   loading when doing enqueue/dequeue
>> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>>   Thunderbird/52.4.0
>>
>> Hi Jerin
>>
>> Thanks for your suggestions. I will try to use a config macro to let it be
>> chosen by the user.
> It is better if we avoid #ifdefs in the common code. I think you can
> follow the scheme of examples/l3fwd/l3fwd_em_hlm_neon.h, where
> the common code keeps all the generic routines and the differences are
> moved to separate files to reduce #ifdef clutter.
>
>> I need to point out one possible issue in your load_acq/store_rel patch
>>
>> at https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
>>
>> @@ -516,8 +541,13 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>>           /* Restore n as it may change every loop */
>>           n = max;
>>
>> +#if 0
>>           *old_head = r->cons.head;
>>           const uint32_t prod_tail = r->prod.tail;
>> +#else
>> +        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);          --[1]
>> +        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);   --[2]
>> +#endif
>>
>> line [1]: __ATOMIC_RELAXED is not enough for this case (tested on our ARM64
>> server).
>>
>> line [2]: __ATOMIC_ACQUIRE guarantees the 2nd load will not be reordered before
>> the 1st load, but will not guarantee the 1st load will not be reordered after
>> the 2nd load. Please also
> To me it looks the same: "[1] cannot cross [2]" is the same as "[2] cannot cross
> [1]", if [1] and [2] are back to back. No?
No.
IIUC, load_acquire(a) is equivalent to the pseudo-code:
load(a)
one-direction-barrier()

instead of
one-direction-barrier()
load(a)

Thus, in the case below, load(a) and load(b) can still be reordered; this
is not a violation of the load_acquire semantics:
load(a)
load_acquire(b)

IIUC, the original implementation in
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
is not optimal.
I tested the following change and it works fine:

+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
          const uint32_t prod_tail = r->prod.tail;

i.e.
load_acquire(a)
load(b)


Cheers,
Jia
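To make the one-way-barrier argument above easier to follow, here is a small standalone sketch. The struct and function names are invented for illustration only; they are not from rte_ring or from either patch, and the comments state only the reordering rules, not whether a given combination is sufficient for the ring algorithm, which is what this thread is discussing.

#include <stdint.h>

/* Invented names, for illustration only. */
struct snapshot_src {
	uint32_t head;   /* plays the role of r->cons.head */
	uint32_t tail;   /* plays the role of r->prod.tail */
};

static inline void
take_snapshot(const struct snapshot_src *s, uint32_t *head, uint32_t *tail)
{
	/* Acquire on the FIRST load: loads after it in program order may
	 * not be moved before it, so the tail read cannot be hoisted above
	 * the head read.
	 */
	*head = __atomic_load_n(&s->head, __ATOMIC_ACQUIRE);
	*tail = __atomic_load_n(&s->tail, __ATOMIC_RELAXED);

	/* By contrast, acquire on the SECOND load only,
	 *   *head = __atomic_load_n(&s->head, __ATOMIC_RELAXED);
	 *   *tail = __atomic_load_n(&s->tail, __ATOMIC_ACQUIRE);
	 * keeps later accesses from moving before the tail read but does
	 * not stop the head read from sinking below it.
	 */
}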
  
Jia He Nov. 2, 2017, 8:57 a.m. UTC | #36
Hi, Jerin
please see my performance test below
On 11/2/2017 3:04 AM, Jerin Jacob Wrote:
[...]
> Should it be like this instead?
>
> +#else
> +        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
> +        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);
> It would be nice to see how much overhead it gives, i.e., back-to-back
> __ATOMIC_ACQUIRE.
I can NOT test ring_perf_autotest on our server because of something
wrong with the PMU counter. All the return values of rte_rdtsc() are 0,
with and without your provided ko module. I am still investigating the
reason.

I did measure the difference with my debug patch earlier; the difference
is minor, less than ±1%.
  
Jia He Nov. 3, 2017, 2:55 a.m. UTC | #37
Hi Jerin


On 11/2/2017 4:57 PM, Jia He Wrote:
>
> Hi, Jerin
> please see my performance test below
> On 11/2/2017 3:04 AM, Jerin Jacob Wrote:
> [...]
>> Should it be like this instead?
>>
>> +#else
>> +        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
>> +        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);
>> It would be nice to see how much overhead it gives, i.e., back-to-back
>> __ATOMIC_ACQUIRE.
> I can NOT test ring_perf_autotest on our server because of something
> wrong with the PMU counter. All the return values of rte_rdtsc() are 0,
> with and without your provided ko module. I am still investigating the
> reason.
>

Hi Jerin

As for the root cause of the rte_rdtsc issue, it might be that the PMU
counter frequency is too low in our arm64 server ("Amberwing" from Qualcomm):

[586990.057779] arch_timer_get_cntfrq()=20000000

Only 20MHz instead of 100M/200MHz, and CNTFRQ_EL0 is not even writable 
in kernel space.

Maybe the code in ring_perf_autotest needs to be changed?

e.g.

     printf("SC empty dequeue: %.2F\n",
             (double)(sc_end-sc_start) / iterations);
     printf("MC empty dequeue: %.2F\n",
             (double)(mc_end-mc_start) / iterations);

Otherwise the printed value is always 0 when the small time difference is divided by iterations.
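One way to keep some resolution on such a slow counter would be to report nanoseconds per operation alongside raw ticks. This is only a sketch of the idea, not a change agreed in this thread; it assumes rte_rdtsc() and rte_get_tsc_hz() from rte_cycles.h, and the helper name is made up:

#include <stdio.h>
#include <stdint.h>
#include <rte_cycles.h>   /* rte_rdtsc(), rte_get_tsc_hz() */

/* Sketch: report both ticks/op and ns/op so a 20 MHz counter still
 * produces a non-zero, comparable figure.
 */
static void
report_per_op(const char *name, uint64_t start, uint64_t end,
              unsigned int iterations)
{
	double ticks_per_op = (double)(end - start) / iterations;
	double ns_per_op = ticks_per_op * 1E9 / (double)rte_get_tsc_hz();

	printf("%s: %.2f ticks/op, %.2f ns/op\n", name, ticks_per_op, ns_per_op);
}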
  
Jerin Jacob Nov. 3, 2017, 12:47 p.m. UTC | #38
-----Original Message-----
> Date: Fri, 3 Nov 2017 10:55:40 +0800
> From: Jia He <hejianet@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>, "Zhao, Bing"
>  <ilovethull@163.com>, Olivier MATZ <olivier.matz@6wind.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "jia.he@hxt-semitech.com"
>  <jia.he@hxt-semitech.com>, "jie2.liu@hxt-semitech.com"
>  <jie2.liu@hxt-semitech.com>, "bing.zhao@hxt-semitech.com"
>  <bing.zhao@hxt-semitech.com>, "Richardson, Bruce"
>  <bruce.richardson@intel.com>, jianbo.liu@arm.com, hemant.agrawal@nxp.com
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.4.0
> 
> Hi Jerin
> 
> 
> On 11/2/2017 4:57 PM, Jia He Wrote:
> > 
> > Hi, Jerin
> > please see my performance test below
> > On 11/2/2017 3:04 AM, Jerin Jacob Wrote:
> > [...]
> > > Should it be like this instead?
> > > 
> > > +#else
> > > +        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
> > > +        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);
> > > It would be nice to see how much overhead it gives, i.e., back-to-back
> > > __ATOMIC_ACQUIRE.
> > I can NOT test ring_perf_autotest on our server because of something
> > wrong with the PMU counter. All the return values of rte_rdtsc() are 0,
> > with and without your provided ko module. I am still investigating the
> > reason.
> > 
> 
> Hi Jerin
> 
> As for the root cause of the rte_rdtsc issue, it might be that the PMU
> counter frequency is too low in our arm64 server ("Amberwing" from
> Qualcomm):
> 
> [586990.057779] arch_timer_get_cntfrq()=20000000
> 
> Only 20MHz instead of 100M/200MHz, and CNTFRQ_EL0 is not even writable in
> kernel space.

That may not be true; I guess Linux 'perf' writes those registers in kernel
space. Another option could be to write them from the ATF/secure boot loader, if that is the case.

> 
> Maybe the code in ring_perf_autotest needs to be changed?

Increase the "iterations" to measure @ 200MHz (see the sketch at the end of this message).

> 
> e.g.
> 
>     printf("SC empty dequeue: %.2F\n",
>             (double)(sc_end-sc_start) / iterations);
>     printf("MC empty dequeue: %.2F\n",
>             (double)(mc_end-mc_start) / iterations);
> 
> Otherwise the printed value is always 0 when the small time difference is divided by iterations.
> 
> 
> -- 
> Cheers,
> Jia
>
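A minimal sketch of the "increase the iterations" suggestion above. The helper name pick_iterations() and the ~100 ms budget are assumptions made for illustration only, not part of ring_perf_autotest; it assumes rte_get_tsc_hz() from rte_cycles.h:

#include <stdint.h>
#include <rte_cycles.h>   /* rte_get_tsc_hz() */

/* Sketch: choose enough iterations that each timed loop covers roughly
 * 100 ms of counter time (an arbitrary budget), so the measured delta
 * is well above one tick even on a 20 MHz counter.
 */
static uint64_t
pick_iterations(uint64_t guessed_ticks_per_op)
{
	uint64_t budget_ticks = rte_get_tsc_hz() / 10;

	if (guessed_ticks_per_op == 0)
		guessed_ticks_per_op = 1;
	return budget_ticks / guessed_ticks_per_op + 1;
}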
  

Patch

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..15c72e2 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -409,6 +409,10 @@  __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
 		n = max;
 
 		*old_head = r->prod.head;
+
+		/* load of cons.tail can't be reordered before prod.head */
+		rte_smp_rmb();
+
 		const uint32_t cons_tail = r->cons.tail;
 		/*
 		 *  The subtraction is done between two unsigned 32bits value
@@ -517,6 +521,10 @@  __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		n = max;
 
 		*old_head = r->cons.head;
+
+		/* load of prod.tail can't be reordered before cons.head */
+		rte_smp_rmb();
+
 		const uint32_t prod_tail = r->prod.tail;
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have