[RFC] eal: adjust barriers for IO on Armv8-a
diff mbox series

Message ID 20200511180637.22200-1-honnappa.nagarahalli@arm.com
State New
Headers show
Series
  • [RFC] eal: adjust barriers for IO on Armv8-a
Related show

Checks

Context Check Description
ci/Intel-compilation success Compilation OK
ci/checkpatch warning coding style issues

Commit Message

Honnappa Nagarahalli May 11, 2020, 6:06 p.m. UTC
Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy
atomicity memory model.

Armv8-a memory model has been strengthened to require
other-multi-copy atomicity. This property requires memory accesses
from an observer to become visible to all other observers
simultaneously [3]. This means

a) A write arriving at an endpoint shared between multiple CPUs is
   visible to all CPUs
b) A write that is visible to all CPUs is also visible to all other
   observers in the shareability domain

This allows for using cheaper DMB instructions in the place of DSB
for devices that are visible to all CPUs (i.e. devices that DPDK
caters to).

Please refer to [1], [2] and [3] for more information.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
[2] https://www.youtube.com/watch?v=i6DayghhA8Q
[3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Ruifeng Wang May 12, 2020, 6:18 a.m. UTC | #1
> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Tuesday, May 12, 2020 2:07 AM
> To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>;
> igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com;
> arybchenko@solarflare.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> 
> Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy
> atomicity memory model.
> 
> Armv8-a memory model has been strengthened to require other-multi-copy
> atomicity. This property requires memory accesses from an observer to
> become visible to all other observers simultaneously [3]. This means
> 
> a) A write arriving at an endpoint shared between multiple CPUs is
>    visible to all CPUs
> b) A write that is visible to all CPUs is also visible to all other
>    observers in the shareability domain
> 
> This allows for using cheaper DMB instructions in the place of DSB for devices
> that are visible to all CPUs (i.e. devices that DPDK caters to).
> 
> Please refer to [1], [2] and [3] for more information.
> 
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i
> d=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> b/lib/librte_eal/arm/include/rte_atomic_64.h
> index 7b7099cdc..e406411bb 100644
> --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> @@ -19,11 +19,11 @@ extern "C" {
>  #include <rte_compat.h>
>  #include <rte_debug.h>
> 
> -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> 
> -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> 
> -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> 
>  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> 
> @@ -37,9 +37,9 @@ extern "C" {
> 
>  #define rte_io_rmb() rte_rmb()
> 
> -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> +#define rte_cio_wmb() rte_wmb()
> 
> -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> +#define rte_cio_rmb() rte_rmb()
> 
>  /*------------------------ 128 bit atomic operations -------------------------*/
> 
> --
> 2.17.1

This change showed about 7% performance gain in testpmd single core NDR test.
Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Jerin Jacob May 12, 2020, 6:42 a.m. UTC | #2
On Tue, May 12, 2020 at 11:48 AM Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:
>
>
> > -----Original Message-----
> > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Sent: Tuesday, May 12, 2020 2:07 AM
> > To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> > Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>;
> > igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com;
> > arybchenko@solarflare.com; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>
> > Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> >
> > Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy
> > atomicity memory model.
> >
> > Armv8-a memory model has been strengthened to require other-multi-copy
> > atomicity. This property requires memory accesses from an observer to
> > become visible to all other observers simultaneously [3]. This means
> >
> > a) A write arriving at an endpoint shared between multiple CPUs is
> >    visible to all CPUs
> > b) A write that is visible to all CPUs is also visible to all other
> >    observers in the shareability domain
> >
> > This allows for using cheaper DMB instructions in the place of DSB for devices
> > that are visible to all CPUs (i.e. devices that DPDK caters to).
> >
> > Please refer to [1], [2] and [3] for more information.
> >
> > [1]
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i
> > d=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > index 7b7099cdc..e406411bb 100644
> > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > @@ -19,11 +19,11 @@ extern "C" {
> >  #include <rte_compat.h>
> >  #include <rte_debug.h>
> >
> > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> >
> > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> >
> > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> >
> >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> >
> > @@ -37,9 +37,9 @@ extern "C" {
> >
> >  #define rte_io_rmb() rte_rmb()
> >
> > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > +#define rte_cio_wmb() rte_wmb()
> >
> > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > +#define rte_cio_rmb() rte_rmb()
> >
> >  /*------------------------ 128 bit atomic operations -------------------------*/
> >
> > --
> > 2.17.1
>
> This change showed about 7% performance gain in testpmd single core NDR test.

I am trying to understand this patch wrt DPDK current usage model?

1)  Is performance improvement due to the fact that the PMD that you
are using it for testing suppose to use existing rte_cio_* but it was
using rte_[rw]mb?
2) In my understanding :
a) CPU to CPU barrier requirements are addressed by rte_smp_*
b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb

If (c) is true then we are violating the DPDK spec with change. Right?
This change will not be required if fastpath (CPU to Device) is using
rte_cio_*. Right?



> Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
>
Ruifeng Wang May 12, 2020, 8:02 a.m. UTC | #3
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, May 12, 2020 2:42 PM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>;
> igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com;
> arybchenko@solarflare.com; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
> 
> On Tue, May 12, 2020 at 11:48 AM Ruifeng Wang <Ruifeng.Wang@arm.com>
> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Sent: Tuesday, May 12, 2020 2:07 AM
> > > To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> > > Khaparde (ajit.khaparde@broadcom.com)
> <ajit.khaparde@broadcom.com>;
> > > igorch@amazon.com; thomas@monjalon.net;
> viacheslavo@mellanox.com;
> > > arybchenko@solarflare.com; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>
> > > Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > >
> > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > other-multi-copy atomicity memory model.
> > >
> > > Armv8-a memory model has been strengthened to require
> > > other-multi-copy atomicity. This property requires memory accesses
> > > from an observer to become visible to all other observers
> > > simultaneously [3]. This means
> > >
> > > a) A write arriving at an endpoint shared between multiple CPUs is
> > >    visible to all CPUs
> > > b) A write that is visible to all CPUs is also visible to all other
> > >    observers in the shareability domain
> > >
> > > This allows for using cheaper DMB instructions in the place of DSB
> > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to).
> > >
> > > Please refer to [1], [2] and [3] for more information.
> > >
> > > [1]
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/c
> > > ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > ---
> > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > index 7b7099cdc..e406411bb 100644
> > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > @@ -19,11 +19,11 @@ extern "C" {
> > >  #include <rte_compat.h>
> > >  #include <rte_debug.h>
> > >
> > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > >
> > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > >
> > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > >
> > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > >
> > > @@ -37,9 +37,9 @@ extern "C" {
> > >
> > >  #define rte_io_rmb() rte_rmb()
> > >
> > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > > +#define rte_cio_wmb() rte_wmb()
> > >
> > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > > +#define rte_cio_rmb() rte_rmb()
> > >
> > >  /*------------------------ 128 bit atomic operations
> > > -------------------------*/
> > >
> > > --
> > > 2.17.1
> >
> > This change showed about 7% performance gain in testpmd single core
> NDR test.
> 
> I am trying to understand this patch wrt DPDK current usage model?
> 
> 1)  Is performance improvement due to the fact that the PMD that you are
> using it for testing suppose to use existing rte_cio_* but it was using
> rte_[rw]mb?

This is part of the reason. There are also cases where rte_io_* was used and can be relaxed.
Such as: http://patches.dpdk.org/patch/68162/

> 2) In my understanding :
> a) CPU to CPU barrier requirements are addressed by rte_smp_*
> b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
> c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> 
> If (c) is true then we are violating the DPDK spec with change. Right?

Developers are still required to use correct barrier APIs for different use cases.
I think this change mitigates performance penalty when non optimal barrier is used.

> This change will not be required if fastpath (CPU to Device) is using rte_cio_*.
> Right?

See 1). Correct usage of rte_cio_* is not the whole.  
For some other use cases, such as barrier between accesses of different memory types, we can also use lighter barrier 'dmb'.

> 
> 
> 
> > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >
Jerin Jacob May 12, 2020, 8:28 a.m. UTC | #4
On Tue, May 12, 2020 at 1:32 PM Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Tuesday, May 12, 2020 2:42 PM
> > To: Ruifeng Wang <Ruifeng.Wang@arm.com>
> > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> > Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>;
> > igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com;
> > arybchenko@solarflare.com; nd <nd@arm.com>
> > Subject: Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
> >
> > On Tue, May 12, 2020 at 11:48 AM Ruifeng Wang <Ruifeng.Wang@arm.com>
> > wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Sent: Tuesday, May 12, 2020 2:07 AM
> > > > To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> > > > Khaparde (ajit.khaparde@broadcom.com)
> > <ajit.khaparde@broadcom.com>;
> > > > igorch@amazon.com; thomas@monjalon.net;
> > viacheslavo@mellanox.com;
> > > > arybchenko@solarflare.com; Honnappa Nagarahalli
> > > > <Honnappa.Nagarahalli@arm.com>
> > > > Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > > >
> > > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > > other-multi-copy atomicity memory model.
> > > >
> > > > Armv8-a memory model has been strengthened to require
> > > > other-multi-copy atomicity. This property requires memory accesses
> > > > from an observer to become visible to all other observers
> > > > simultaneously [3]. This means
> > > >
> > > > a) A write arriving at an endpoint shared between multiple CPUs is
> > > >    visible to all CPUs
> > > > b) A write that is visible to all CPUs is also visible to all other
> > > >    observers in the shareability domain
> > > >
> > > > This allows for using cheaper DMB instructions in the place of DSB
> > > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to).
> > > >
> > > > Please refer to [1], [2] and [3] for more information.
> > > >
> > > > [1]
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/c
> > > > ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > ---
> > > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > index 7b7099cdc..e406411bb 100644
> > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > @@ -19,11 +19,11 @@ extern "C" {
> > > >  #include <rte_compat.h>
> > > >  #include <rte_debug.h>
> > > >
> > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > > >
> > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > > >
> > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > > >
> > > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > > >
> > > > @@ -37,9 +37,9 @@ extern "C" {
> > > >
> > > >  #define rte_io_rmb() rte_rmb()
> > > >
> > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > +#define rte_cio_wmb() rte_wmb()
> > > >
> > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > +#define rte_cio_rmb() rte_rmb()
> > > >
> > > >  /*------------------------ 128 bit atomic operations
> > > > -------------------------*/
> > > >
> > > > --
> > > > 2.17.1
> > >
> > > This change showed about 7% performance gain in testpmd single core
> > NDR test.
> >
> > I am trying to understand this patch wrt DPDK current usage model?
> >
> > 1)  Is performance improvement due to the fact that the PMD that you are
> > using it for testing suppose to use existing rte_cio_* but it was using
> > rte_[rw]mb?
>
> This is part of the reason. There are also cases where rte_io_* was used and can be relaxed.
> Such as: http://patches.dpdk.org/patch/68162/
>
> > 2) In my understanding :
> > a) CPU to CPU barrier requirements are addressed by rte_smp_*
> > b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
> > c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> >
> > If (c) is true then we are violating the DPDK spec with change. Right?
>
> Developers are still required to use correct barrier APIs for different use cases.
> I think this change mitigates performance penalty when non optimal barrier is used.

But does it violate the contract?  We are using rte_[rw]mb as a low
performance/heavyweight
for all the cases. I think that is the contract to DPDK consumers. For
different requirment,
We have a specific API. IMO, It makes sense to change the fastpath code for more
fine granted barriers based on the need rather than changing the
generic one to lightweight.
i.e rte_[rw]wb is the superset that works on all cases and use
customized one for the specific
use case.

>
> > This change will not be required if fastpath (CPU to Device) is using rte_cio_*.
> > Right?
>
> See 1). Correct usage of rte_cio_* is not the whole.
> For some other use cases, such as barrier between accesses of different memory types, we can also use lighter barrier 'dmb'.
>
> >
> >
> >
> > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >
Honnappa Nagarahalli May 12, 2020, 9:44 p.m. UTC | #5
<snip>

> > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > > >
> > > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > > other-multi-copy atomicity memory model.
> > > >
> > > > Armv8-a memory model has been strengthened to require
> > > > other-multi-copy atomicity. This property requires memory accesses
> > > > from an observer to become visible to all other observers
> > > > simultaneously [3]. This means
> > > >
> > > > a) A write arriving at an endpoint shared between multiple CPUs is
> > > >    visible to all CPUs
> > > > b) A write that is visible to all CPUs is also visible to all other
> > > >    observers in the shareability domain
> > > >
> > > > This allows for using cheaper DMB instructions in the place of DSB
> > > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to).
> > > >
> > > > Please refer to [1], [2] and [3] for more information.
> > > >
> > > > [1]
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > > > /c ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > ---
> > > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > index 7b7099cdc..e406411bb 100644
> > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > @@ -19,11 +19,11 @@ extern "C" {
> > > >  #include <rte_compat.h>
> > > >  #include <rte_debug.h>
> > > >
> > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > > >
> > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > > >
> > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > > >
> > > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > > >
> > > > @@ -37,9 +37,9 @@ extern "C" {
> > > >
> > > >  #define rte_io_rmb() rte_rmb()
> > > >
> > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > +#define rte_cio_wmb() rte_wmb()
> > > >
> > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > +#define rte_cio_rmb() rte_rmb()
> > > >
> > > >  /*------------------------ 128 bit atomic operations
> > > > -------------------------*/
> > > >
> > > > --
> > > > 2.17.1
> > >
> > > This change showed about 7% performance gain in testpmd single core
> > NDR test.
> >
> > I am trying to understand this patch wrt DPDK current usage model?
> >
> > 1)  Is performance improvement due to the fact that the PMD that you
> > are using it for testing suppose to use existing rte_cio_* but it was
> > using rte_[rw]mb?
No, it is supposed to use rte_[rw]mb for x86.

> 
> This is part of the reason. There are also cases where rte_io_* was used and
> can be relaxed.
> Such as: http://patches.dpdk.org/patch/68162/
> 
> > 2) In my understanding :
> > a) CPU to CPU barrier requirements are addressed by rte_smp_*
> > b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
> > c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> >
> > If (c) is true then we are violating the DPDK spec with change. Right?
No, we are not. Essentially, due to the other-multi-copy atomicity behavior of the architecture, we are saying 'DMB OSH*' is enough to achieve (c).

> 
> Developers are still required to use correct barrier APIs for different use cases.
> I think this change mitigates performance penalty when non optimal barrier is
> used.
> 
> > This change will not be required if fastpath (CPU to Device) is using
> rte_cio_*.
> > Right?
Yes. It is required when the fastpath uses rte_[rw]mb.

> 
> See 1). Correct usage of rte_cio_* is not the whole.
> For some other use cases, such as barrier between accesses of different
> memory types, we can also use lighter barrier 'dmb'.
> 
> >
> >
> >
> > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >
Jerin Jacob May 13, 2020, 2:49 p.m. UTC | #6
On Wed, May 13, 2020 at 3:14 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> > > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > > > >
> > > > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > > > other-multi-copy atomicity memory model.
> > > > >
> > > > > Armv8-a memory model has been strengthened to require
> > > > > other-multi-copy atomicity. This property requires memory accesses
> > > > > from an observer to become visible to all other observers
> > > > > simultaneously [3]. This means
> > > > >
> > > > > a) A write arriving at an endpoint shared between multiple CPUs is
> > > > >    visible to all CPUs
> > > > > b) A write that is visible to all CPUs is also visible to all other
> > > > >    observers in the shareability domain
> > > > >
> > > > > This allows for using cheaper DMB instructions in the place of DSB
> > > > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to).
> > > > >
> > > > > Please refer to [1], [2] and [3] for more information.
> > > > >
> > > > > [1]
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > > > > /c ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > > > >
> > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > ---
> > > > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > index 7b7099cdc..e406411bb 100644
> > > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > @@ -19,11 +19,11 @@ extern "C" {
> > > > >  #include <rte_compat.h>
> > > > >  #include <rte_debug.h>
> > > > >
> > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > > > >
> > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > >
> > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > >
> > > > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > > > >
> > > > > @@ -37,9 +37,9 @@ extern "C" {
> > > > >
> > > > >  #define rte_io_rmb() rte_rmb()
> > > > >
> > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > > +#define rte_cio_wmb() rte_wmb()
> > > > >
> > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > > +#define rte_cio_rmb() rte_rmb()
> > > > >
> > > > >  /*------------------------ 128 bit atomic operations
> > > > > -------------------------*/
> > > > >
> > > > > --
> > > > > 2.17.1
> > > >
> > > > This change showed about 7% performance gain in testpmd single core
> > > NDR test.
> > >
> > > I am trying to understand this patch wrt DPDK current usage model?
> > >
> > > 1)  Is performance improvement due to the fact that the PMD that you
> > > are using it for testing suppose to use existing rte_cio_* but it was
> > > using rte_[rw]mb?
> No, it is supposed to use rte_[rw]mb for x86.

Why drivers using rte_[rw]in fastpath, IMO, rte_io_[rw]b and rte_cio_[rw]b
created for this pupose.

But I understand, in x86 it is mapped to rte_compiler_barrier(). Is it
correct from x86 PoV?
@Ananyev, Konstantin @Richardson, Bruce ?

For x86:
#define rte_io_wmb() rte_compiler_barrier()
#define rte_io_rmb() rte_compiler_barrier()
#define rte_cio_wmb() rte_compiler_barrier()
#define rte_cio_rmb() rte_compiler_barrier()


>
> >
> > This is part of the reason. There are also cases where rte_io_* was used and
> > can be relaxed.
> > Such as: http://patches.dpdk.org/patch/68162/
> >
> > > 2) In my understanding :
> > > a) CPU to CPU barrier requirements are addressed by rte_smp_*
> > > b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
> > > c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> > >
> > > If (c) is true then we are violating the DPDK spec with change. Right?
> No, we are not. Essentially, due to the other-multi-copy atomicity behavior of the architecture, we are saying 'DMB OSH*' is enough to achieve (c).

Yeah. Probably from userspace POV it should be OK to use "DMB OSH*" to
have the barrier between 4 of them?

1) Global memory (BSS and Data sections), Not mapped as a hugepage.
2) Hugepage memory
3) IOVA memory
4) PCI register read/write

Do we need to worry about anything else which is specific to DSB?
example, TLB related flush etc.

If we are talking this path then rte_cio_[rw]mb() has no meaning in
DPDK as an abstraction as it was created for arm64 for this specific
purpose.
If we can meet all DPDK usecse with DMB OSH then probably we can
deprecate rte_cio_wmb to avoid confusion.

>
> >
> > Developers are still required to use correct barrier APIs for different use cases.
> > I think this change mitigates performance penalty when non optimal barrier is
> > used.
> >
> > > This change will not be required if fastpath (CPU to Device) is using
> > rte_cio_*.
> > > Right?
> Yes. It is required when the fastpath uses rte_[rw]mb.
>
> >
> > See 1). Correct usage of rte_cio_* is not the whole.
> > For some other use cases, such as barrier between accesses of different
> > memory types, we can also use lighter barrier 'dmb'.
> >
> > >
> > >
> > >
> > > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > >
>
Honnappa Nagarahalli May 14, 2020, 1:02 a.m. UTC | #7
<snip>
> >
> > > > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > > > > >
> > > > > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > > > > other-multi-copy atomicity memory model.
> > > > > >
> > > > > > Armv8-a memory model has been strengthened to require
> > > > > > other-multi-copy atomicity. This property requires memory
> > > > > > accesses from an observer to become visible to all other
> > > > > > observers simultaneously [3]. This means
> > > > > >
> > > > > > a) A write arriving at an endpoint shared between multiple CPUs is
> > > > > >    visible to all CPUs
> > > > > > b) A write that is visible to all CPUs is also visible to all other
> > > > > >    observers in the shareability domain
> > > > > >
> > > > > > This allows for using cheaper DMB instructions in the place of
> > > > > > DSB for devices that are visible to all CPUs (i.e. devices that DPDK
> caters to).
> > > > > >
> > > > > > Please refer to [1], [2] and [3] for more information.
> > > > > >
> > > > > > [1]
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux
> > > > > > .git /c ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > > > > >
> > > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > > <honnappa.nagarahalli@arm.com>
> > > > > > ---
> > > > > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > index 7b7099cdc..e406411bb 100644
> > > > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > @@ -19,11 +19,11 @@ extern "C" {  #include <rte_compat.h>
> > > > > > #include <rte_debug.h>
> > > > > >
> > > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > > > > >
> > > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > > >
> > > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > > >
> > > > > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > > > > >
> > > > > > @@ -37,9 +37,9 @@ extern "C" {
> > > > > >
> > > > > >  #define rte_io_rmb() rte_rmb()
> > > > > >
> > > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : :
> > > > > > "memory")
> > > > > > +#define rte_cio_wmb() rte_wmb()
> > > > > >
> > > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : :
> > > > > > "memory")
> > > > > > +#define rte_cio_rmb() rte_rmb()
> > > > > >
> > > > > >  /*------------------------ 128 bit atomic operations
> > > > > > -------------------------*/
> > > > > >
> > > > > > --
> > > > > > 2.17.1
> > > > >
> > > > > This change showed about 7% performance gain in testpmd single
> > > > > core
> > > > NDR test.
> > > >
> > > > I am trying to understand this patch wrt DPDK current usage model?
> > > >
> > > > 1)  Is performance improvement due to the fact that the PMD that
> > > > you are using it for testing suppose to use existing rte_cio_* but
> > > > it was using rte_[rw]mb?
> > No, it is supposed to use rte_[rw]mb for x86.
> 
> Why drivers using rte_[rw]in fastpath, IMO, rte_io_[rw]b and rte_cio_[rw]b
> created for this pupose.
> 
> But I understand, in x86 it is mapped to rte_compiler_barrier(). Is it correct
> from x86 PoV?
> @Ananyev, Konstantin @Richardson, Bruce ?
> 
> For x86:
> #define rte_io_wmb() rte_compiler_barrier() #define rte_io_rmb()
> rte_compiler_barrier() #define rte_cio_wmb() rte_compiler_barrier() #define
> rte_cio_rmb() rte_compiler_barrier()
We need a barrier API with 'DMB OSH*' for Arm and '*fence' for x86. My understanding is, '*fence' is required when WC memory is used in x86.

Also, from Arm architecture perspective, effectively we are saying that 'DSB' is not required for portable drivers.

> 
> 
> >
> > >
> > > This is part of the reason. There are also cases where rte_io_* was
> > > used and can be relaxed.
> > > Such as: http://patches.dpdk.org/patch/68162/
> > >
> > > > 2) In my understanding :
> > > > a) CPU to CPU barrier requirements are addressed by rte_smp_*
> > > > b) CPU to DMA/Device barrier requirements are addressed by
> > > > rte_cio_*
> > > > c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> > > >
> > > > If (c) is true then we are violating the DPDK spec with change. Right?
> > No, we are not. Essentially, due to the other-multi-copy atomicity behavior
> of the architecture, we are saying 'DMB OSH*' is enough to achieve (c).
> 
> Yeah. Probably from userspace POV it should be OK to use "DMB OSH*" to
> have the barrier between 4 of them?
> 
> 1) Global memory (BSS and Data sections), Not mapped as a hugepage.
> 2) Hugepage memory
> 3) IOVA memory
> 4) PCI register read/write
> 
> Do we need to worry about anything else which is specific to DSB?
> example, TLB related flush etc.
Yes, things like TLB flush or self modifying code still need DSB. But, my understanding is we do not have such code in DPDK and such code will be platform specific.

> 
> If we are talking this path then rte_cio_[rw]mb() has no meaning in DPDK as
> an abstraction as it was created for arm64 for this specific purpose.
> If we can meet all DPDK usecse with DMB OSH then probably we can
> deprecate rte_cio_wmb to avoid confusion.
Agree, rte_cio_*mb is confusing to me. We could deprecate those.

I see Octeon TX/TX2 drivers using rte_*mb. Do you see any issues with this change in those drivers?

This is a very fundamental change, we need more feedback from others working with Arm platforms.

> 
> >
> > >
> > > Developers are still required to use correct barrier APIs for different use
> cases.
> > > I think this change mitigates performance penalty when non optimal
> > > barrier is used.
> > >
> > > > This change will not be required if fastpath (CPU to Device) is
> > > > using
> > > rte_cio_*.
> > > > Right?
> > Yes. It is required when the fastpath uses rte_[rw]mb.
> >
> > >
> > > See 1). Correct usage of rte_cio_* is not the whole.
> > > For some other use cases, such as barrier between accesses of
> > > different memory types, we can also use lighter barrier 'dmb'.
> > >
> > > >
> > > >
> > > >
> > > > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > >
> >

Patch
diff mbox series

diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h
index 7b7099cdc..e406411bb 100644
--- a/lib/librte_eal/arm/include/rte_atomic_64.h
+++ b/lib/librte_eal/arm/include/rte_atomic_64.h
@@ -19,11 +19,11 @@  extern "C" {
 #include <rte_compat.h>
 #include <rte_debug.h>
 
-#define rte_mb() asm volatile("dsb sy" : : : "memory")
+#define rte_mb() asm volatile("dmb osh" : : : "memory")
 
-#define rte_wmb() asm volatile("dsb st" : : : "memory")
+#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
 
-#define rte_rmb() asm volatile("dsb ld" : : : "memory")
+#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
 
 #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
 
@@ -37,9 +37,9 @@  extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
-#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
+#define rte_cio_wmb() rte_wmb()
 
-#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
+#define rte_cio_rmb() rte_rmb()
 
 /*------------------------ 128 bit atomic operations -------------------------*/