[v2,1/3] eal/arm64: relax the io barrier for aarch64

Message ID 1576811391-19131-2-git-send-email-gavin.hu@arm.com (mailing list archive)
State Superseded, archived
Delegated to: David Marchand
Headers
Series relax io barrier for aarch64 and use smp barriers for virtual pci memory |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/Intel-compilation success Compilation OK

Commit Message

Gavin Hu Dec. 20, 2019, 3:09 a.m. UTC
  Armv8's peripheral coherence order is a total order on all reads and writes
to that peripheral.[1]

The peripheral coherence order for a memory-mapped peripheral signifies the
order in which accesses arrive at the endpoint.  For a read or a write RW1
and a read or a write RW2 to the same peripheral, then RW1 will appear in
the peripheral coherence order for the peripheral before RW2 if either of
the following cases apply:
 1. RW1 and RW2 are accesses using Non-cacheable or Device attributes and
    RW1 is Ordered-before RW2.
 2. RW1 and RW2 are accesses using Device-nGnRE or Device-nGnRnE attributes
    and RW1 appears in program order before RW2.

On arm platforms, all the PCI resources are mapped to nGnRE device memory
[2], the above case 2 holds true, that means the peripheral coherence order
applies here and just a compiler barrier is sufficient for rte io barriers.

[1] Section B2.3.4 of ARMARM, https://developer.arm.com/docs/ddi0487/lates
t/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
tree/drivers/pci/pci-sysfs.c#n1204

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
  

Comments

Jerin Jacob Dec. 20, 2019, 3:33 a.m. UTC | #1
On Fri, Dec 20, 2019 at 8:40 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> Armv8's peripheral coherence order is a total order on all reads and writes
> to that peripheral.[1]
>
> The peripheral coherence order for a memory-mapped peripheral signifies the
> order in which accesses arrive at the endpoint.  For a read or a write RW1
> and a read or a write RW2 to the same peripheral, then RW1 will appear in
> the peripheral coherence order for the peripheral before RW2 if either of
> the following cases apply:
>  1. RW1 and RW2 are accesses using Non-cacheable or Device attributes and
>     RW1 is Ordered-before RW2.
>  2. RW1 and RW2 are accesses using Device-nGnRE or Device-nGnRnE attributes
>     and RW1 appears in program order before RW2.


This is true if RW1 and RW2 addresses are device memory. i.e the
registers in the  PCI bar address.
If RW1 is DDR address which is been used by the controller(say NIC
ring descriptor) then there will be an issue.
For example Intel i40e driver, the admin queue update in Host DDR
memory and it updates the doorbell.
In such a case, this patch will create an issue. Correct? Have you
checked this patch with ARM64 + XL710 controllers?

Some of the legacy code is missing such barriers, that's the reason
for adding rte_io_* barrier.

>
> On arm platforms, all the PCI resources are mapped to nGnRE device memory
> [2], the above case 2 holds true, that means the peripheral coherence order
> applies here and just a compiler barrier is sufficient for rte io barriers.
>
> [1] Section B2.3.4 of ARMARM, https://developer.arm.com/docs/ddi0487/lates
> t/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> tree/drivers/pci/pci-sysfs.c#n1204
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> ---
>  lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> index 859ae12..fd63956 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> @@ -34,11 +34,11 @@ extern "C" {
>
>  #define rte_smp_rmb() dmb(ishld)
>
> -#define rte_io_mb() rte_mb()
> +#define rte_io_mb() rte_compiler_barrier()
>
> -#define rte_io_wmb() rte_wmb()
> +#define rte_io_wmb() rte_compiler_barrier()
>
> -#define rte_io_rmb() rte_rmb()
> +#define rte_io_rmb() rte_compiler_barrier()
>
>  #define rte_cio_wmb() dmb(oshst)
>
> --
> 2.7.4
>
  
Jerin Jacob Dec. 20, 2019, 3:38 a.m. UTC | #2
On Fri, Dec 20, 2019 at 9:03 AM Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> On Fri, Dec 20, 2019 at 8:40 AM Gavin Hu <gavin.hu@arm.com> wrote:
> >
> > Armv8's peripheral coherence order is a total order on all reads and writes
> > to that peripheral.[1]
> >
> > The peripheral coherence order for a memory-mapped peripheral signifies the
> > order in which accesses arrive at the endpoint.  For a read or a write RW1
> > and a read or a write RW2 to the same peripheral, then RW1 will appear in
> > the peripheral coherence order for the peripheral before RW2 if either of
> > the following cases apply:
> >  1. RW1 and RW2 are accesses using Non-cacheable or Device attributes and
> >     RW1 is Ordered-before RW2.
> >  2. RW1 and RW2 are accesses using Device-nGnRE or Device-nGnRnE attributes
> >     and RW1 appears in program order before RW2.
>
>
> This is true if RW1 and RW2 addresses are device memory. i.e the
> registers in the  PCI bar address.
> If RW1 is DDR address which is been used by the controller(say NIC
> ring descriptor) then there will be an issue.
> For example Intel i40e driver, the admin queue update in Host DDR
> memory and it updates the doorbell.
> In such a case, this patch will create an issue. Correct? Have you
> checked this patch with ARM64 + XL710 controllers?
>
> Some of the legacy code is missing such barriers, that's the reason
> for adding rte_io_* barrier.


More details:

https://dev.dpdk.narkive.com/DpIRqDuy/dpdk-dev-patch-v2-i40e-fix-eth-i40e-dev-init-sequence-on-thunderx

>
> >
> > On arm platforms, all the PCI resources are mapped to nGnRE device memory
> > [2], the above case 2 holds true, that means the peripheral coherence order
> > applies here and just a compiler barrier is sufficient for rte io barriers.
> >
> > [1] Section B2.3.4 of ARMARM, https://developer.arm.com/docs/ddi0487/lates
> > t/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> > tree/drivers/pci/pci-sysfs.c#n1204
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > ---
> >  lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > index 859ae12..fd63956 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > @@ -34,11 +34,11 @@ extern "C" {
> >
> >  #define rte_smp_rmb() dmb(ishld)
> >
> > -#define rte_io_mb() rte_mb()
> > +#define rte_io_mb() rte_compiler_barrier()
> >
> > -#define rte_io_wmb() rte_wmb()
> > +#define rte_io_wmb() rte_compiler_barrier()
> >
> > -#define rte_io_rmb() rte_rmb()
> > +#define rte_io_rmb() rte_compiler_barrier()
> >
> >  #define rte_cio_wmb() dmb(oshst)
> >
> > --
> > 2.7.4
> >
  
Gavin Hu Dec. 20, 2019, 4:19 a.m. UTC | #3
Hi Jerin,

Thanks for review, inline comments, 

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Friday, December 20, 2019 11:38 AM
> To: Gavin Hu <Gavin.Hu@arm.com>
> Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> <david.marchand@redhat.com>; thomas@monjalon.net;
> rasland@mellanox.com; maxime.coquelin@redhat.com;
> tiwei.bie@intel.com; hemant.agrawal@nxp.com; jerinj@marvell.com;
> Pavan Nikhilesh <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> aarch64
> 
> On Fri, Dec 20, 2019 at 9:03 AM Jerin Jacob <jerinjacobk@gmail.com>
> wrote:
> >
> > On Fri, Dec 20, 2019 at 8:40 AM Gavin Hu <gavin.hu@arm.com> wrote:
> > >
> > > Armv8's peripheral coherence order is a total order on all reads and
> writes
> > > to that peripheral.[1]
> > >
> > > The peripheral coherence order for a memory-mapped peripheral
> signifies the
> > > order in which accesses arrive at the endpoint.  For a read or a write
> RW1
> > > and a read or a write RW2 to the same peripheral, then RW1 will appear
> in
> > > the peripheral coherence order for the peripheral before RW2 if either
> of
> > > the following cases apply:
> > >  1. RW1 and RW2 are accesses using Non-cacheable or Device attributes
> and
> > >     RW1 is Ordered-before RW2.
> > >  2. RW1 and RW2 are accesses using Device-nGnRE or Device-nGnRnE
> attributes
> > >     and RW1 appears in program order before RW2.
> >
> >
> > This is true if RW1 and RW2 addresses are device memory. i.e the
> > registers in the  PCI bar address.
> > If RW1 is DDR address which is been used by the controller(say NIC
> > ring descriptor) then there will be an issue.
> > For example Intel i40e driver, the admin queue update in Host DDR
> > memory and it updates the doorbell.
> > In such a case, this patch will create an issue. Correct? Have you
> > checked this patch with ARM64 + XL710 controllers?

This patch relaxes the rte_io_*mb barriers for pure PCI device memory accesses.

For mixed accesses of DDR and PCI device memory, rte_smp_*mb(DMB ISH) is not sufficient.
But rte_cio_*mb(DMB OSH) is sufficient and can be used.

> >
> > Some of the legacy code is missing such barriers, that's the reason
> > for adding rte_io_* barrier.
> 
> 
> More details:
> 
> https://dev.dpdk.narkive.com/DpIRqDuy/dpdk-dev-patch-v2-i40e-fix-eth-
> i40e-dev-init-sequence-on-thunderx
> 
> >
> > >
> > > On arm platforms, all the PCI resources are mapped to nGnRE device
> memory
> > > [2], the above case 2 holds true, that means the peripheral coherence
> order
> > > applies here and just a compiler barrier is sufficient for rte io barriers.
> > >
> > > [1] Section B2.3.4 of ARMARM,
> https://developer.arm.com/docs/ddi0487/lates
> > > t/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-
> profile
> > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> > > tree/drivers/pci/pci-sysfs.c#n1204
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > ---
> > >  lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > > index 859ae12..fd63956 100644
> > > --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > > +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > > @@ -34,11 +34,11 @@ extern "C" {
> > >
> > >  #define rte_smp_rmb() dmb(ishld)
> > >
> > > -#define rte_io_mb() rte_mb()
> > > +#define rte_io_mb() rte_compiler_barrier()
> > >
> > > -#define rte_io_wmb() rte_wmb()
> > > +#define rte_io_wmb() rte_compiler_barrier()
> > >
> > > -#define rte_io_rmb() rte_rmb()
> > > +#define rte_io_rmb() rte_compiler_barrier()
> > >
> > >  #define rte_cio_wmb() dmb(oshst)
> > >
> > > --
> > > 2.7.4
> > >
  
Jerin Jacob Dec. 20, 2019, 4:34 a.m. UTC | #4
On Fri, Dec 20, 2019 at 9:49 AM Gavin Hu <Gavin.Hu@arm.com> wrote:
>
> Hi Jerin,
>
> Thanks for review, inline comments,
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Friday, December 20, 2019 11:38 AM
> > To: Gavin Hu <Gavin.Hu@arm.com>
> > Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> > <david.marchand@redhat.com>; thomas@monjalon.net;
> > rasland@mellanox.com; maxime.coquelin@redhat.com;
> > tiwei.bie@intel.com; hemant.agrawal@nxp.com; jerinj@marvell.com;
> > Pavan Nikhilesh <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> > <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> > aarch64
> >
> > On Fri, Dec 20, 2019 at 9:03 AM Jerin Jacob <jerinjacobk@gmail.com>
> > wrote:
> > >
> > > On Fri, Dec 20, 2019 at 8:40 AM Gavin Hu <gavin.hu@arm.com> wrote:
> > > >
> > > > Armv8's peripheral coherence order is a total order on all reads and
> > writes
> > > > to that peripheral.[1]
> > > >
> > > > The peripheral coherence order for a memory-mapped peripheral
> > signifies the
> > > > order in which accesses arrive at the endpoint.  For a read or a write
> > RW1
> > > > and a read or a write RW2 to the same peripheral, then RW1 will appear
> > in
> > > > the peripheral coherence order for the peripheral before RW2 if either
> > of
> > > > the following cases apply:
> > > >  1. RW1 and RW2 are accesses using Non-cacheable or Device attributes
> > and
> > > >     RW1 is Ordered-before RW2.
> > > >  2. RW1 and RW2 are accesses using Device-nGnRE or Device-nGnRnE
> > attributes
> > > >     and RW1 appears in program order before RW2.
> > >
> > >
> > > This is true if RW1 and RW2 addresses are device memory. i.e the
> > > registers in the  PCI bar address.
> > > If RW1 is DDR address which is been used by the controller(say NIC
> > > ring descriptor) then there will be an issue.
> > > For example Intel i40e driver, the admin queue update in Host DDR
> > > memory and it updates the doorbell.
> > > In such a case, this patch will create an issue. Correct? Have you
> > > checked this patch with ARM64 + XL710 controllers?
>
> This patch relaxes the rte_io_*mb barriers for pure PCI device memory accesses.

Yes. This would break cases for mixed access fro i40e drivers.

>
> For mixed accesses of DDR and PCI device memory, rte_smp_*mb(DMB ISH) is not sufficient.
> But rte_cio_*mb(DMB OSH) is sufficient and can be used.

Yes. Let me share a bit of history.

1) There are a lot of drivers(initially developed in x86) that have
mixed access and don't have any barriers as x86 does not need it.
2) rte_io introduced to fix that
3) Item (2) introduced the performance issues in the fast path as an
optimization rte_cio_* introduced.

So in the current of the scheme of things, we have APIs to FIX
portability issue(rte_io) and performance issue(rte_cio).
IMO, we may not need any change in infra code now. If you think, the
documentation is missing then we can enhance it.
If we make infra change then again drivers needs to be updated and tested.
  
Gavin Hu Dec. 20, 2019, 6:32 a.m. UTC | #5
Hi Jerin,

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Friday, December 20, 2019 12:34 PM
> To: Gavin Hu <Gavin.Hu@arm.com>
> Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> <david.marchand@redhat.com>; thomas@monjalon.net;
> rasland@mellanox.com; maxime.coquelin@redhat.com;
> tiwei.bie@intel.com; hemant.agrawal@nxp.com; jerinj@marvell.com;
> Pavan Nikhilesh <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> aarch64
> 
> On Fri, Dec 20, 2019 at 9:49 AM Gavin Hu <Gavin.Hu@arm.com> wrote:
> >
> > Hi Jerin,
> >
> > Thanks for review, inline comments,
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Friday, December 20, 2019 11:38 AM
> > > To: Gavin Hu <Gavin.Hu@arm.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> > > <david.marchand@redhat.com>; thomas@monjalon.net;
> > > rasland@mellanox.com; maxime.coquelin@redhat.com;
> > > tiwei.bie@intel.com; hemant.agrawal@nxp.com; jerinj@marvell.com;
> > > Pavan Nikhilesh <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> > > <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce
> Kong
> > > <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier
> for
> > > aarch64
> > >
> > > On Fri, Dec 20, 2019 at 9:03 AM Jerin Jacob <jerinjacobk@gmail.com>
> > > wrote:
> > > >
> > > > On Fri, Dec 20, 2019 at 8:40 AM Gavin Hu <gavin.hu@arm.com> wrote:
> > > > >
> > > > > Armv8's peripheral coherence order is a total order on all reads and
> > > writes
> > > > > to that peripheral.[1]
> > > > >
> > > > > The peripheral coherence order for a memory-mapped peripheral
> > > signifies the
> > > > > order in which accesses arrive at the endpoint.  For a read or a write
> > > RW1
> > > > > and a read or a write RW2 to the same peripheral, then RW1 will
> appear
> > > in
> > > > > the peripheral coherence order for the peripheral before RW2 if
> either
> > > of
> > > > > the following cases apply:
> > > > >  1. RW1 and RW2 are accesses using Non-cacheable or Device
> attributes
> > > and
> > > > >     RW1 is Ordered-before RW2.
> > > > >  2. RW1 and RW2 are accesses using Device-nGnRE or Device-
> nGnRnE
> > > attributes
> > > > >     and RW1 appears in program order before RW2.
> > > >
> > > >
> > > > This is true if RW1 and RW2 addresses are device memory. i.e the
> > > > registers in the  PCI bar address.
> > > > If RW1 is DDR address which is been used by the controller(say NIC
> > > > ring descriptor) then there will be an issue.
> > > > For example Intel i40e driver, the admin queue update in Host DDR
> > > > memory and it updates the doorbell.
> > > > In such a case, this patch will create an issue. Correct? Have you
> > > > checked this patch with ARM64 + XL710 controllers?
> >
> > This patch relaxes the rte_io_*mb barriers for pure PCI device memory
> accesses.
> 
> Yes. This would break cases for mixed access fro i40e drivers.
> 
> >
> > For mixed accesses of DDR and PCI device memory, rte_smp_*mb(DMB
> ISH) is not sufficient.
> > But rte_cio_*mb(DMB OSH) is sufficient and can be used.
> 
> Yes. Let me share a bit of history.
> 
> 1) There are a lot of drivers(initially developed in x86) that have
> mixed access and don't have any barriers as x86 does not need it.
> 2) rte_io introduced to fix that
> 3) Item (2) introduced the performance issues in the fast path as an
> optimization rte_cio_* introduced.
Exactly, this patch is to mitigate the performance issues introduced by rte_io('dsb' is too much and unnecessary here).
Rte_cio instead is definitely required for mixed access. 
> 
> So in the current of the scheme of things, we have APIs to FIX
> portability issue(rte_io) and performance issue(rte_cio).
> IMO, we may not need any change in infra code now. If you think, the
> documentation is missing then we can enhance it.
> If we make infra change then again drivers needs to be updated and tested.
No changes for rte_cio, the semantics, and definitions of rte_io does not change either, if limited the scope to PCI, which is the case in DPDK context(?).
The change lies only in the implementation, right? 

Just looked at the link you shared and found i40 driver is missing rte_cio_*mb in i40e_asq_send_command, but the old rte_io_*mb rescued. 
Will submit a new patch in this series to used rte_cio together with new relaxed rte_io and do more tests. 

Yes, this is a big change, also a big optimization, for aarch64, in our tests it has very positive results.
But as the case in i40e, we must pay attention to where rte_cio was missing but rescued by old rte_io(but not by new rte_io).
  
Jerin Jacob Dec. 20, 2019, 6:55 a.m. UTC | #6
On Fri, Dec 20, 2019 at 12:02 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
>
> Hi Jerin,

Hi Gavin,


> > > > > >
> > > > > > The peripheral coherence order for a memory-mapped peripheral
> > > > signifies the
> > > > > > order in which accesses arrive at the endpoint.  For a read or a write
> > > > RW1
> > > > > > and a read or a write RW2 to the same peripheral, then RW1 will
> > appear
> > > > in
> > > > > > the peripheral coherence order for the peripheral before RW2 if
> > either
> > > > of
> > > > > > the following cases apply:
> > > > > >  1. RW1 and RW2 are accesses using Non-cacheable or Device
> > attributes
> > > > and
> > > > > >     RW1 is Ordered-before RW2.
> > > > > >  2. RW1 and RW2 are accesses using Device-nGnRE or Device-
> > nGnRnE
> > > > attributes
> > > > > >     and RW1 appears in program order before RW2.
> > > > >
> > > > >
> > > > > This is true if RW1 and RW2 addresses are device memory. i.e the
> > > > > registers in the  PCI bar address.
> > > > > If RW1 is DDR address which is been used by the controller(say NIC
> > > > > ring descriptor) then there will be an issue.
> > > > > For example Intel i40e driver, the admin queue update in Host DDR
> > > > > memory and it updates the doorbell.
> > > > > In such a case, this patch will create an issue. Correct? Have you
> > > > > checked this patch with ARM64 + XL710 controllers?
> > >
> > > This patch relaxes the rte_io_*mb barriers for pure PCI device memory
> > accesses.
> >
> > Yes. This would break cases for mixed access fro i40e drivers.
> >
> > >
> > > For mixed accesses of DDR and PCI device memory, rte_smp_*mb(DMB
> > ISH) is not sufficient.
> > > But rte_cio_*mb(DMB OSH) is sufficient and can be used.
> >
> > Yes. Let me share a bit of history.
> >
> > 1) There are a lot of drivers(initially developed in x86) that have
> > mixed access and don't have any barriers as x86 does not need it.
> > 2) rte_io introduced to fix that
> > 3) Item (2) introduced the performance issues in the fast path as an
> > optimization rte_cio_* introduced.
> Exactly, this patch is to mitigate the performance issues introduced by rte_io('dsb' is too much and unnecessary here).
> Rte_cio instead is definitely required for mixed access.
> >
> > So in the current of the scheme of things, we have APIs to FIX
> > portability issue(rte_io) and performance issue(rte_cio).
> > IMO, we may not need any change in infra code now. If you think, the
> > documentation is missing then we can enhance it.
> > If we make infra change then again drivers needs to be updated and tested.
> No changes for rte_cio, the semantics, and definitions of rte_io does not change either, if limited the scope to PCI, which is the case in DPDK context(?).
> The change lies only in the implementation, right?
>
> Just looked at the link you shared and found i40 driver is missing rte_cio_*mb in i40e_asq_send_command, but the old rte_io_*mb rescued.
> Will submit a new patch in this series to used rte_cio together with new relaxed rte_io and do more tests.
>
> Yes, this is a big change, also a big optimization, for aarch64, in our tests it has very positive results.

It will be optimization only when if we are changing in the fast path.
In the slow path, it does not matter.
I think, the First step should be to use rte_cio_* wherever it is
coherent memory used in _fast path_. I think, Almost every driver
fixed that.

I am not against this patch(changing the slow path to use rte_cio*
from rte_io* and virtio changes associated with that).
If you are taking that patch, pay attention to all the drivers in the
tree which is using rte_io* for mixed access in slowpath.

> But as the case in i40e, we must pay attention to where rte_cio was missing but rescued by old rte_io(but not by new rte_io).
>
>
  
Gavin Hu Dec. 23, 2019, 9:14 a.m. UTC | #7
Hi Jerin,

I think we are on the same page with regard to the problem, and the situations, thanks for illuminating the historical background of the two barriers.
About the solution, I added inline comments. 

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Friday, December 20, 2019 2:56 PM
> To: Gavin Hu <Gavin.Hu@arm.com>
> Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> <david.marchand@redhat.com>; thomas@monjalon.net;
> rasland@mellanox.com; maxime.coquelin@redhat.com;
> tiwei.bie@intel.com; hemant.agrawal@nxp.com; jerinj@marvell.com;
> Pavan Nikhilesh <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> aarch64
> 
> On Fri, Dec 20, 2019 at 12:02 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
> >
> > Hi Jerin,
> 
> Hi Gavin,
> 
> 
> > > > > > >
> > > > > > > The peripheral coherence order for a memory-mapped peripheral
> > > > > signifies the
> > > > > > > order in which accesses arrive at the endpoint.  For a read or a
> write
> > > > > RW1
> > > > > > > and a read or a write RW2 to the same peripheral, then RW1 will
> > > appear
> > > > > in
> > > > > > > the peripheral coherence order for the peripheral before RW2 if
> > > either
> > > > > of
> > > > > > > the following cases apply:
> > > > > > >  1. RW1 and RW2 are accesses using Non-cacheable or Device
> > > attributes
> > > > > and
> > > > > > >     RW1 is Ordered-before RW2.
> > > > > > >  2. RW1 and RW2 are accesses using Device-nGnRE or Device-
> > > nGnRnE
> > > > > attributes
> > > > > > >     and RW1 appears in program order before RW2.
> > > > > >
> > > > > >
> > > > > > This is true if RW1 and RW2 addresses are device memory. i.e the
> > > > > > registers in the  PCI bar address.
> > > > > > If RW1 is DDR address which is been used by the controller(say NIC
> > > > > > ring descriptor) then there will be an issue.
> > > > > > For example Intel i40e driver, the admin queue update in Host DDR
> > > > > > memory and it updates the doorbell.
> > > > > > In such a case, this patch will create an issue. Correct? Have you
> > > > > > checked this patch with ARM64 + XL710 controllers?
> > > >
> > > > This patch relaxes the rte_io_*mb barriers for pure PCI device memory
> > > accesses.
> > >
> > > Yes. This would break cases for mixed access fro i40e drivers.
> > >
> > > >
> > > > For mixed accesses of DDR and PCI device memory,
> rte_smp_*mb(DMB
> > > ISH) is not sufficient.
> > > > But rte_cio_*mb(DMB OSH) is sufficient and can be used.
> > >
> > > Yes. Let me share a bit of history.
> > >
> > > 1) There are a lot of drivers(initially developed in x86) that have
> > > mixed access and don't have any barriers as x86 does not need it.
> > > 2) rte_io introduced to fix that
> > > 3) Item (2) introduced the performance issues in the fast path as an
> > > optimization rte_cio_* introduced.
> > Exactly, this patch is to mitigate the performance issues introduced by
> rte_io('dsb' is too much and unnecessary here).
> > Rte_cio instead is definitely required for mixed access.
> > >
> > > So in the current of the scheme of things, we have APIs to FIX
> > > portability issue(rte_io) and performance issue(rte_cio).
> > > IMO, we may not need any change in infra code now. If you think, the
> > > documentation is missing then we can enhance it.
> > > If we make infra change then again drivers needs to be updated and
> tested.
> > No changes for rte_cio, the semantics, and definitions of rte_io does not
> change either, if limited the scope to PCI, which is the case in DPDK
> context(?).
> > The change lies only in the implementation, right?
> >
> > Just looked at the link you shared and found i40 driver is missing
> rte_cio_*mb in i40e_asq_send_command, but the old rte_io_*mb rescued.
> > Will submit a new patch in this series to used rte_cio together with new
> relaxed rte_io and do more tests.
> >
> > Yes, this is a big change, also a big optimization, for aarch64, in our tests it
> has very positive results.
> 
> It will be optimization only when if we are changing in the fast path.
> In the slow path, it does not matter.
> I think, the First step should be to use rte_cio_* wherever it is
> coherent memory used in _fast path_. I think, Almost every driver
> fixed that.
> 
> I am not against this patch(changing the slow path to use rte_cio*
> from rte_io* and virtio changes associated with that).
> If you are taking that patch, pay attention to all the drivers in the
> tree which is using rte_io* for mixed access in slowpath.
I see 30+ drivers has calling rte_io* directly or indirectly through rte_write/read*. 
It is hard for me to figure out all the mixed accesses in these drivers, and as you said, it makes no sense to change the _slow path_. 

How about we keep the old rte_io as is, and introduce 'fast path' version of rte_io for new code use? 
Then in future, we may merge the two? 
Another reason about this proposal is maybe there is rte_io calling in the fast path, but they are not mixed accesses and rte_cio is not suitable.

Any thoughts? 

> 
> > But as the case in i40e, we must pay attention to where rte_cio was
> missing but rescued by old rte_io(but not by new rte_io).
> >
> >
  
Jerin Jacob Dec. 23, 2019, 9:19 a.m. UTC | #8
On Mon, Dec 23, 2019 at 2:44 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
>
> Hi Jerin,

Hi Gavin,

>
> I think we are on the same page with regard to the problem, and the situations, thanks for illuminating the historical background of the two barriers.
> About the solution, I added inline comments.
> > It will be optimization only when if we are changing in the fast path.
> > In the slow path, it does not matter.
> > I think, the First step should be to use rte_cio_* wherever it is
> > coherent memory used in _fast path_. I think, Almost every driver
> > fixed that.
> >
> > I am not against this patch(changing the slow path to use rte_cio*
> > from rte_io* and virtio changes associated with that).
> > If you are taking that patch, pay attention to all the drivers in the
> > tree which is using rte_io* for mixed access in slowpath.
> I see 30+ drivers has calling rte_io* directly or indirectly through rte_write/read*.
> It is hard for me to figure out all the mixed accesses in these drivers, and as you said, it makes no sense to change the _slow path_.
>
> How about we keep the old rte_io as is, and introduce 'fast path' version of rte_io for new code use?
> Then in future, we may merge the two?
> Another reason about this proposal is maybe there is rte_io calling in the fast path, but they are not mixed accesses and rte_cio is not suitable.

Could you share more details about the case where fastpath + rte_io
needed + rte_cio is not suitable?

>
> Any thoughts?
>
> >
> > > But as the case in i40e, we must pay attention to where rte_cio was
> > missing but rescued by old rte_io(but not by new rte_io).
> > >
> > >
  
Gavin Hu Dec. 23, 2019, 10:16 a.m. UTC | #9
Hi Jerin,

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Monday, December 23, 2019 5:20 PM
> To: Gavin Hu <Gavin.Hu@arm.com>
> Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> <david.marchand@redhat.com>; thomas@monjalon.net;
> rasland@mellanox.com; maxime.coquelin@redhat.com;
> tiwei.bie@intel.com; hemant.agrawal@nxp.com; jerinj@marvell.com;
> Pavan Nikhilesh <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> aarch64
> 
> On Mon, Dec 23, 2019 at 2:44 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
> >
> > Hi Jerin,
> 
> Hi Gavin,
> 
> >
> > I think we are on the same page with regard to the problem, and the
> situations, thanks for illuminating the historical background of the two
> barriers.
> > About the solution, I added inline comments.
> > > It will be optimization only when if we are changing in the fast path.
> > > In the slow path, it does not matter.
> > > I think, the First step should be to use rte_cio_* wherever it is
> > > coherent memory used in _fast path_. I think, Almost every driver
> > > fixed that.
> > >
> > > I am not against this patch(changing the slow path to use rte_cio*
> > > from rte_io* and virtio changes associated with that).
> > > If you are taking that patch, pay attention to all the drivers in the
> > > tree which is using rte_io* for mixed access in slowpath.
> > I see 30+ drivers has calling rte_io* directly or indirectly through
> rte_write/read*.
> > It is hard for me to figure out all the mixed accesses in these drivers, and
> as you said, it makes no sense to change the _slow path_.
> >
> > How about we keep the old rte_io as is, and introduce 'fast path' version
> of rte_io for new code use?
> > Then in future, we may merge the two?
> > Another reason about this proposal is maybe there is rte_io calling in the
> fast path, but they are not mixed accesses and rte_cio is not suitable.
> 
> Could you share more details about the case where fastpath + rte_io
> needed + rte_cio is not suitable?

Here is an example for i40e, in the fast path, but only a pure io memory access. 
https://code.dpdk.org/dpdk/v19.11/source/drivers/net/i40e/i40e_rxtx.c#L1208

I wanted two variants of rte_io, because also x86 requires two as indicated here, one for no-WC and another for WC.
http://inbox.dpdk.org/dev/20191204151916.12607-1-xiaoyun.li@intel.com/T/#ea8bb1b4a378ab09baedbf95b4542bcb92f4a396f 
> 
> >
> > Any thoughts?
> >
> > >
> > > > But as the case in i40e, we must pay attention to where rte_cio was
> > > missing but rescued by old rte_io(but not by new rte_io).
> > > >
> > > >
  
Jerin Jacob Jan. 2, 2020, 9:51 a.m. UTC | #10
On Mon, Dec 23, 2019 at 3:46 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
>
> Hi Jerin,
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Monday, December 23, 2019 5:20 PM
> > To: Gavin Hu <Gavin.Hu@arm.com>
> > Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> > <david.marchand@redhat.com>; thomas@monjalon.net;
> > rasland@mellanox.com; maxime.coquelin@redhat.com;
> > tiwei.bie@intel.com; hemant.agrawal@nxp.com; jerinj@marvell.com;
> > Pavan Nikhilesh <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> > <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> > aarch64
> >
> > On Mon, Dec 23, 2019 at 2:44 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
> > >
> > > Hi Jerin,
> >
> > Hi Gavin,
> >
> > >
> > > I think we are on the same page with regard to the problem, and the
> > situations, thanks for illuminating the historical background of the two
> > barriers.
> > > About the solution, I added inline comments.
> > > > It will be optimization only when if we are changing in the fast path.
> > > > In the slow path, it does not matter.
> > > > I think, the First step should be to use rte_cio_* wherever it is
> > > > coherent memory used in _fast path_. I think, Almost every driver
> > > > fixed that.
> > > >
> > > > I am not against this patch(changing the slow path to use rte_cio*
> > > > from rte_io* and virtio changes associated with that).
> > > > If you are taking that patch, pay attention to all the drivers in the
> > > > tree which is using rte_io* for mixed access in slowpath.
> > > I see 30+ drivers has calling rte_io* directly or indirectly through
> > rte_write/read*.
> > > It is hard for me to figure out all the mixed accesses in these drivers, and
> > as you said, it makes no sense to change the _slow path_.
> > >
> > > How about we keep the old rte_io as is, and introduce 'fast path' version
> > of rte_io for new code use?
> > > Then in future, we may merge the two?
> > > Another reason about this proposal is maybe there is rte_io calling in the
> > fast path, but they are not mixed accesses and rte_cio is not suitable.
> >
> > Could you share more details about the case where fastpath + rte_io
> > needed + rte_cio is not suitable?
>
> Here is an example for i40e, in the fast path, but only a pure io memory access.
> https://code.dpdk.org/dpdk/v19.11/source/drivers/net/i40e/i40e_rxtx.c#L1208

Yes. That's a performance issue.

It could be changed to following for the fix that works on x86, arm64
with existing infra.

From:
I40E_PCI_REG_WRITE()

to:

rte_cio_wmb()
I40E_PCI_REG_WRITE_RELAXED()

>
> I wanted two variants of rte_io, because also x86 requires two as indicated here, one for no-WC and another for WC.
> http://inbox.dpdk.org/dev/20191204151916.12607-1-xiaoyun.li@intel.com/T/#ea8bb1b4a378ab09baedbf95b4542bcb92f4a396f
> >
> > >
> > > Any thoughts?
> > >
> > > >
> > > > > But as the case in i40e, we must pay attention to where rte_cio was
> > > > missing but rescued by old rte_io(but not by new rte_io).
> > > > >
> > > > >
  
Gavin Hu Jan. 3, 2020, 6:30 a.m. UTC | #11
Hi Jerin,

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, January 2, 2020 5:52 PM
> To: Gavin Hu <Gavin.Hu@arm.com>
> Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> <david.marchand@redhat.com>; thomas@monjalon.net;
> rasland@mellanox.com; maxime.coquelin@redhat.com; tiwei.bie@intel.com;
> hemant.agrawal@nxp.com; jerinj@marvell.com; Pavan Nikhilesh
> <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> aarch64
> 
> On Mon, Dec 23, 2019 at 3:46 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
> >
> > Hi Jerin,
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Monday, December 23, 2019 5:20 PM
> > > To: Gavin Hu <Gavin.Hu@arm.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> > > <david.marchand@redhat.com>; thomas@monjalon.net;
> > > rasland@mellanox.com; maxime.coquelin@redhat.com;
> > > tiwei.bie@intel.com; hemant.agrawal@nxp.com; jerinj@marvell.com;
> > > Pavan Nikhilesh <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> > > <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> > > <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> > > aarch64
> > >
> > > On Mon, Dec 23, 2019 at 2:44 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
> > > >
> > > > Hi Jerin,
> > >
> > > Hi Gavin,
> > >
> > > >
> > > > I think we are on the same page with regard to the problem, and the
> > > situations, thanks for illuminating the historical background of the two
> > > barriers.
> > > > About the solution, I added inline comments.
> > > > > It will be optimization only when if we are changing in the fast path.
> > > > > In the slow path, it does not matter.
> > > > > I think, the First step should be to use rte_cio_* wherever it is
> > > > > coherent memory used in _fast path_. I think, Almost every driver
> > > > > fixed that.
> > > > >
> > > > > I am not against this patch(changing the slow path to use rte_cio*
> > > > > from rte_io* and virtio changes associated with that).
> > > > > If you are taking that patch, pay attention to all the drivers in the
> > > > > tree which is using rte_io* for mixed access in slowpath.
> > > > I see 30+ drivers has calling rte_io* directly or indirectly through
> > > rte_write/read*.
> > > > It is hard for me to figure out all the mixed accesses in these drivers, and
> > > as you said, it makes no sense to change the _slow path_.
> > > >
> > > > How about we keep the old rte_io as is, and introduce 'fast path' version
> > > of rte_io for new code use?
> > > > Then in future, we may merge the two?
> > > > Another reason about this proposal is maybe there is rte_io calling in
> the
> > > fast path, but they are not mixed accesses and rte_cio is not suitable.
> > >
> > > Could you share more details about the case where fastpath + rte_io
> > > needed + rte_cio is not suitable?
> >
> > Here is an example for i40e, in the fast path, but only a pure io memory
> access.
> >
> https://code.dpdk.org/dpdk/v19.11/source/drivers/net/i40e/i40e_rxtx.c#L12
> 08
> 
> Yes. That's a performance issue.
> 
> It could be changed to following for the fix that works on x86, arm64
> with existing infra.
> 
> From:
> I40E_PCI_REG_WRITE()
> 
> to:
> 
> rte_cio_wmb()
> I40E_PCI_REG_WRITE_RELAXED()
Yes, this is correct, I will submit a new patch for this.
This is an example out of all the cases that I must fix before relaxing the rte_io barriers.
My plan is as follows, any comments are welcome!
1. replace rte_*mb and rte_io_*mb with rte_cio_*mb where applicable in the fastpath, this is an optimization, as the barriers are relaxed.
2. replace all the rte_io_*mb with rte_cio_*mb where applicable in the slowpath and control path
3. until *all* the occurrences in the step 1 and 2 are done, then this path can be re-activated.

Please advise if the above approach works from your viewpoint.
Maybe I will stop at step 1, step 2 and 3 are not necessary as they are not in the fastpath? 

> 
> >
> > I wanted two variants of rte_io, because also x86 requires two as indicated
> here, one for no-WC and another for WC.
> > http://inbox.dpdk.org/dev/20191204151916.12607-1-
> xiaoyun.li@intel.com/T/#ea8bb1b4a378ab09baedbf95b4542bcb92f4a396f
> > >
> > > >
> > > > Any thoughts?
> > > >
> > > > >
> > > > > > But as the case in i40e, we must pay attention to where rte_cio was
> > > > > missing but rescued by old rte_io(but not by new rte_io).
> > > > > >
> > > > > >
  
Jerin Jacob Jan. 3, 2020, 7:34 a.m. UTC | #12
On Fri, Jan 3, 2020 at 12:00 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
>
> Hi Jerin,

Hi Gavin,

>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, January 2, 2020 5:52 PM
> > To: Gavin Hu <Gavin.Hu@arm.com>
> > Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> > <david.marchand@redhat.com>; thomas@monjalon.net;
> > rasland@mellanox.com; maxime.coquelin@redhat.com; tiwei.bie@intel.com;
> > hemant.agrawal@nxp.com; jerinj@marvell.com; Pavan Nikhilesh
> > <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> > <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> > aarch64
> >
> > On Mon, Dec 23, 2019 at 3:46 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
> > >
> > > Hi Jerin,
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Monday, December 23, 2019 5:20 PM
> > > > To: Gavin Hu <Gavin.Hu@arm.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> > > > <david.marchand@redhat.com>; thomas@monjalon.net;
> > > > rasland@mellanox.com; maxime.coquelin@redhat.com;
> > > > tiwei.bie@intel.com; hemant.agrawal@nxp.com; jerinj@marvell.com;
> > > > Pavan Nikhilesh <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> > > > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> > > > <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> > > > <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> > > > aarch64
> > > >
> > > > On Mon, Dec 23, 2019 at 2:44 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
> > > > >
> > > > > Hi Jerin,
> > > >
> > > > Hi Gavin,
> > > >
> > > > >
> > > > > I think we are on the same page with regard to the problem, and the
> > > > situations, thanks for illuminating the historical background of the two
> > > > barriers.
> > > > > About the solution, I added inline comments.
> > > > > > It will be optimization only when if we are changing in the fast path.
> > > > > > In the slow path, it does not matter.
> > > > > > I think, the First step should be to use rte_cio_* wherever it is
> > > > > > coherent memory used in _fast path_. I think, Almost every driver
> > > > > > fixed that.
> > > > > >
> > > > > > I am not against this patch(changing the slow path to use rte_cio*
> > > > > > from rte_io* and virtio changes associated with that).
> > > > > > If you are taking that patch, pay attention to all the drivers in the
> > > > > > tree which is using rte_io* for mixed access in slowpath.
> > > > > I see 30+ drivers has calling rte_io* directly or indirectly through
> > > > rte_write/read*.
> > > > > It is hard for me to figure out all the mixed accesses in these drivers, and
> > > > as you said, it makes no sense to change the _slow path_.
> > > > >
> > > > > How about we keep the old rte_io as is, and introduce 'fast path' version
> > > > of rte_io for new code use?
> > > > > Then in future, we may merge the two?
> > > > > Another reason about this proposal is maybe there is rte_io calling in
> > the
> > > > fast path, but they are not mixed accesses and rte_cio is not suitable.
> > > >
> > > > Could you share more details about the case where fastpath + rte_io
> > > > needed + rte_cio is not suitable?
> > >
> > > Here is an example for i40e, in the fast path, but only a pure io memory
> > access.
> > >
> > https://code.dpdk.org/dpdk/v19.11/source/drivers/net/i40e/i40e_rxtx.c#L12
> > 08
> >
> > Yes. That's a performance issue.
> >
> > It could be changed to following for the fix that works on x86, arm64
> > with existing infra.
> >
> > From:
> > I40E_PCI_REG_WRITE()
> >
> > to:
> >
> > rte_cio_wmb()
> > I40E_PCI_REG_WRITE_RELAXED()
> Yes, this is correct, I will submit a new patch for this.
> This is an example out of all the cases that I must fix before relaxing the rte_io barriers.
> My plan is as follows, any comments are welcome!
> 1. replace rte_*mb and rte_io_*mb with rte_cio_*mb where applicable in the fastpath, this is an optimization, as the barriers are relaxed.
> 2. replace all the rte_io_*mb with rte_cio_*mb where applicable in the slowpath and control path
> 3. until *all* the occurrences in the step 1 and 2 are done, then this path can be re-activated.
>
> Please advise if the above approach works from your viewpoint.

I would prefer to have ONLY the step (1) and add a note for the same
in  https://doc.dpdk.org/guides/prog_guide/perf_opt_guidelines.html
as documentation reference.


> Maybe I will stop at step 1, step 2 and 3 are not necessary as they are not in the fastpath?

Yup.

>
> >
> > >
> > > I wanted two variants of rte_io, because also x86 requires two as indicated
> > here, one for no-WC and another for WC.
> > > http://inbox.dpdk.org/dev/20191204151916.12607-1-
> > xiaoyun.li@intel.com/T/#ea8bb1b4a378ab09baedbf95b4542bcb92f4a396f
> > > >
> > > > >
> > > > > Any thoughts?
> > > > >
> > > > > >
> > > > > > > But as the case in i40e, we must pay attention to where rte_cio was
> > > > > > missing but rescued by old rte_io(but not by new rte_io).
> > > > > > >
> > > > > > >
  
Gavin Hu Jan. 3, 2020, 9:12 a.m. UTC | #13
Hi Jerin,

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Friday, January 3, 2020 3:35 PM
> To: Gavin Hu <Gavin.Hu@arm.com>
> Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> <david.marchand@redhat.com>; thomas@monjalon.net;
> rasland@mellanox.com; maxime.coquelin@redhat.com; tiwei.bie@intel.com;
> hemant.agrawal@nxp.com; jerinj@marvell.com; Pavan Nikhilesh
> <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> aarch64
> 
> On Fri, Jan 3, 2020 at 12:00 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
> >
> > Hi Jerin,
> 
> Hi Gavin,
> 
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, January 2, 2020 5:52 PM
> > > To: Gavin Hu <Gavin.Hu@arm.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> > > <david.marchand@redhat.com>; thomas@monjalon.net;
> > > rasland@mellanox.com; maxime.coquelin@redhat.com;
> tiwei.bie@intel.com;
> > > hemant.agrawal@nxp.com; jerinj@marvell.com; Pavan Nikhilesh
> > > <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> > > <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> > > <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier for
> > > aarch64
> > >
> > > On Mon, Dec 23, 2019 at 3:46 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
> > > >
> > > > Hi Jerin,
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Monday, December 23, 2019 5:20 PM
> > > > > To: Gavin Hu <Gavin.Hu@arm.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> > > > > <david.marchand@redhat.com>; thomas@monjalon.net;
> > > > > rasland@mellanox.com; maxime.coquelin@redhat.com;
> > > > > tiwei.bie@intel.com; hemant.agrawal@nxp.com; jerinj@marvell.com;
> > > > > Pavan Nikhilesh <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> > > > > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> > > > > <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce
> Kong
> > > > > <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/arm64: relax the io barrier
> for
> > > > > aarch64
> > > > >
> > > > > On Mon, Dec 23, 2019 at 2:44 PM Gavin Hu <Gavin.Hu@arm.com>
> wrote:
> > > > > >
> > > > > > Hi Jerin,
> > > > >
> > > > > Hi Gavin,
> > > > >
> > > > > >
> > > > > > I think we are on the same page with regard to the problem, and the
> > > > > situations, thanks for illuminating the historical background of the two
> > > > > barriers.
> > > > > > About the solution, I added inline comments.
> > > > > > > It will be optimization only when if we are changing in the fast path.
> > > > > > > In the slow path, it does not matter.
> > > > > > > I think, the First step should be to use rte_cio_* wherever it is
> > > > > > > coherent memory used in _fast path_. I think, Almost every driver
> > > > > > > fixed that.
> > > > > > >
> > > > > > > I am not against this patch(changing the slow path to use rte_cio*
> > > > > > > from rte_io* and virtio changes associated with that).
> > > > > > > If you are taking that patch, pay attention to all the drivers in the
> > > > > > > tree which is using rte_io* for mixed access in slowpath.
> > > > > > I see 30+ drivers has calling rte_io* directly or indirectly through
> > > > > rte_write/read*.
> > > > > > It is hard for me to figure out all the mixed accesses in these drivers,
> and
> > > > > as you said, it makes no sense to change the _slow path_.
> > > > > >
> > > > > > How about we keep the old rte_io as is, and introduce 'fast path'
> version
> > > > > of rte_io for new code use?
> > > > > > Then in future, we may merge the two?
> > > > > > Another reason about this proposal is maybe there is rte_io calling in
> > > the
> > > > > fast path, but they are not mixed accesses and rte_cio is not suitable.
> > > > >
> > > > > Could you share more details about the case where fastpath + rte_io
> > > > > needed + rte_cio is not suitable?
> > > >
> > > > Here is an example for i40e, in the fast path, but only a pure io memory
> > > access.
> > > >
> > >
> https://code.dpdk.org/dpdk/v19.11/source/drivers/net/i40e/i40e_rxtx.c#L12
> > > 08
> > >
> > > Yes. That's a performance issue.
> > >
> > > It could be changed to following for the fix that works on x86, arm64
> > > with existing infra.
> > >
> > > From:
> > > I40E_PCI_REG_WRITE()
> > >
> > > to:
> > >
> > > rte_cio_wmb()
> > > I40E_PCI_REG_WRITE_RELAXED()
> > Yes, this is correct, I will submit a new patch for this.
> > This is an example out of all the cases that I must fix before relaxing the
> rte_io barriers.
> > My plan is as follows, any comments are welcome!
> > 1. replace rte_*mb and rte_io_*mb with rte_cio_*mb where applicable in
> the fastpath, this is an optimization, as the barriers are relaxed.
> > 2. replace all the rte_io_*mb with rte_cio_*mb where applicable in the
> slowpath and control path
> > 3. until *all* the occurrences in the step 1 and 2 are done, then this path can
> be re-activated.
> >
> > Please advise if the above approach works from your viewpoint.
> 
> I would prefer to have ONLY the step (1) and add a note for the same
> in  https://doc.dpdk.org/guides/prog_guide/perf_opt_guidelines.html
> as documentation reference.
Ok, good idea, I will submit a documentation patch for this. 
> 
> 
> > Maybe I will stop at step 1, step 2 and 3 are not necessary as they are not in
> the fastpath?
> 
> Yup.
> 
> >
> > >
> > > >
> > > > I wanted two variants of rte_io, because also x86 requires two as
> indicated
> > > here, one for no-WC and another for WC.
> > > > http://inbox.dpdk.org/dev/20191204151916.12607-1-
> > > xiaoyun.li@intel.com/T/#ea8bb1b4a378ab09baedbf95b4542bcb92f4a396f
> > > > >
> > > > > >
> > > > > > Any thoughts?
> > > > > >
> > > > > > >
> > > > > > > > But as the case in i40e, we must pay attention to where rte_cio
> was
> > > > > > > missing but rescued by old rte_io(but not by new rte_io).
> > > > > > > >
> > > > > > > >
  

Patch

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 859ae12..fd63956 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -34,11 +34,11 @@  extern "C" {
 
 #define rte_smp_rmb() dmb(ishld)
 
-#define rte_io_mb() rte_mb()
+#define rte_io_mb() rte_compiler_barrier()
 
-#define rte_io_wmb() rte_wmb()
+#define rte_io_wmb() rte_compiler_barrier()
 
-#define rte_io_rmb() rte_rmb()
+#define rte_io_rmb() rte_compiler_barrier()
 
 #define rte_cio_wmb() dmb(oshst)