[v3,1/3] eal/arm64: add 128-bit atomic compare exchange

Message ID 1561709503-11665-1-git-send-email-phil.yang@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series: [v3,1/3] eal/arm64: add 128-bit atomic compare exchange

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation fail Compilation issues
ci/mellanox-Performance-Testing success Performance Testing PASS
ci/intel-Performance-Testing success Performance Testing PASS

Commit Message

Phil Yang June 28, 2019, 8:11 a.m. UTC
  Add 128-bit atomic compare exchange on aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Tested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
v3:
1. Avoid code duplication with a macro. (Jerin Jacob)
2. Map invalid memory orders to the strongest barrier. (Jerin Jacob)
3. Update doc/guides/prog_guide/env_abstraction_layer.rst. (Gage Eads)
4. Fix 32-bit x86 build issue. (Gage Eads)
5. Correct documentation issues in UT. (Gage Eads)

 .../common/include/arch/arm/rte_atomic_64.h        | 165 +++++++++++++++++++++
 .../common/include/arch/x86/rte_atomic_64.h        |  12 --
 lib/librte_eal/common/include/generic/rte_atomic.h |  17 ++-
 3 files changed, 181 insertions(+), 13 deletions(-)
  

Comments

Jerin Jacob Kollanukkaran July 3, 2019, 12:25 p.m. UTC | #1
> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Friday, June 28, 2019 1:42 PM
> To: dev@dpdk.org
> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
> hemant.agrawal@nxp.com; Honnappa.Nagarahalli@arm.com;
> gavin.hu@arm.com; nd@arm.com; gage.eads@intel.com
> Subject: [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic compare exchange
> 
> Add 128-bit atomic compare exchange on aarch64.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Tested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> v3:
> 1. Avoid code duplication with a macro. (Jerin Jacob)
> 2. Map invalid memory orders to the strongest barrier. (Jerin Jacob)
> 3. Update doc/guides/prog_guide/env_abstraction_layer.rst. (Gage Eads)
> 4. Fix 32-bit x86 build issue. (Gage Eads)
> 5. Correct documentation issues in UT. (Gage Eads)
> 
>  .../common/include/arch/arm/rte_atomic_64.h        | 165 +++++++++++++++++++++
>  .../common/include/arch/x86/rte_atomic_64.h        |  12 --
>  lib/librte_eal/common/include/generic/rte_atomic.h |  17 ++-
>  3 files changed, 181 insertions(+), 13 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> index 97060e4..2080c4d 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2015 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
> 
>  #ifndef _RTE_ATOMIC_ARM64_H_
> @@ -14,6 +15,9 @@ extern "C" {
>  #endif
> 
>  #include "generic/rte_atomic.h"
> +#include <rte_branch_prediction.h>
> +#include <rte_compat.h>
> +#include <rte_debug.h>
> 
>  #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
>  #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
> @@ -40,6 +44,167 @@ extern "C" {
> 
>  #define rte_cio_rmb() dmb(oshld)
> 
> +/*------------------------ 128 bit atomic operations -------------------------*/
> +
> +#define RTE_HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
> +#define RTE_HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || \
> +			 (mo) == __ATOMIC_ACQ_REL || \
> +			 (mo) == __ATOMIC_SEQ_CST)
> +
> +#define RTE_MO_LOAD(mo)  (RTE_HAS_ACQ((mo)) \
> +		? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED)
> +#define RTE_MO_STORE(mo) (RTE_HAS_RLS((mo)) \
> +		? __ATOMIC_RELEASE : __ATOMIC_RELAXED)
> +
> +#ifdef __ARM_FEATURE_ATOMICS
> +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)                          \
> +static inline rte_int128_t                                                  \
> +cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
> +		rte_int128_t updated)                                               \
> +{                                                                           \
> +	/* caspX instructions register pair must start from even-numbered
> +	 * register at operand 1.
> +	 * So, specify registers for local variables here.
> +	 */                                                                     \
> +	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];                \

I understand the CASP limitation that the register pair has to be even and odd.
Is there any way to remove the explicit x0 register allocation and let the
compiler decide the registers? For some reason, with optimize(O3) gcc
generates it correctly but clang does not.

Hardcoding specific registers prevents the compiler from optimizing,
especially since it is an inline function.
  
Jerin Jacob Kollanukkaran July 3, 2019, 1:07 p.m. UTC | #2
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran
> Sent: Wednesday, July 3, 2019 5:56 PM
> To: Phil Yang <phil.yang@arm.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; nd@arm.com;
> gage.eads@intel.com
> Subject: RE: [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic compare
> exchange
> 
<snip>
> > +#ifdef __ARM_FEATURE_ATOMICS
> > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)                          \
> > +static inline rte_int128_t                                                  \
> > +cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
> > +		rte_int128_t updated)                                               \
> > +{                                                                           \
> > +	/* caspX instructions register pair must start from even-numbered
> > +	 * register at operand 1.
> > +	 * So, specify registers for local variables here.
> > +	 */                                                                     \
> > +	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];                \
> 
> I understand the CASP limitation that the register pair has to be even and
> odd. Is there any way to remove the explicit x0 register allocation and let
> the compiler decide the registers? For some reason, with optimize(O3) gcc
> generates it correctly but clang does not.
> 
> Hardcoding specific registers prevents the compiler from optimizing,
> especially since it is an inline function.

It looks like this limitation was fixed recently in gcc:
https://patches.linaro.org/patch/147991/

Not sure about old gcc and clang. ARM compiler experts may know the exact status.
  
Honnappa Nagarahalli July 5, 2019, 4:20 a.m. UTC | #3
<snip>

> > > +#ifdef __ARM_FEATURE_ATOMICS
> > > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)                          \
> > > +static inline rte_int128_t                                                  \
> > > +cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
> > > +		rte_int128_t updated)                                               \
> > > +{                                                                           \
> > > +	/* caspX instructions register pair must start from even-numbered
> > > +	 * register at operand 1.
> > > +	 * So, specify registers for local variables here.
> > > +	 */                                                                     \
> > > +	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];                \
> >
> > I understand the CASP limitation that the register pair has to be even and
> > odd. Is there any way to remove the explicit x0 register allocation and let
> > the compiler decide the registers? For some reason, with optimize(O3) gcc
> > generates it correctly but clang does not.
> >
> > Hardcoding specific registers prevents the compiler from optimizing,
> > especially since it is an inline function.
> 
> It looks like this limitation was fixed recently in gcc:
> https://patches.linaro.org/patch/147991/
> 
> Not sure about old gcc and clang. ARM compiler experts may know the exact
> status.
> 
We could use syntax as follows; an example is in [1]:

static inline rte_int128_t
__rte_casp(rte_int128_t *dst, rte_int128_t old, rte_int128_t updated, int mo)
{
	__asm__ volatile("caspl %0, %H0, %1, %H1, [%2]"
			 : "+r" (old)
			 : "r" (updated), "r" (dst)
			 : "memory");
	return old;
}

[1] https://godbolt.org/z/EUJnuG
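
As a side note, the mo argument in the sketch above is not wired up yet;
dispatching on it would presumably look something like the following (the
four per-order helpers are hypothetical names, one per casp/caspa/caspl/caspal
mnemonic):

/* Hypothetical dispatch on the memory-order argument. */
static inline rte_int128_t
__rte_casp(rte_int128_t *dst, rte_int128_t old, rte_int128_t updated, int mo)
{
	if (mo == __ATOMIC_RELAXED)
		return __rte_casp_relaxed(dst, old, updated);	/* casp */
	if (mo == __ATOMIC_ACQUIRE)
		return __rte_casp_acquire(dst, old, updated);	/* caspa */
	if (mo == __ATOMIC_RELEASE)
		return __rte_casp_release(dst, old, updated);	/* caspl */
	return __rte_casp_acq_rel(dst, old, updated);		/* caspal */
}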
  
Pavan Nikhilesh Bhagavatula July 5, 2019, 4:37 a.m. UTC | #4
>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Honnappa
>Nagarahalli
>Sent: Friday, July 5, 2019 9:51 AM
>To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Phil Yang (Arm
>Technology China) <Phil.Yang@arm.com>; dev@dpdk.org
>Cc: thomas@monjalon.net; hemant.agrawal@nxp.com; Gavin Hu (Arm
>Technology China) <Gavin.Hu@arm.com>; Honnappa Nagarahalli
><Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>;
>gage.eads@intel.com; nd <nd@arm.com>
>Subject: Re: [dpdk-dev] [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit
>atomic compare exchange
>
><snip>
>
>We could use syntax as follows; an example is in [1]:
>
>static inline rte_int128_t
>__rte_casp(rte_int128_t *dst, rte_int128_t old, rte_int128_t updated, int mo)
>{
>	__asm__ volatile("caspl %0, %H0, %1, %H1, [%2]"
>			 : "+r" (old)
>			 : "r" (updated), "r" (dst)
>			 : "memory");
>	return old;
>}

We have used this format for mempool/octeontx2 but clang wasn't too happy.

dpdk/drivers/mempool/octeontx2/otx2_mempool_ops.c:151:15: error: value size does not match register size specified by the constraint and modifier [-Werror,-Wasm-operand-widths]
                [t0] "=&r" (t0), [t1] "=&r" (t1), [t2] "=&r" (t2),
                            ^
dpdk/drivers/mempool/octeontx2/otx2_mempool_ops.c:82:9: note: use constraint modifier "w"
                "casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"

We had to change it to hand-coded asm:

http://patches.dpdk.org/patch/56110/

>
>[1] https://godbolt.org/z/EUJnuG
  
Phil Yang July 9, 2019, 9:27 a.m. UTC | #5
> -----Original Message-----
> From: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>
> Sent: Friday, July 5, 2019 12:37 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> jerinj@marvell.com; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; hemant.agrawal@nxp.com; Gavin Hu (Arm
> Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com>;
> gage.eads@intel.com; nd <nd@arm.com>
> Subject: RE: [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic compare
> exchange
> 
> 

<snip>

> 
> We have used this format for mempool/octeontx2 but clang wasn't too
> happy.
> 
> dpdk/drivers/mempool/octeontx2/otx2_mempool_ops.c:151:15: error:
> value size does not match register size specified by the constraint and
> modifier [-Werror,-Wasm-operand-widths]
>                 [t0] "=&r" (t0), [t1] "=&r" (t1), [t2] "=&r" (t2),
>                             ^
> dpdk/drivers/mempool/octeontx2/otx2_mempool_ops.c:82:9: note: use
> constraint modifier "w"
>                 "casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"
> 
> We had to change it to hand-coded asm:
> 
> http://patches.dpdk.org/patch/56110/

Hi Jerin,

The update from the compiler team is 'the LSE CASP fix has not been backported to older GCC branches'.
So, currently, this seems to be the only approach that works for all versions of GCC and Clang.
I think we can add another optimization patch for this once all the compilers are ready.

Thanks,
Phil
> 
> >
> >[1] https://godbolt.org/z/EUJnuG
  
Jerin Jacob Kollanukkaran July 9, 2019, 11:14 a.m. UTC | #6
> -----Original Message-----
> From: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>
> Sent: Tuesday, July 9, 2019 2:58 PM
> To: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Jerin Jacob Kollanukkaran
> <jerinj@marvell.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; hemant.agrawal@nxp.com; Gavin Hu (Arm
> Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com>;
> gage.eads@intel.com; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic compare
> exchange
> 
> > -----Original Message-----
> > From: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>
> > Sent: Friday, July 5, 2019 12:37 PM
> > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > jerinj@marvell.com; Phil Yang (Arm Technology China)
> > <Phil.Yang@arm.com>; dev@dpdk.org
> > Cc: thomas@monjalon.net; hemant.agrawal@nxp.com; Gavin Hu (Arm
> > Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com>;
> > gage.eads@intel.com; nd <nd@arm.com>
> > Subject: RE: [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic
> > compare exchange
> >
> >
> 
> <snip>
> 
> 
> Hi Jerin,
> 
> The update from the compiler team is 'the LSE CASP fix has not been
> backported to older GCC branches'.
> So, currently, this seems to be the only approach that works for all versions
> of GCC and Clang.
> I think we can add another optimization patch for this once all the compilers
> are ready.

We are on the same page.


> 
> Thanks,
> Phil
> >
> > >
> > >[1] https://godbolt.org/z/EUJnuG
  
Jerin Jacob Kollanukkaran July 19, 2019, 6:24 a.m. UTC | #7
> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Friday, June 28, 2019 1:42 PM
> To: dev@dpdk.org
> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
> hemant.agrawal@nxp.com; Honnappa.Nagarahalli@arm.com;
> gavin.hu@arm.com; nd@arm.com; gage.eads@intel.com
> Subject: [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic compare
> exchange
> 
> External Email
> 
> ----------------------------------------------------------------------
> Add 128-bit atomic compare exchange on aarch64.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Tested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> +#define RTE_HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
> +#define RTE_HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || \
> +			 (mo) == __ATOMIC_ACQ_REL || \
> +			 (mo) == __ATOMIC_SEQ_CST)
> +
> +#define RTE_MO_LOAD(mo)  (RTE_HAS_ACQ((mo)) \
> +		? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED)
> +#define RTE_MO_STORE(mo) (RTE_HAS_RLS((mo)) \
> +		? __ATOMIC_RELEASE : __ATOMIC_RELAXED)
> +

Symbols that start with RTE_ are public. If it is generic enough,
move it to the common layer so that every architecture can use it.
Otherwise, make it internal.



> +#ifdef __ARM_FEATURE_ATOMICS

This define was added in gcc 9.1, and I believe clang does not support it yet.
So with old gcc and clang this will be undefined.
I think with meson + native build, we can detect the presence of
ATOMICS support by running a.out. Not sure about the make and cross-build cases.
I don't want to block this feature because of this; IMO, we can add this code
with the existing __ARM_FEATURE_ATOMICS scheme and later find a method
to enhance it. But please check how to fix it.
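
For illustration, the run-time probe described above could be as small as
this (a sketch; how it plugs into the make/meson builds is the open question):

/* Build-time probe sketch: exit code 0 means the toolchain advertises
 * LSE atomics via __ARM_FEATURE_ATOMICS.
 */
int main(void)
{
#ifdef __ARM_FEATURE_ATOMICS
	return 0;
#else
	return 1;
#endif
}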

> +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)                          \
> +static inline rte_int128_t                                                  \
> +cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
> +		rte_int128_t updated)                                               \
> +{                                                                           \
> +	/* caspX instructions register pair must start from even-numbered
> +	 * register at operand 1.
> +	 * So, specify registers for local variables here.
> +	 */                                                                     \
> +	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];                \

Since the x0 register is used directly in the code, and cas_op_name() and
rte_atomic128_cmp_exchange() are inline functions, depending on the parent
function's register usage we may corrupt the x0 register, i.e. break the
arm64 ABI. Not sure whether the clobber list will help here or not.
Making it noinline would help, but I am not sure about the performance impact;
maybe you can check with the compiler team.

We burned our hands with this scheme, see
5b40ec6b966260e0ff66a8a2c689664f75d6a0e6 ("mempool/octeontx2: fix possible arm64 ABI break")

Probably we can choose a scheme for rc2 and adjust when we have complete clarity.

> +	register uint64_t x1 __asm("x1") = (uint64_t)old.val[1];                \
> +	register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0];            \
> +	register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1];            \
> +	asm volatile(                                                           \
> +			op_string " %[old0], %[old1], %[upd0], %[upd1], [%[dst]]"       \
> +			: [old0] "+r" (x0),                                             \
> +			  [old1] "+r" (x1)                                              \
> +			: [upd0] "r" (x2),                                              \
> +			  [upd1] "r" (x3),                                              \
> +			  [dst] "r" (dst)                                               \
> +			: "memory");                                                    \

Shouldn't we add x0, x1, x2, x3 to the clobber list?


>  static inline int __rte_experimental
>  rte_atomic128_cmp_exchange(rte_int128_t *dst,
>  			   rte_int128_t *exp,
> diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h
> b/lib/librte_eal/common/include/generic/rte_atomic.h
> index 9958543..2355e50 100644
> --- a/lib/librte_eal/common/include/generic/rte_atomic.h
> +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
> @@ -1081,6 +1081,20 @@ static inline void
> rte_atomic64_clear(rte_atomic64_t *v)
> 
>  /*------------------------ 128 bit atomic operations -------------------------*/
> 
> +#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_ARM64)

There is nothing specific to x86 and arm64 here. Can we remove this #ifdef?

> +/**
> + * 128-bit integer structure.
> + */
> +RTE_STD_C11
> +typedef struct {
> +	RTE_STD_C11
> +	union {
> +		uint64_t val[2];
> +		__extension__ __int128 int128;
> +	};
> +} __rte_aligned(16) rte_int128_t;
> +#endif
> +
>  #ifdef __DOXYGEN__
  
Phil Yang July 19, 2019, 11:01 a.m. UTC | #8
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Friday, July 19, 2019 2:25 PM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; hemant.agrawal@nxp.com; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Gavin Hu (Arm Technology
> China) <Gavin.Hu@arm.com>; nd <nd@arm.com>; gage.eads@intel.com
> Subject: RE: [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic compare
> exchange
> 
> > -----Original Message-----
> > From: Phil Yang <phil.yang@arm.com>
> > Sent: Friday, June 28, 2019 1:42 PM
> > To: dev@dpdk.org
> > Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran
> <jerinj@marvell.com>;
> > hemant.agrawal@nxp.com; Honnappa.Nagarahalli@arm.com;
> > gavin.hu@arm.com; nd@arm.com; gage.eads@intel.com
> > Subject: [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic compare
> > exchange
> >
> > External Email
> >
> > ----------------------------------------------------------------------
> > Add 128-bit atomic compare exchange on aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Tested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> > +#define RTE_HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
> > +#define RTE_HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || \
> > +			 (mo) == __ATOMIC_ACQ_REL || \
> > +			 (mo) == __ATOMIC_SEQ_CST)
> > +
> > +#define RTE_MO_LOAD(mo)  (RTE_HAS_ACQ((mo)) \
> > +		? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED)
> > +#define RTE_MO_STORE(mo) (RTE_HAS_RLS((mo)) \
> > +		? __ATOMIC_RELEASE : __ATOMIC_RELAXED)
> > +
> 
> Symbols that start with RTE_ are public. If it is generic enough,
> move it to the common layer so that every architecture can use it.
> Otherwise, make it internal.

Let's keep it internal. I will remove the 'RTE_' tag. 

> 
> 
> 
> > +#ifdef __ARM_FEATURE_ATOMICS
> 
> This define was added in gcc 9.1, and I believe clang does not support it yet.
> So with old gcc and clang this will be undefined.
> I think with meson + native build, we can detect the presence of
> ATOMICS support by running a.out. Not sure about the make and cross-build cases.
> I don't want to block this feature because of this; IMO, we can add this code
> with the existing __ARM_FEATURE_ATOMICS scheme and later find a method
> to enhance it. But please check how to fix it.

OK.

> 
> > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)                          \
> > +static inline rte_int128_t                                                  \
> > +cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
> > +		rte_int128_t updated)                                               \
> > +{                                                                           \
> > +	/* caspX instructions register pair must start from even-numbered
> > +	 * register at operand 1.
> > +	 * So, specify registers for local variables here.
> > +	 */                                                                     \
> > +	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];                \
> 
> Since the x0 register is used directly in the code, and cas_op_name() and
> rte_atomic128_cmp_exchange() are inline functions, based on the parent
> function's register usage we may corrupt the x0 register, i.e.

x0/x1 and x2/x3 are used a lot and often contain live values.
Maybe changing them to some relatively less frequently used registers like x14/x15 and x16/x17 would help in this case?
According to the PCS (Procedure Call Standard), x14-x17 are also temporary registers.

> break the arm64 ABI. Not sure whether the clobber list will help here or not.

In my understanding, for a register variable, if the specified register contains a live value, the compiler will move the live value into a free register.
x0~x3 are present in the input/output operands, and x0/x1's value needs to be restored to the variable 'old' as a return value,
so I didn't add them to the clobber list.
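
To illustrate the rule (a generic sketch, not code from this patch): a
register bound to an operand is not listed as a clobber, while a register
scratched only inside the asm body is:

static inline uint64_t
add_one(uint64_t v)
{
	/* x0 is tied to an input/output operand, so it must not appear in
	 * the clobber list; x9 is scratched inside the asm body, so it must.
	 */
	register uint64_t x0 __asm("x0") = v;

	asm volatile("mov x9, #1\n\t"
		     "add %0, %0, x9"
		     : "+r" (x0)
		     :
		     : "x9");
	return x0;
}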

> Making it noinline would help, but I am not sure about the performance
> impact; maybe you can check with the compiler team.
> 
> We burned our hands with this scheme, see
> 5b40ec6b966260e0ff66a8a2c689664f75d6a0e6 ("mempool/octeontx2: fix
> possible arm64 ABI break")
> 
> Probably we can choose a scheme for rc2 and adjust when we have
> complete clarity.
> 
> > +	register uint64_t x1 __asm("x1") = (uint64_t)old.val[1];                \
> > +	register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0];            \
> > +	register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1];            \
> > +	asm volatile(                                                           \
> > +			op_string " %[old0], %[old1], %[upd0], %[upd1], [%[dst]]"       \
> > +			: [old0] "+r" (x0),                                             \
> > +			  [old1] "+r" (x1)                                              \
> > +			: [upd0] "r" (x2),                                              \
> > +			  [upd1] "r" (x3),                                              \
> > +			  [dst] "r" (dst)                                               \
> > +			: "memory");                                                    \
> 
> Should n't we add x0,x1, x2, x3 in clobber list?

Same as above.

> 
> 
> >  static inline int __rte_experimental
> >  rte_atomic128_cmp_exchange(rte_int128_t *dst,
> >  			   rte_int128_t *exp,
> > diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h
> > b/lib/librte_eal/common/include/generic/rte_atomic.h
> > index 9958543..2355e50 100644
> > --- a/lib/librte_eal/common/include/generic/rte_atomic.h
> > +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
> > @@ -1081,6 +1081,20 @@ static inline void
> > rte_atomic64_clear(rte_atomic64_t *v)
> >
> >  /*------------------------ 128 bit atomic operations -------------------------*/
> >
> > +#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_ARM64)
> 
> There is nothing specific to x86 and arm64 here. Can we remove this #ifdef?

Without this constraint, it will break 32-bit x86 builds.
http://mails.dpdk.org/archives/test-report/2019-June/086586.html 

> 
> > +/**
> > + * 128-bit integer structure.
> > + */
> > +RTE_STD_C11
> > +typedef struct {
> > +	RTE_STD_C11
> > +	union {
> > +		uint64_t val[2];
> > +		__extension__ __int128 int128;
> > +	};
> > +} __rte_aligned(16) rte_int128_t;
> > +#endif
> > +
> >  #ifdef __DOXYGEN__
  
Jerin Jacob Kollanukkaran July 19, 2019, 12:35 p.m. UTC | #9
> > > +#define RTE_HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
> > > +#define RTE_HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || \
> > > +			 (mo) == __ATOMIC_ACQ_REL || \
> > > +			 (mo) == __ATOMIC_SEQ_CST)
> > > +
> > > +#define RTE_MO_LOAD(mo)  (RTE_HAS_ACQ((mo)) \
> > > +		? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED)
> > > +#define RTE_MO_STORE(mo) (RTE_HAS_RLS((mo)) \
> > > +		? __ATOMIC_RELEASE : __ATOMIC_RELAXED)
> > > +
> >
> > Symbols that start with RTE_ are public. If it is generic enough,
> > move it to the common layer so that every architecture can use it.
> > Otherwise, make it internal.
> 
> Let's keep it internal. I will remove the 'RTE_' tag.

Probably change it to __HAS_ACQ to avoid collisions (just in case).

> >
> >
> >
> > > +#ifdef __ARM_FEATURE_ATOMICS
> >
> > This define was added in gcc 9.1, and I believe clang does not support it yet.
> > So with old gcc and clang this will be undefined.
> > I think with meson + native build, we can detect the presence of
> > ATOMICS support by running a.out. Not sure about the make and cross-build cases.
> > I don't want to block this feature because of this; IMO, we can add this
> > code with the existing __ARM_FEATURE_ATOMICS scheme and later find a
> > method to enhance it. But please check how to fix it.
> 
> OK.

After thinking on this a bit: in order to support old gcc (< gcc 9.1) and clang,
we can introduce a config option that is disabled by default and enabled in
specific configs (where we know LSE is supported) and in the meson config.

i.e.
#if defined(__ARM_FEATURE_ATOMICS) || defined(RTE_ARM_FEATURE_ATOMICS)


> 
> >
> > > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)                          \
> > > +static inline rte_int128_t                                                  \
> > > +cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
> > > +		rte_int128_t updated)                                               \
> > > +{                                                                           \
> > > +	/* caspX instructions register pair must start from even-numbered
> > > +	 * register at operand 1.
> > > +	 * So, specify registers for local variables here.
> > > +	 */                                                                     \
> > > +	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];                \
> >
> > Since the x0 register is used directly in the code, and cas_op_name() and
> > rte_atomic128_cmp_exchange() are inline functions, based on the parent
> > function's register usage we may corrupt the x0 register, i.e.
> 
> x0/x1 and x2/x3 are used a lot and often contain live values.
> Maybe changing them to some relatively less frequently used registers like
> x14/x15 and x16/x17 would help in this case?
> According to the PCS (Procedure Call Standard), x14-x17 are also temporary
> registers.

x14-x17 are temporary registers, but since
cas_op_name() and rte_atomic128_cmp_exchange() are inline functions,
based on the parent function's register usage they _may_ still be corrupted.


> 
> > break the arm64 ABI. Not sure whether the clobber list will help here or not.
> 
> In my understanding, for a register variable, if the specified register
> contains a live value, the compiler will move the live value into a free
> register. x0~x3 are present in the input/output operands, and x0/x1's value
> needs to be restored to the variable 'old' as a return value,
> so I didn't add them to the clobber list.

OK

> 
> > Making it as no_inline will help but not sure about the performance impact.
> > May be you can check with compiler team.
> >
> > We burned our hands with this scheme, see
> > 5b40ec6b966260e0ff66a8a2c689664f75d6a0e6 ("mempool/octeontx2: fix
> > possible arm64 ABI break")
> >
> > Probably we can choose a scheme for rc2 and adjust as when we have
> > complete clarity.
> >
> > >
> > > +#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_ARM64)
> >
> > There is nothing specific to x86 and arm64 here. Can we remove this #ifdef?
> 
> Without this constraint, it will break 32-bit x86 builds.
> http://mails.dpdk.org/archives/test-report/2019-June/086586.html

OK. #ifdef RTE_ARCH_64 would help then.

> 
> >
> > > +/**
> > > + * 128-bit integer structure.
> > > + */
> > > +RTE_STD_C11
> > > +typedef struct {
> > > +	RTE_STD_C11
> > > +	union {
> > > +		uint64_t val[2];
> > > +		__extension__ __int128 int128;

Instead of guarding the complete structure with RTE_ARCH_64,
how about guarding only the __int128 member:
#ifdef RTE_ARCH_64
__extension__ __int128 int128;
#endif
so that rte_int128_t will be available for 32-bit as well.
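
For reference, the struct with that guard applied would look like this (a
sketch based on the definition in this patch):

RTE_STD_C11
typedef struct {
	RTE_STD_C11
	union {
		uint64_t val[2];
#ifdef RTE_ARCH_64
		/* 128-bit access is only exposed on 64-bit targets */
		__extension__ __int128 int128;
#endif
	};
} __rte_aligned(16) rte_int128_t;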


> > > +	};
> > > +} __rte_aligned(16) rte_int128_t;
> > > +#endif
> > > +
> > >  #ifdef __DOXYGEN__
  
Phil Yang July 19, 2019, 1:56 p.m. UTC | #10
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Friday, July 19, 2019 8:35 PM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; hemant.agrawal@nxp.com; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Gavin Hu (Arm Technology
> China) <Gavin.Hu@arm.com>; nd <nd@arm.com>; gage.eads@intel.com; nd
> <nd@arm.com>
> Subject: RE: [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic compare
> exchange
> 
> > > > +#define RTE_HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
> > > > +#define RTE_HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || \
> > > > +			 (mo) == __ATOMIC_ACQ_REL || \
> > > > +			 (mo) == __ATOMIC_SEQ_CST)
> > > > +
> > > > +#define RTE_MO_LOAD(mo)  (RTE_HAS_ACQ((mo)) \
> > > > +		? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED)
> > > > +#define RTE_MO_STORE(mo) (RTE_HAS_RLS((mo)) \
> > > > +		? __ATOMIC_RELEASE : __ATOMIC_RELAXED)
> > > > +
> > >
> > > Symbols that start with RTE_ are public. If it is generic enough,
> > > move it to the common layer so that every architecture can use it.
> > > Otherwise, make it internal.
> >
> > Let's keep it internal. I will remove the 'RTE_' tag.
> 
> Probably change it to __HAS_ACQ to avoid collisions (just in case).

OK.

> 
> > >
> > >
> > >
> > > > +#ifdef __ARM_FEATURE_ATOMICS
> > >
> > > This define was added in gcc 9.1, and I believe clang does not support it
> > > yet. So with old gcc and clang this will be undefined.
> > > I think with meson + native build, we can detect the presence of
> > > ATOMICS support by running a.out. Not sure about the make and cross-build
> > > cases.
> > > I don't want to block this feature because of this; IMO, we can add this
> > > code with the existing __ARM_FEATURE_ATOMICS scheme and later find a
> > > method to enhance it. But please check how to fix it.
> >
> > OK.
> 
> After thinking on this a bit: in order to support old gcc (< gcc 9.1) and
> clang, we can introduce a config option that is disabled by default and
> enabled in specific configs (where we know LSE is supported) and in the
> meson config.
> 
> i.e.
> #if defined(__ARM_FEATURE_ATOMICS) || defined(RTE_ARM_FEATURE_ATOMICS)

Cool

> 
> 
> >
> > >
> > > > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)                          \
> > > > +static inline rte_int128_t                                                  \
> > > > +cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
> > > > +		rte_int128_t updated)                                               \
> > > > +{                                                                           \
> > > > +	/* caspX instructions register pair must start from even-numbered
> > > > +	 * register at operand 1.
> > > > +	 * So, specify registers for local variables here.
> > > > +	 */                                                                     \
> > > > +	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];                \
> > >
> > > Since the x0 register is used directly in the code, and cas_op_name() and
> > > rte_atomic128_cmp_exchange() are inline functions, based on the parent
> > > function's register usage we may corrupt the x0 register, i.e.
> >
> > x0/x1 and x2/x3 are used a lot and often contain live values.
> > Maybe changing them to some relatively less frequently used registers like
> > x14/x15 and x16/x17 would help in this case?
> > According to the PCS (Procedure Call Standard), x14-x17 are also temporary
> > registers.
> 
> x14-x17 are temporary registers, but since
> cas_op_name() and rte_atomic128_cmp_exchange() are inline functions,
> based on the parent function's register usage they _may_ still be corrupted.

Just checked how the Linux kernel does similar things:
https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/atomic_lse.h#L19

Same method.

I will finish benchmarking the noinline approach. If there is no significant performance loss, I think we can make it noinline.
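
A sketch of the noinline variant being considered (hypothetical; the code
actually benchmarked may differ):

/* Out-of-line variant: the explicit x0-x3 bindings then live in their
 * own call frame instead of being inlined into callers.
 */
static __attribute__((noinline)) rte_int128_t
__rte_cas_relaxed(rte_int128_t *dst, rte_int128_t old, rte_int128_t updated)
{
	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];
	register uint64_t x1 __asm("x1") = (uint64_t)old.val[1];
	register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0];
	register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1];

	asm volatile("casp %[old0], %[old1], %[upd0], %[upd1], [%[dst]]"
			: [old0] "+r" (x0), [old1] "+r" (x1)
			: [upd0] "r" (x2), [upd1] "r" (x3), [dst] "r" (dst)
			: "memory");
	old.val[0] = x0;
	old.val[1] = x1;
	return old;
}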

> 
> 
> >
> > > break the arm64 ABI. Not sure whether the clobber list will help here or not.
> >
> > In my understanding, for a register variable, if the specified register
> > contains a live value, the compiler will move the live value into a free
> > register. x0~x3 are present in the input/output operands, and x0/x1's value
> > needs to be restored to the variable 'old' as a return value,
> > so I didn't add them to the clobber list.
> 
> OK
> 
> >
> > > Making it noinline would help, but I am not sure about the performance
> > > impact; maybe you can check with the compiler team.
> > >
> > > We burned our hands with this scheme, see
> > > 5b40ec6b966260e0ff66a8a2c689664f75d6a0e6 ("mempool/octeontx2: fix
> > > possible arm64 ABI break")
> > >
> > > Probably we can choose a scheme for rc2 and adjust when we have
> > > complete clarity.
> > >
> > > >
> > > > +#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_ARM64)
> > >
> > > There is nothing specific to x86 and arm64 here. Can we remove this
> > > #ifdef?
> >
> > Without this constraint, it will break 32-bit x86 builds.
> > http://mails.dpdk.org/archives/test-report/2019-June/086586.html
> 
> OK. #ifdef RTE_ARCH_64 would help then.

OK.

> 
> >
> > >
> > > > +/**
> > > > + * 128-bit integer structure.
> > > > + */
> > > > +RTE_STD_C11
> > > > +typedef struct {
> > > > +	RTE_STD_C11
> > > > +	union {
> > > > +		uint64_t val[2];
> > > > +		__extension__ __int128 int128;
> 
> Instead of guarding the complete structure with RTE_ARCH_64,
> how about guarding only the __int128 member:
> #ifdef RTE_ARCH_64
> __extension__ __int128 int128;
> #endif
> so that rte_int128_t will be available for 32-bit as well.

Agreed, it should work. But I am not sure.

Hi Gage,

What do you think about this?

> 
> 
> > > > +	};
> > > > +} __rte_aligned(16) rte_int128_t;
> > > > +#endif
> > > > +
> > > >  #ifdef __DOXYGEN__
  
Eads, Gage July 19, 2019, 2:50 p.m. UTC | #11
> > > > > +/**
> > > > > + * 128-bit integer structure.
> > > > > + */
> > > > > +RTE_STD_C11
> > > > > +typedef struct {
> > > > > +	RTE_STD_C11
> > > > > +	union {
> > > > > +		uint64_t val[2];
> > > > > +		__extension__ __int128 int128;
> >
> > Instead of guarding the complete structure with RTE_ARCH_64, how about
> > guarding only the __int128 member (#ifdef RTE_ARCH_64 around
> > __extension__ __int128 int128;), so that rte_int128_t will be available
> > for 32-bit as well.
> 
> Agreed, it should work. But I am not sure.
> 
> Hi Gage,
> 
> What do you think about this?
> 

I don't see any harm in that.
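
As a usage note before the patch itself: a minimal caller sketch for the API
under review, based on the signature in the diff below (the update_pair()
wrapper and its {pointer, counter} field meanings are illustrative):

#include <rte_atomic.h>

/* Atomically replace a 128-bit {pointer, counter} pair stored in 'slot'. */
static void
update_pair(rte_int128_t *slot, uint64_t new_ptr, uint64_t new_cnt)
{
	rte_int128_t exp = *slot;	/* snapshot; refreshed on CAS failure */
	rte_int128_t src;

	src.val[0] = new_ptr;
	src.val[1] = new_cnt;

	/* Retry until the CAS succeeds; on failure 'exp' is reloaded with
	 * the value read atomically from 'slot'.
	 */
	while (!rte_atomic128_cmp_exchange(slot, &exp, &src, 0,
			__ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
		;
}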
  

Patch

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 97060e4..2080c4d 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -1,5 +1,6 @@ 
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2015 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_ATOMIC_ARM64_H_
@@ -14,6 +15,9 @@  extern "C" {
 #endif
 
 #include "generic/rte_atomic.h"
+#include <rte_branch_prediction.h>
+#include <rte_compat.h>
+#include <rte_debug.h>
 
 #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
 #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
@@ -40,6 +44,167 @@  extern "C" {
 
 #define rte_cio_rmb() dmb(oshld)
 
+/*------------------------ 128 bit atomic operations -------------------------*/
+
+#define RTE_HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
+#define RTE_HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || \
+			 (mo) == __ATOMIC_ACQ_REL || \
+			 (mo) == __ATOMIC_SEQ_CST)
+
+#define RTE_MO_LOAD(mo)  (RTE_HAS_ACQ((mo)) \
+		? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED)
+#define RTE_MO_STORE(mo) (RTE_HAS_RLS((mo)) \
+		? __ATOMIC_RELEASE : __ATOMIC_RELAXED)
+
+#ifdef __ARM_FEATURE_ATOMICS
+#define __ATOMIC128_CAS_OP(cas_op_name, op_string)                          \
+static inline rte_int128_t                                                  \
+cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
+		rte_int128_t updated)                                               \
+{                                                                           \
+	/* caspX instructions register pair must start from even-numbered
+	 * register at operand 1.
+	 * So, specify registers for local variables here.
+	 */                                                                     \
+	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];                \
+	register uint64_t x1 __asm("x1") = (uint64_t)old.val[1];                \
+	register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0];            \
+	register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1];            \
+	asm volatile(                                                           \
+			op_string " %[old0], %[old1], %[upd0], %[upd1], [%[dst]]"       \
+			: [old0] "+r" (x0),                                             \
+			  [old1] "+r" (x1)                                              \
+			: [upd0] "r" (x2),                                              \
+			  [upd1] "r" (x3),                                              \
+			  [dst] "r" (dst)                                               \
+			: "memory");                                                    \
+	old.val[0] = x0;                                                        \
+	old.val[1] = x1;                                                        \
+	return old;                                                             \
+}
+
+__ATOMIC128_CAS_OP(__rte_cas_relaxed, "casp")
+__ATOMIC128_CAS_OP(__rte_cas_acquire, "caspa")
+__ATOMIC128_CAS_OP(__rte_cas_release, "caspl")
+__ATOMIC128_CAS_OP(__rte_cas_acq_rel, "caspal")
+#else
+#define __ATOMIC128_LDX_OP(ldx_op_name, op_string)                          \
+static inline rte_int128_t                                                  \
+ldx_op_name(const rte_int128_t *src)                                        \
+{                                                                           \
+	rte_int128_t ret;                                                       \
+	asm volatile(                                                           \
+			op_string " %0, %1, %2"                                         \
+			: "=&r" (ret.val[0]),                                           \
+			  "=&r" (ret.val[1])                                            \
+			: "Q" (src->val[0])                                             \
+			: "memory");                                                    \
+	return ret;                                                             \
+}
+
+__ATOMIC128_LDX_OP(__rte_ldx_relaxed, "ldxp")
+__ATOMIC128_LDX_OP(__rte_ldx_acquire, "ldaxp")
+
+#define __ATOMIC128_STX_OP(stx_op_name, op_string)                          \
+static inline uint32_t                                                      \
+stx_op_name(rte_int128_t *dst, const rte_int128_t src)                      \
+{                                                                           \
+	uint32_t ret;                                                           \
+	asm volatile(                                                           \
+			op_string " %w0, %1, %2, %3"                                    \
+			: "=&r" (ret)                                                   \
+			: "r" (src.val[0]),                                             \
+			  "r" (src.val[1]),                                             \
+			  "Q" (dst->val[0])                                             \
+			: "memory");                                                    \
+	/* Return 0 on success, 1 on failure */                                 \
+	return ret;                                                             \
+}
+
+__ATOMIC128_STX_OP(__rte_stx_relaxed, "stxp")
+__ATOMIC128_STX_OP(__rte_stx_release, "stlxp")
+#endif
+
+static inline int __rte_experimental
+rte_atomic128_cmp_exchange(rte_int128_t *dst,
+				rte_int128_t *exp,
+				const rte_int128_t *src,
+				unsigned int weak,
+				int success,
+				int failure)
+{
+	/* Always do strong CAS */
+	RTE_SET_USED(weak);
+	/* Ignore memory ordering for failure, memory order for
+	 * success must be stronger or equal
+	 */
+	RTE_SET_USED(failure);
+	/* Find invalid memory order */
+	RTE_ASSERT(success == __ATOMIC_RELAXED
+			|| success == __ATOMIC_ACQUIRE
+			|| success == __ATOMIC_RELEASE
+			|| success == __ATOMIC_ACQ_REL
+			|| success == __ATOMIC_SEQ_CST);
+
+#ifdef __ARM_FEATURE_ATOMICS
+	rte_int128_t expected = *exp;
+	rte_int128_t desired = *src;
+	rte_int128_t old;
+
+	if (success == __ATOMIC_RELAXED)
+		old = __rte_cas_relaxed(dst, expected, desired);
+	else if (success == __ATOMIC_ACQUIRE)
+		old = __rte_cas_acquire(dst, expected, desired);
+	else if (success == __ATOMIC_RELEASE)
+		old = __rte_cas_release(dst, expected, desired);
+	else
+		old = __rte_cas_acq_rel(dst, expected, desired);
+#else
+	int ldx_mo = RTE_MO_LOAD(success);
+	int stx_mo = RTE_MO_STORE(success);
+	uint32_t ret = 1;
+	register rte_int128_t expected = *exp;
+	register rte_int128_t desired = *src;
+	register rte_int128_t old;
+
+	/* ldx128 can not guarantee atomic,
+	 * Must write back src or old to verify atomicity of ldx128;
+	 */
+	do {
+		if (ldx_mo == __ATOMIC_RELAXED)
+			old = __rte_ldx_relaxed(dst);
+		else
+			old = __rte_ldx_acquire(dst);
+
+		if (likely(old.int128 == expected.int128)) {
+			if (stx_mo == __ATOMIC_RELAXED)
+				ret = __rte_stx_relaxed(dst, desired);
+			else
+				ret = __rte_stx_release(dst, desired);
+		} else {
+			/* In the failure case (since 'weak' is ignored and only
+			 * weak == 0 is implemented), expected should contain the
+			 * atomically read value of dst. This means, 'old' needs
+			 * to be stored back to ensure it was read atomically.
+			 */
+			if (stx_mo == __ATOMIC_RELAXED)
+				ret = __rte_stx_relaxed(dst, old);
+			else
+				ret = __rte_stx_release(dst, old);
+		}
+	} while (unlikely(ret));
+#endif
+
+	/* Unconditionally updating expected removes
+	 * an 'if' statement.
+	 * expected should already be in register if
+	 * not in the cache.
+	 */
+	*exp = old;
+
+	return (old.int128 == expected.int128);
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
index 6232c57..23cf48f 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
@@ -212,18 +212,6 @@  static inline void rte_atomic64_clear(rte_atomic64_t *v)
 
 /*------------------------ 128 bit atomic operations -------------------------*/
 
-/**
- * 128-bit integer structure.
- */
-RTE_STD_C11
-typedef struct {
-	RTE_STD_C11
-	union {
-		uint64_t val[2];
-		__extension__ __int128 int128;
-	};
-} __rte_aligned(16) rte_int128_t;
-
 static inline int __rte_experimental
 rte_atomic128_cmp_exchange(rte_int128_t *dst,
 			   rte_int128_t *exp,
diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index 9958543..2355e50 100644
--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
@@ -1081,6 +1081,20 @@  static inline void rte_atomic64_clear(rte_atomic64_t *v)
 
 /*------------------------ 128 bit atomic operations -------------------------*/
 
+#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_ARM64)
+/**
+ * 128-bit integer structure.
+ */
+RTE_STD_C11
+typedef struct {
+	RTE_STD_C11
+	union {
+		uint64_t val[2];
+		__extension__ __int128 int128;
+	};
+} __rte_aligned(16) rte_int128_t;
+#endif
+
 #ifdef __DOXYGEN__
 
 /**
@@ -1093,7 +1107,8 @@  static inline void rte_atomic64_clear(rte_atomic64_t *v)
  *     *exp = *dst
  * @endcode
  *
- * @note This function is currently only available for the x86-64 platform.
+ * @note This function is currently available for the x86-64 and aarch64
+ * platforms.
  *
  * @note The success and failure arguments must be one of the __ATOMIC_* values
  * defined in the C++11 standard. For details on their behavior, refer to the