[v3,06/12] ipsec: optimize with c11 atomic for sa outbound sqn update

Message ID 1584407863-774-7-git-send-email-phil.yang@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series generic rte atomic APIs deprecate proposal |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation fail Compilation issues

Commit Message

Phil Yang March 17, 2020, 1:17 a.m. UTC
  For SA outbound packets, rte_atomic64_add_return is used to generate
SQN atomically. This introduced an unnecessary full barrier by calling
the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
patch optimized it with c11 atomic and eliminated the expensive barrier
for aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_ipsec/ipsec_sqn.h | 3 ++-
 lib/librte_ipsec/sa.h        | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)
  

Comments

Ananyev, Konstantin March 23, 2020, 6:48 p.m. UTC | #1
Hi Phil,

> 
> For SA outbound packets, rte_atomic64_add_return is used to generate
> SQN atomically. This introduced an unnecessary full barrier by calling
> the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> patch optimized it with c11 atomic and eliminated the expensive barrier
> for aarch64.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> ---
>  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
>  lib/librte_ipsec/sa.h        | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
> index 0c2f76a..e884af7 100644
> --- a/lib/librte_ipsec/ipsec_sqn.h
> +++ b/lib/librte_ipsec/ipsec_sqn.h
> @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
> 
>  	n = *num;
>  	if (SQN_ATOMIC(sa))
> -		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
> +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> +			__ATOMIC_RELAXED);

One generic thing to note:
clang for i686 in some cases will generate a proper function call for
64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for such cases).
Does anyone consider it as a potential problem?
It probably not a big deal, but would like to know broader opinion.

>  	else {
>  		sqn = sa->sqn.outb.raw + n;
>  		sa->sqn.outb.raw = sqn;
> diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
> index d22451b..cab9a2e 100644
> --- a/lib/librte_ipsec/sa.h
> +++ b/lib/librte_ipsec/sa.h
> @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
>  	 */
>  	union {
>  		union {
> -			rte_atomic64_t atom;
> +			uint64_t atom;
>  			uint64_t raw;
>  		} outb;

If we don't need rte_atomic64 here anymore,
then I think we can collapse the union to just:
uint64_t outb; 

>  		struct {
> --
> 2.7.4
  
Honnappa Nagarahalli March 23, 2020, 7:07 p.m. UTC | #2
<snip>

> Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound
> sqn update
> 
> Hi Phil,
> 
> >
> > For SA outbound packets, rte_atomic64_add_return is used to generate
> > SQN atomically. This introduced an unnecessary full barrier by calling
> > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > patch optimized it with c11 atomic and eliminated the expensive
> > barrier for aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > ---
> >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> >  lib/librte_ipsec/sa.h        | 2 +-
> >  2 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > --- a/lib/librte_ipsec/ipsec_sqn.h
> > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > uint32_t *num)
> >
> >  	n = *num;
> >  	if (SQN_ATOMIC(sa))
> > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> >sqn.outb.atom, n);
> > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > +			__ATOMIC_RELAXED);
> 
> One generic thing to note:
> clang for i686 in some cases will generate a proper function call for 64-bit
> __atomic builtins (gcc seems to always generate cmpxchng8b for such cases).
> Does anyone consider it as a potential problem?
> It probably not a big deal, but would like to know broader opinion.
I had looked at this some time back for GCC. The function call is generated only if the underlying platform does not support the atomic instructions for the operand size. Otherwise, gcc generates the instructions directly.
I would think the behavior would be the same for clang.

> 
> >  	else {
> >  		sqn = sa->sqn.outb.raw + n;
> >  		sa->sqn.outb.raw = sqn;
> > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > d22451b..cab9a2e 100644
> > --- a/lib/librte_ipsec/sa.h
> > +++ b/lib/librte_ipsec/sa.h
> > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> >  	 */
> >  	union {
> >  		union {
> > -			rte_atomic64_t atom;
> > +			uint64_t atom;
> >  			uint64_t raw;
> >  		} outb;
> 
> If we don't need rte_atomic64 here anymore, then I think we can collapse the
> union to just:
> uint64_t outb;
> 
> >  		struct {
> > --
> > 2.7.4
  
Ananyev, Konstantin March 23, 2020, 7:18 p.m. UTC | #3
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Monday, March 23, 2020 7:08 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; Van Haaren, Harry
> <harry.van.haaren@intel.com>; stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
> 
> <snip>
> 
> > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound
> > sqn update
> >
> > Hi Phil,
> >
> > >
> > > For SA outbound packets, rte_atomic64_add_return is used to generate
> > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > > patch optimized it with c11 atomic and eliminated the expensive
> > > barrier for aarch64.
> > >
> > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > ---
> > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > >  lib/librte_ipsec/sa.h        | 2 +-
> > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > uint32_t *num)
> > >
> > >  	n = *num;
> > >  	if (SQN_ATOMIC(sa))
> > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > >sqn.outb.atom, n);
> > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > +			__ATOMIC_RELAXED);
> >
> > One generic thing to note:
> > clang for i686 in some cases will generate a proper function call for 64-bit
> > __atomic builtins (gcc seems to always generate cmpxchng8b for such cases).
> > Does anyone consider it as a potential problem?
> > It probably not a big deal, but would like to know broader opinion.
> I had looked at this some time back for GCC. The function call is generated only if the underlying platform does not support the atomic
> instructions for the operand size. Otherwise, gcc generates the instructions directly.
> I would think the behavior would be the same for clang.

From what I see not really.
As an example:

$ cat tatm11.c
#include <stdint.h>

struct x {
        uint64_t v __attribute__((aligned(8)));
};

uint64_t
ffxadd1(struct x *x, uint32_t n, uint32_t m)
{
        return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
}

uint64_t
ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
{
        return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
}

gcc for i686 will generate code with cmpxchng8b for both cases.
clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B aligned,
but will emit a function call for ffxadd11().

> 
> >
> > >  	else {
> > >  		sqn = sa->sqn.outb.raw + n;
> > >  		sa->sqn.outb.raw = sqn;
> > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > d22451b..cab9a2e 100644
> > > --- a/lib/librte_ipsec/sa.h
> > > +++ b/lib/librte_ipsec/sa.h
> > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > >  	 */
> > >  	union {
> > >  		union {
> > > -			rte_atomic64_t atom;
> > > +			uint64_t atom;
> > >  			uint64_t raw;
> > >  		} outb;
> >
> > If we don't need rte_atomic64 here anymore, then I think we can collapse the
> > union to just:
> > uint64_t outb;
> >
> > >  		struct {
> > > --
> > > 2.7.4
  
Honnappa Nagarahalli March 23, 2020, 8:20 p.m. UTC | #4
<snip>

> > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> > > outbound sqn update
> > >
> > > Hi Phil,
> > >
> > > >
> > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > > generate SQN atomically. This introduced an unnecessary full
> > > > barrier by calling the '__sync' builtin implemented rte_atomic_XX
> > > > API on aarch64. This patch optimized it with c11 atomic and
> > > > eliminated the expensive barrier for aarch64.
> > > >
> > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > ---
> > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > uint32_t *num)
> > > >
> > > >  	n = *num;
> > > >  	if (SQN_ATOMIC(sa))
> > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > >sqn.outb.atom, n);
> > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > +			__ATOMIC_RELAXED);
> > >
> > > One generic thing to note:
> > > clang for i686 in some cases will generate a proper function call
> > > for 64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for
> such cases).
> > > Does anyone consider it as a potential problem?
> > > It probably not a big deal, but would like to know broader opinion.
> > I had looked at this some time back for GCC. The function call is
> > generated only if the underlying platform does not support the atomic
> instructions for the operand size. Otherwise, gcc generates the instructions
> directly.
> > I would think the behavior would be the same for clang.
> 
> From what I see not really.
> As an example:
> 
> $ cat tatm11.c
> #include <stdint.h>
> 
> struct x {
>         uint64_t v __attribute__((aligned(8))); };
> 
> uint64_t
> ffxadd1(struct x *x, uint32_t n, uint32_t m) {
>         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED); }
> 
> uint64_t
> ffxadd11(uint64_t *v, uint32_t n, uint32_t m) {
>         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED); }
> 
> gcc for i686 will generate code with cmpxchng8b for both cases.
> clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> aligned, but will emit a function call for ffxadd11().
Does it require libatomic to be linked in this case? Clang documentation calls out unaligned case where it would generate the function call [1].
On aarch64, the atomic instructions need the address to be aligned.

[1] https://clang.llvm.org/docs/Toolchain.html#atomics-library

> 
> >
> > >
> > > >  	else {
> > > >  		sqn = sa->sqn.outb.raw + n;
> > > >  		sa->sqn.outb.raw = sqn;
> > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > d22451b..cab9a2e 100644
> > > > --- a/lib/librte_ipsec/sa.h
> > > > +++ b/lib/librte_ipsec/sa.h
> > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > >  	 */
> > > >  	union {
> > > >  		union {
> > > > -			rte_atomic64_t atom;
> > > > +			uint64_t atom;
> > > >  			uint64_t raw;
> > > >  		} outb;
> > >
> > > If we don't need rte_atomic64 here anymore, then I think we can
> > > collapse the union to just:
> > > uint64_t outb;
> > >
> > > >  		struct {
> > > > --
> > > > 2.7.4
  
Phil Yang March 24, 2020, 10:37 a.m. UTC | #5
Hi Konstantin,

<snip>
> >
> > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> outbound
> > > sqn update
> > >
> > > Hi Phil,
> > >
> > > >
> > > > For SA outbound packets, rte_atomic64_add_return is used to
> generate
> > > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > > > patch optimized it with c11 atomic and eliminated the expensive
> > > > barrier for aarch64.
> > > >
> > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > ---
> > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > uint32_t *num)
> > > >
> > > >  	n = *num;
> > > >  	if (SQN_ATOMIC(sa))
> > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > >sqn.outb.atom, n);
> > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > +			__ATOMIC_RELAXED);
> > >
> > > One generic thing to note:
> > > clang for i686 in some cases will generate a proper function call for 64-bit
> > > __atomic builtins (gcc seems to always generate cmpxchng8b for such
> cases).
> > > Does anyone consider it as a potential problem?
> > > It probably not a big deal, but would like to know broader opinion.
> > I had looked at this some time back for GCC. The function call is generated
> only if the underlying platform does not support the atomic
> > instructions for the operand size. Otherwise, gcc generates the instructions
> directly.
> > I would think the behavior would be the same for clang.
> 
> From what I see not really.
> As an example:
> 
> $ cat tatm11.c
> #include <stdint.h>
> 
> struct x {
>         uint64_t v __attribute__((aligned(8)));
> };
> 
> uint64_t
> ffxadd1(struct x *x, uint32_t n, uint32_t m)
> {
>         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> }
> 
> uint64_t
> ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> {
>         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> }
> 
> gcc for i686 will generate code with cmpxchng8b for both cases.
> clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> aligned,
> but will emit a function call for ffxadd11().

I guess your testbed is an i386 platform.  However, what I see here is different.

Testbed i686:  Ubuntu 18.04.4 LTS/GCC 8.3/ Clang 9.0.0-2
Both Clang and GCC for i686 generate code with xadd for these two cases.

Testbed i386:  Ubuntu 16.04 LTS (Installed libatomic)/GCC 5.4.0/ Clang 4.0.0
GCC will generate code with cmpxchng8b for both cases.
Clang generated code emits a function call for both cases.

Thanks,
Phil
> 
> >
> > >
> > > >  	else {
> > > >  		sqn = sa->sqn.outb.raw + n;
> > > >  		sa->sqn.outb.raw = sqn;
> > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > d22451b..cab9a2e 100644
> > > > --- a/lib/librte_ipsec/sa.h
> > > > +++ b/lib/librte_ipsec/sa.h
> > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > >  	 */
> > > >  	union {
> > > >  		union {
> > > > -			rte_atomic64_t atom;
> > > > +			uint64_t atom;
> > > >  			uint64_t raw;
> > > >  		} outb;
> > >
> > > If we don't need rte_atomic64 here anymore, then I think we can
> collapse the
> > > union to just:
> > > uint64_t outb;
> > >
> > > >  		struct {
> > > > --
> > > > 2.7.4
  
Ananyev, Konstantin March 24, 2020, 11:03 a.m. UTC | #6
Hi Phil,

> <snip>
> > >
> > > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> > outbound
> > > > sqn update
> > > >
> > > > Hi Phil,
> > > >
> > > > >
> > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > generate
> > > > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > > > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > > > > patch optimized it with c11 atomic and eliminated the expensive
> > > > > barrier for aarch64.
> > > > >
> > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > ---
> > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > > uint32_t *num)
> > > > >
> > > > >  	n = *num;
> > > > >  	if (SQN_ATOMIC(sa))
> > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > > >sqn.outb.atom, n);
> > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > +			__ATOMIC_RELAXED);
> > > >
> > > > One generic thing to note:
> > > > clang for i686 in some cases will generate a proper function call for 64-bit
> > > > __atomic builtins (gcc seems to always generate cmpxchng8b for such
> > cases).
> > > > Does anyone consider it as a potential problem?
> > > > It probably not a big deal, but would like to know broader opinion.
> > > I had looked at this some time back for GCC. The function call is generated
> > only if the underlying platform does not support the atomic
> > > instructions for the operand size. Otherwise, gcc generates the instructions
> > directly.
> > > I would think the behavior would be the same for clang.
> >
> > From what I see not really.
> > As an example:
> >
> > $ cat tatm11.c
> > #include <stdint.h>
> >
> > struct x {
> >         uint64_t v __attribute__((aligned(8)));
> > };
> >
> > uint64_t
> > ffxadd1(struct x *x, uint32_t n, uint32_t m)
> > {
> >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> > }
> >
> > uint64_t
> > ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> > {
> >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> > }
> >
> > gcc for i686 will generate code with cmpxchng8b for both cases.
> > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > aligned,
> > but will emit a function call for ffxadd11().
> 
> I guess your testbed is an i386 platform.  However, what I see here is different.
> 
> Testbed i686:  Ubuntu 18.04.4 LTS/GCC 8.3/ Clang 9.0.0-2
> Both Clang and GCC for i686 generate code with xadd for these two cases.

I suppose you meant x86_64 here (-m64), right?


> 
> Testbed i386:  Ubuntu 16.04 LTS (Installed libatomic)/GCC 5.4.0/ Clang 4.0.0
> GCC will generate code with cmpxchng8b for both cases.
> Clang generated code emits a function call for both cases.

That's exactly what I am talking about above.
X86_64 (64 bit binary) - no function calls for both gcc and clang
i686 (32 bit binary) - no function calls with gcc, functions calls with clang
when explicit alignment is not specified.

As I said in my initial email, that's probably not a big deal -
from what I was told so far we don't officially support clang for IA-32
and I don't know does anyone uses it at all right now.
Though if someone thinks it is a potential problem here -
it is better to flag it at early stage.
So once again my questions to the community:
1/ Does anyone builds/uses DPDK with i686-clang? 
2/ If there are anyone, can these persons try to evaluate
how big perf drop it would cause for them?  
3/ Is there an option to switch to i686-gcc (supported one)?
Konstantin

> >
> > >
> > > >
> > > > >  	else {
> > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > >  		sa->sqn.outb.raw = sqn;
> > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > d22451b..cab9a2e 100644
> > > > > --- a/lib/librte_ipsec/sa.h
> > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > >  	 */
> > > > >  	union {
> > > > >  		union {
> > > > > -			rte_atomic64_t atom;
> > > > > +			uint64_t atom;
> > > > >  			uint64_t raw;
> > > > >  		} outb;
> > > >
> > > > If we don't need rte_atomic64 here anymore, then I think we can
> > collapse the
> > > > union to just:
> > > > uint64_t outb;
> > > >
> > > > >  		struct {
> > > > > --
> > > > > 2.7.4
  
Ananyev, Konstantin March 24, 2020, 1:10 p.m. UTC | #7
> > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > > > generate SQN atomically. This introduced an unnecessary full
> > > > > barrier by calling the '__sync' builtin implemented rte_atomic_XX
> > > > > API on aarch64. This patch optimized it with c11 atomic and
> > > > > eliminated the expensive barrier for aarch64.
> > > > >
> > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > ---
> > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > > uint32_t *num)
> > > > >
> > > > >  	n = *num;
> > > > >  	if (SQN_ATOMIC(sa))
> > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > > >sqn.outb.atom, n);
> > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > +			__ATOMIC_RELAXED);
> > > >
> > > > One generic thing to note:
> > > > clang for i686 in some cases will generate a proper function call
> > > > for 64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for
> > such cases).
> > > > Does anyone consider it as a potential problem?
> > > > It probably not a big deal, but would like to know broader opinion.
> > > I had looked at this some time back for GCC. The function call is
> > > generated only if the underlying platform does not support the atomic
> > instructions for the operand size. Otherwise, gcc generates the instructions
> > directly.
> > > I would think the behavior would be the same for clang.
> >
> > From what I see not really.
> > As an example:
> >
> > $ cat tatm11.c
> > #include <stdint.h>
> >
> > struct x {
> >         uint64_t v __attribute__((aligned(8))); };
> >
> > uint64_t
> > ffxadd1(struct x *x, uint32_t n, uint32_t m) {
> >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED); }
> >
> > uint64_t
> > ffxadd11(uint64_t *v, uint32_t n, uint32_t m) {
> >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED); }
> >
> > gcc for i686 will generate code with cmpxchng8b for both cases.
> > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > aligned, but will emit a function call for ffxadd11().
> Does it require libatomic to be linked in this case? 

Yes, it does.
In fact same story even with current dpdk.org master.
To make i686-native-linuxapp-clang successfully, I have to 
explicitly add EXTRA_LDFLAGS="-latomic".
To be more specific:
$ for i in i686-native-linuxapp-clang/lib/*.a; do x=`nm $i | grep __atomic_`; if [[ -n "${x}" ]]; then echo $i; echo $x; fi; done
i686-native-linuxapp-clang/lib/librte_distributor.a
U __atomic_load_8 U __atomic_store_8
i686-native-linuxapp-clang/lib/librte_pmd_opdl_event.a
U __atomic_load_8 U __atomic_store_8
i686-native-linuxapp-clang/lib/librte_rcu.a
U __atomic_compare_exchange_8 U __atomic_load_8

As there were no complains so far, it makes me think that
probably no-one using clang for IA-32 builds.

> Clang documentation calls out unaligned case where it would generate the function call
> [1].

Seems so, and it treats uin64_t as 4B aligned for IA.
 
> On aarch64, the atomic instructions need the address to be aligned.

For that particular case (cmpxchng8b) there is no such restrictions for IA-32.
Again, as I said before, gcc manages to emit code without function calls
for exactly the same source.

> 
> [1] https://clang.llvm.org/docs/Toolchain.html#atomics-library
> 
> >
> > >
> > > >
> > > > >  	else {
> > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > >  		sa->sqn.outb.raw = sqn;
> > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > d22451b..cab9a2e 100644
> > > > > --- a/lib/librte_ipsec/sa.h
> > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > >  	 */
> > > > >  	union {
> > > > >  		union {
> > > > > -			rte_atomic64_t atom;
> > > > > +			uint64_t atom;
> > > > >  			uint64_t raw;
> > > > >  		} outb;
> > > >
> > > > If we don't need rte_atomic64 here anymore, then I think we can
> > > > collapse the union to just:
> > > > uint64_t outb;
> > > >
> > > > >  		struct {
> > > > > --
> > > > > 2.7.4
  
Ananyev, Konstantin March 24, 2020, 1:21 p.m. UTC | #8
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Ananyev, Konstantin
> Sent: Tuesday, March 24, 2020 1:10 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; Van Haaren,
> Harry <harry.van.haaren@intel.com>; stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
> 
> 
> > > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > > > > generate SQN atomically. This introduced an unnecessary full
> > > > > > barrier by calling the '__sync' builtin implemented rte_atomic_XX
> > > > > > API on aarch64. This patch optimized it with c11 atomic and
> > > > > > eliminated the expensive barrier for aarch64.
> > > > > >
> > > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > ---
> > > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > > > uint32_t *num)
> > > > > >
> > > > > >  	n = *num;
> > > > > >  	if (SQN_ATOMIC(sa))
> > > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > > > >sqn.outb.atom, n);
> > > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > > +			__ATOMIC_RELAXED);
> > > > >
> > > > > One generic thing to note:
> > > > > clang for i686 in some cases will generate a proper function call
> > > > > for 64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for
> > > such cases).
> > > > > Does anyone consider it as a potential problem?
> > > > > It probably not a big deal, but would like to know broader opinion.
> > > > I had looked at this some time back for GCC. The function call is
> > > > generated only if the underlying platform does not support the atomic
> > > instructions for the operand size. Otherwise, gcc generates the instructions
> > > directly.
> > > > I would think the behavior would be the same for clang.
> > >
> > > From what I see not really.
> > > As an example:
> > >
> > > $ cat tatm11.c
> > > #include <stdint.h>
> > >
> > > struct x {
> > >         uint64_t v __attribute__((aligned(8))); };
> > >
> > > uint64_t
> > > ffxadd1(struct x *x, uint32_t n, uint32_t m) {
> > >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED); }
> > >
> > > uint64_t
> > > ffxadd11(uint64_t *v, uint32_t n, uint32_t m) {
> > >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED); }
> > >
> > > gcc for i686 will generate code with cmpxchng8b for both cases.
> > > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > > aligned, but will emit a function call for ffxadd11().
> > Does it require libatomic to be linked in this case?
> 
> Yes, it does.
> In fact same story even with current dpdk.org master.
> To make i686-native-linuxapp-clang successfully, I have to
> explicitly add EXTRA_LDFLAGS="-latomic".
> To be more specific:
> $ for i in i686-native-linuxapp-clang/lib/*.a; do x=`nm $i | grep __atomic_`; if [[ -n "${x}" ]]; then echo $i; echo $x; fi; done
> i686-native-linuxapp-clang/lib/librte_distributor.a
> U __atomic_load_8 U __atomic_store_8
> i686-native-linuxapp-clang/lib/librte_pmd_opdl_event.a
> U __atomic_load_8 U __atomic_store_8
> i686-native-linuxapp-clang/lib/librte_rcu.a
> U __atomic_compare_exchange_8 U __atomic_load_8
> 
> As there were no complains so far, it makes me think that
> probably no-one using clang for IA-32 builds.
> 
> > Clang documentation calls out unaligned case where it would generate the function call
> > [1].
> 
> Seems so, and it treats uin64_t as 4B aligned for IA.
correction: for IA-32

> 
> > On aarch64, the atomic instructions need the address to be aligned.
> 
> For that particular case (cmpxchng8b) there is no such restrictions for IA-32.
> Again, as I said before, gcc manages to emit code without function calls
> for exactly the same source.
> 
> >
> > [1] https://clang.llvm.org/docs/Toolchain.html#atomics-library
> >
> > >
> > > >
> > > > >
> > > > > >  	else {
> > > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > > >  		sa->sqn.outb.raw = sqn;
> > > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > > d22451b..cab9a2e 100644
> > > > > > --- a/lib/librte_ipsec/sa.h
> > > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > > >  	 */
> > > > > >  	union {
> > > > > >  		union {
> > > > > > -			rte_atomic64_t atom;
> > > > > > +			uint64_t atom;
> > > > > >  			uint64_t raw;
> > > > > >  		} outb;
> > > > >
> > > > > If we don't need rte_atomic64 here anymore, then I think we can
> > > > > collapse the union to just:
> > > > > uint64_t outb;
> > > > >
> > > > > >  		struct {
> > > > > > --
> > > > > > 2.7.4
  
Phil Yang March 25, 2020, 9:38 a.m. UTC | #9
> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Tuesday, March 24, 2020 7:04 PM
> To: Phil Yang <Phil.Yang@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; thomas@monjalon.net; Van Haaren,
> Harry <harry.van.haaren@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd
> <nd@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> outbound sqn update
> 
> 
> Hi Phil,
> 
> > <snip>
> > > >
> > > > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> > > outbound
> > > > > sqn update
> > > > >
> > > > > Hi Phil,
> > > > >
> > > > > >
> > > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > generate
> > > > > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > > > > the '__sync' builtin implemented rte_atomic_XX API on aarch64.
> This
> > > > > > patch optimized it with c11 atomic and eliminated the expensive
> > > > > > barrier for aarch64.
> > > > > >
> > > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > ---
> > > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa
> *sa,
> > > > > > uint32_t *num)
> > > > > >
> > > > > >  	n = *num;
> > > > > >  	if (SQN_ATOMIC(sa))
> > > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > > > >sqn.outb.atom, n);
> > > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > > +			__ATOMIC_RELAXED);
> > > > >
> > > > > One generic thing to note:
> > > > > clang for i686 in some cases will generate a proper function call for 64-
> bit
> > > > > __atomic builtins (gcc seems to always generate cmpxchng8b for such
> > > cases).
> > > > > Does anyone consider it as a potential problem?
> > > > > It probably not a big deal, but would like to know broader opinion.
> > > > I had looked at this some time back for GCC. The function call is
> generated
> > > only if the underlying platform does not support the atomic
> > > > instructions for the operand size. Otherwise, gcc generates the
> instructions
> > > directly.
> > > > I would think the behavior would be the same for clang.
> > >
> > > From what I see not really.
> > > As an example:
> > >
> > > $ cat tatm11.c
> > > #include <stdint.h>
> > >
> > > struct x {
> > >         uint64_t v __attribute__((aligned(8)));
> > > };
> > >
> > > uint64_t
> > > ffxadd1(struct x *x, uint32_t n, uint32_t m)
> > > {
> > >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> > > }
> > >
> > > uint64_t
> > > ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> > > {
> > >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> > > }
> > >
> > > gcc for i686 will generate code with cmpxchng8b for both cases.
> > > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > > aligned,
> > > but will emit a function call for ffxadd11().
> >
> > I guess your testbed is an i386 platform.  However, what I see here is
> different.
> >
> > Testbed i686:  Ubuntu 18.04.4 LTS/GCC 8.3/ Clang 9.0.0-2
> > Both Clang and GCC for i686 generate code with xadd for these two cases.
> 
> I suppose you meant x86_64 here (-m64), right?

Yes. It is x86_64 here.

> 
> 
> >
> > Testbed i386:  Ubuntu 16.04 LTS (Installed libatomic)/GCC 5.4.0/ Clang 4.0.0
> > GCC will generate code with cmpxchng8b for both cases.
> > Clang generated code emits a function call for both cases.
> 
> That's exactly what I am talking about above.
> X86_64 (64 bit binary) - no function calls for both gcc and clang
> i686 (32 bit binary) - no function calls with gcc, functions calls with clang
> when explicit alignment is not specified.
> 
> As I said in my initial email, that's probably not a big deal -
> from what I was told so far we don't officially support clang for IA-32
> and I don't know does anyone uses it at all right now.
> Though if someone thinks it is a potential problem here -
> it is better to flag it at early stage.
> So once again my questions to the community:
> 1/ Does anyone builds/uses DPDK with i686-clang?
> 2/ If there are anyone, can these persons try to evaluate
> how big perf drop it would cause for them?
> 3/ Is there an option to switch to i686-gcc (supported one)?
> Konstantin
> 
> > >
> > > >
> > > > >
> > > > > >  	else {
> > > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > > >  		sa->sqn.outb.raw = sqn;
> > > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > > d22451b..cab9a2e 100644
> > > > > > --- a/lib/librte_ipsec/sa.h
> > > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > > >  	 */
> > > > > >  	union {
> > > > > >  		union {
> > > > > > -			rte_atomic64_t atom;
> > > > > > +			uint64_t atom;
> > > > > >  			uint64_t raw;
> > > > > >  		} outb;
> > > > >
> > > > > If we don't need rte_atomic64 here anymore, then I think we can
> > > collapse the
> > > > > union to just:
> > > > > uint64_t outb;
> > > > >
> > > > > >  		struct {
> > > > > > --
> > > > > > 2.7.4
  

Patch

diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
index 0c2f76a..e884af7 100644
--- a/lib/librte_ipsec/ipsec_sqn.h
+++ b/lib/librte_ipsec/ipsec_sqn.h
@@ -128,7 +128,8 @@  esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
 
 	n = *num;
 	if (SQN_ATOMIC(sa))
-		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
+		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
+			__ATOMIC_RELAXED);
 	else {
 		sqn = sa->sqn.outb.raw + n;
 		sa->sqn.outb.raw = sqn;
diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
index d22451b..cab9a2e 100644
--- a/lib/librte_ipsec/sa.h
+++ b/lib/librte_ipsec/sa.h
@@ -120,7 +120,7 @@  struct rte_ipsec_sa {
 	 */
 	union {
 		union {
-			rte_atomic64_t atom;
+			uint64_t atom;
 			uint64_t raw;
 		} outb;
 		struct {