mbox series

[v3,00/12] generic rte atomic APIs deprecate proposal

Message ID 1584407863-774-1-git-send-email-phil.yang@arm.com (mailing list archive)
Headers
Series generic rte atomic APIs deprecate proposal |

Message

Phil Yang March 17, 2020, 1:17 a.m. UTC
  DPDK provides generic rte_atomic APIs to do several atomic operations.
These APIs are using the deprecated __sync built-ins and enforce full
memory barriers on aarch64. However, full barriers are not necessary
in many use cases. In order to address such use cases, C language offers
C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
by making use of the memory ordering parameter provided by the user.
Various patches submitted in the past [2] and the patches in this series
indicate significant performance gains on multiple aarch64 CPUs and no
performance loss on x86.

But the existing rte_atomic API implementations cannot be changed as the
APIs do not take the memory ordering parameter. The only choice available
is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
order to make this change, the following steps are proposed:

[1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
APIs (a script is added to flag the usages).
[2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.

This patchset contains:
1) the checkpatch script changes to flag rte_atomic API usage in patches.
2) changes to programmer guide describing writing efficient code for aarch64.
3) changes to various libraries to make use of c11 atomic APIs.

We are planning to replicate this idea across all the other libraries,
drivers, examples, test applications. In the next phase, we will add
changes to the mbuf, the EAL interrupts and the event timer adapter libraries.

v3:
add libatomic dependency for 32-bit clang

v2:
1. fix Clang '-Wincompatible-pointer-types' WARNING.
2. fix typos.

Honnappa Nagarahalli (2):
  service: avoid race condition for MT unsafe service
  service: identify service running on another core correctly

Phil Yang (10):
  doc: add generic atomic deprecation section
  devtools: prevent use of rte atomic APIs in future patches
  eal/build: add libatomic dependency for 32-bit clang
  build: remove redundant code
  vhost: optimize broadcast rarp sync with c11 atomic
  ipsec: optimize with c11 atomic for sa outbound sqn update
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 one-way barrier
  service: relax barriers with C11 atomic operations

 devtools/checkpatches.sh                         |   9 ++
 doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
 drivers/event/octeontx/meson.build               |   5 -
 drivers/event/octeontx2/meson.build              |   5 -
 drivers/event/opdl/meson.build                   |   5 -
 lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
 lib/librte_eal/meson.build                       |   6 +
 lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
 lib/librte_ipsec/sa.h                            |   2 +-
 lib/librte_rcu/meson.build                       |   5 -
 lib/librte_vhost/vhost.h                         |   2 +-
 lib/librte_vhost/vhost_user.c                    |   7 +-
 lib/librte_vhost/virtio_net.c                    |  16 ++-
 13 files changed, 181 insertions(+), 119 deletions(-)
  

Comments

Van Haaren, Harry March 18, 2020, 2:01 p.m. UTC | #1
Hi Phil & Honnappa,

> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com
> Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> 
> DPDK provides generic rte_atomic APIs to do several atomic operations.
> These APIs are using the deprecated __sync built-ins and enforce full
> memory barriers on aarch64. However, full barriers are not necessary
> in many use cases. In order to address such use cases, C language offers
> C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
> by making use of the memory ordering parameter provided by the user.
> Various patches submitted in the past [2] and the patches in this series
> indicate significant performance gains on multiple aarch64 CPUs and no
> performance loss on x86.
> 
> But the existing rte_atomic API implementations cannot be changed as the
> APIs do not take the memory ordering parameter. The only choice available
> is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
> order to make this change, the following steps are proposed:
> 
> [1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
> APIs (a script is added to flag the usages).
> [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.

On [1] above, I feel deprecating DPDKs atomic functions and failing checkpatch is
a bit sudden. Perhaps noting that in a future release (20.11?) DPDK will move to a
C11 based atomics model is a more gradual step to achieving the goal, and at that
point add a checkpatch warning for additions of rte_atomic*?

More on [2] in context below.

The above is my point-of-view, of course I'd like more people from the DPDK community
to provide their input too.


> This patchset contains:
> 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> 2) changes to programmer guide describing writing efficient code for aarch64.
> 3) changes to various libraries to make use of c11 atomic APIs.
> 
> We are planning to replicate this idea across all the other libraries,
> drivers, examples, test applications. In the next phase, we will add
> changes to the mbuf, the EAL interrupts and the event timer adapter libraries.

About ~6/12 patches of this C11 set are targeting the Service Cores area of DPDK. I have some concerns
over increased complexity of C11 implementation vs the (already complex) rte_atomic implementation today.
I see other patchsets enabling C11 across other DPDK components, so maybe we should also discuss C11
enabling in a wider context that just service cores?

I don't think it fair to expect all developers to be well versed in C11 atomic semantics like
understanding the complex interactions between the various C11 RELEASE, AQUIRE barriers requires.

As maintainer of Service Cores I'm hesitant to accept the large-scale refactor of atomic-implementation,
as it could lead to racey bugs that are likely extremely difficult to track down. (The recent race-on-exit
has proven the difficulty in reproducing, and that's with an atomics model I'm quite familiar with).

Let me be very clear: I don't wish to block a C11 atomic implementation, but I'd like to discuss how we
(DPDK community) can best port-to and maintain a complex multi-threaded service library with best-in-class
performance for the workload.

To put some discussions/solutions on the table:
- Shared Maintainership of a component?
     Split in functionality and C11 atomics implementation
     Obviously there would be collaboration required in such a case.
- Maybe shared maintainership is too much?
     A gentlemans/womans agreement of "helping out" with C11 atomics debug is enough?


Hope my concerns are understandable, and of course input/feedback welcomed! -Harry


PS: Apologies for the delay in reply - was OOO on Irish national holiday.


> v3:
> add libatomic dependency for 32-bit clang
> 
> v2:
> 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> 2. fix typos.
> 
> Honnappa Nagarahalli (2):
>   service: avoid race condition for MT unsafe service
>   service: identify service running on another core correctly
> 
> Phil Yang (10):
>   doc: add generic atomic deprecation section
>   devtools: prevent use of rte atomic APIs in future patches
>   eal/build: add libatomic dependency for 32-bit clang
>   build: remove redundant code
>   vhost: optimize broadcast rarp sync with c11 atomic
>   ipsec: optimize with c11 atomic for sa outbound sqn update
>   service: remove rte prefix from static functions
>   service: remove redundant code
>   service: optimize with c11 one-way barrier
>   service: relax barriers with C11 atomic operations
> 
>  devtools/checkpatches.sh                         |   9 ++
>  doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
>  drivers/event/octeontx/meson.build               |   5 -
>  drivers/event/octeontx2/meson.build              |   5 -
>  drivers/event/opdl/meson.build                   |   5 -
>  lib/librte_eal/common/rte_service.c              | 175 ++++++++++++----------
> -
>  lib/librte_eal/meson.build                       |   6 +
>  lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
>  lib/librte_ipsec/sa.h                            |   2 +-
>  lib/librte_rcu/meson.build                       |   5 -
>  lib/librte_vhost/vhost.h                         |   2 +-
>  lib/librte_vhost/vhost_user.c                    |   7 +-
>  lib/librte_vhost/virtio_net.c                    |  16 ++-
>  13 files changed, 181 insertions(+), 119 deletions(-)
> 
> --
> 2.7.4
  
Thomas Monjalon March 18, 2020, 3:13 p.m. UTC | #2
18/03/2020 15:01, Van Haaren, Harry:
> Hi Phil & Honnappa,
> 
> From: Phil Yang <phil.yang@arm.com>
> > 
> > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > These APIs are using the deprecated __sync built-ins and enforce full
> > memory barriers on aarch64. However, full barriers are not necessary
> > in many use cases. In order to address such use cases, C language offers
> > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
> > by making use of the memory ordering parameter provided by the user.
> > Various patches submitted in the past [2] and the patches in this series
> > indicate significant performance gains on multiple aarch64 CPUs and no
> > performance loss on x86.
> > 
> > But the existing rte_atomic API implementations cannot be changed as the
> > APIs do not take the memory ordering parameter. The only choice available
> > is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
> > order to make this change, the following steps are proposed:
> > 
> > [1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
> > APIs (a script is added to flag the usages).
> > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> 
> On [1] above, I feel deprecating DPDKs atomic functions and failing checkpatch is
> a bit sudden. Perhaps noting that in a future release (20.11?) DPDK will move to a
> C11 based atomics model is a more gradual step to achieving the goal, and at that
> point add a checkpatch warning for additions of rte_atomic*?
> 
> More on [2] in context below.
> 
> The above is my point-of-view, of course I'd like more people from the DPDK community
> to provide their input too.
> 
> 
> > This patchset contains:
> > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > 2) changes to programmer guide describing writing efficient code for aarch64.
> > 3) changes to various libraries to make use of c11 atomic APIs.
> > 
> > We are planning to replicate this idea across all the other libraries,
> > drivers, examples, test applications. In the next phase, we will add
> > changes to the mbuf, the EAL interrupts and the event timer adapter libraries.
> 
> About ~6/12 patches of this C11 set are targeting the Service Cores area of DPDK. I have some concerns
> over increased complexity of C11 implementation vs the (already complex) rte_atomic implementation today.
> I see other patchsets enabling C11 across other DPDK components, so maybe we should also discuss C11
> enabling in a wider context that just service cores?
> 
> I don't think it fair to expect all developers to be well versed in C11 atomic semantics like
> understanding the complex interactions between the various C11 RELEASE, AQUIRE barriers requires.
> 
> As maintainer of Service Cores I'm hesitant to accept the large-scale refactor of atomic-implementation,
> as it could lead to racey bugs that are likely extremely difficult to track down. (The recent race-on-exit
> has proven the difficulty in reproducing, and that's with an atomics model I'm quite familiar with).
> 
> Let me be very clear: I don't wish to block a C11 atomic implementation, but I'd like to discuss how we
> (DPDK community) can best port-to and maintain a complex multi-threaded service library with best-in-class
> performance for the workload.
> 
> To put some discussions/solutions on the table:
> - Shared Maintainership of a component?
>      Split in functionality and C11 atomics implementation
>      Obviously there would be collaboration required in such a case.
> - Maybe shared maintainership is too much?
>      A gentlemans/womans agreement of "helping out" with C11 atomics debug is enough?
> 
> 
> Hope my concerns are understandable, and of course input/feedback welcomed! -Harry

Thanks for raising the issue Harry.

I think we should have at least two official maintainers for C11 atomics in general.
C11 conversion is a progressive effort being done, and should be merged step by step.
If C11 maintainers fail to fix some issues on time, then we can hold the effort.
Does it make sense?
  
Honnappa Nagarahalli March 20, 2020, 4:51 a.m. UTC | #3
<snip>

> > Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> >
> > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > These APIs are using the deprecated __sync built-ins and enforce full
> > memory barriers on aarch64. However, full barriers are not necessary
> > in many use cases. In order to address such use cases, C language
> > offers
> > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > control by making use of the memory ordering parameter provided by the
> user.
> > Various patches submitted in the past [2] and the patches in this
> > series indicate significant performance gains on multiple aarch64 CPUs
> > and no performance loss on x86.
> >
> > But the existing rte_atomic API implementations cannot be changed as
> > the APIs do not take the memory ordering parameter. The only choice
> > available is replacing the usage of the rte_atomic APIs with C11
> > atomic APIs. In order to make this change, the following steps are proposed:
> >
> > [1] deprecate rte_atomic APIs so that future patches do not use
> > rte_atomic APIs (a script is added to flag the usages).
> > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> 
> On [1] above, I feel deprecating DPDKs atomic functions and failing checkpatch
> is a bit sudden. Perhaps noting that in a future release (20.11?) DPDK will
> move to a
> C11 based atomics model is a more gradual step to achieving the goal, and at
> that point add a checkpatch warning for additions of rte_atomic*?
We have been working on changing existing usages of rte_atomic APIs in DPDK to use C11 atomics. Usually, the x.11 releases have significant amount of changes (not sure how many would use rte_atomic APIs). I would prefer that in 20.11 no additional code is added using rte_atomics APIs. However, I am open to suggestions on the exact time frame.
Once we decide on the release, I think it makes sense to add a 'warning' in the checkpatch to indicate the deprecation timeline and add an 'error' after the release.

> 
> More on [2] in context below.
> 
> The above is my point-of-view, of course I'd like more people from the DPDK
> community to provide their input too.
> 
> 
> > This patchset contains:
> > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > 2) changes to programmer guide describing writing efficient code for
> aarch64.
> > 3) changes to various libraries to make use of c11 atomic APIs.
> >
> > We are planning to replicate this idea across all the other libraries,
> > drivers, examples, test applications. In the next phase, we will add
> > changes to the mbuf, the EAL interrupts and the event timer adapter
> libraries.
> 
> About ~6/12 patches of this C11 set are targeting the Service Cores area of
> DPDK. I have some concerns over increased complexity of C11 implementation
> vs the (already complex) rte_atomic implementation today.
I agree that it C11 changes are complex, especially if one is starting out to understand what these APIs provide. From my experience, once few underlying concepts are understood, reviewing or making changes do not take too much time.

> I see other patchsets enabling C11 across other DPDK components, so maybe
> we should also discuss C11 enabling in a wider context that just service cores?
Yes, agree. We are in the process of making changes to other areas as well.

> 
> I don't think it fair to expect all developers to be well versed in C11 atomic
> semantics like understanding the complex interactions between the various
> C11 RELEASE, AQUIRE barriers requires.
C11 has been around from sometime now. To my surprise, OVS already uses C11 APIs extensively. VPP has been accepting C11 related changes from past couple of years. Having said that, I agree in general that not everyone is well versed.

> 
> As maintainer of Service Cores I'm hesitant to accept the large-scale refactor
Right now, the patches are split into multiple commits. If required I can host a call to go over simple C11 API usages (sufficient to cover the usage in service core) and the changes in this patch. If you find that particular areas need more understanding I can work on providing additional information such as memory order ladder diagrams. Please let me know what you think.

> of atomic-implementation, as it could lead to racey bugs that are likely
> extremely difficult to track down. (The recent race-on-exit has proven the
> difficulty in reproducing, and that's with an atomics model I'm quite familiar
> with).
> 
> Let me be very clear: I don't wish to block a C11 atomic implementation, but
> I'd like to discuss how we (DPDK community) can best port-to and maintain a
> complex multi-threaded service library with best-in-class performance for the
> workload.
> 
> To put some discussions/solutions on the table:
> - Shared Maintainership of a component?
>      Split in functionality and C11 atomics implementation
>      Obviously there would be collaboration required in such a case.
> - Maybe shared maintainership is too much?
>      A gentlemans/womans agreement of "helping out" with C11 atomics debug
> is enough?
I think shared maintainer ship could be too much as there are many changes. But, I and other engineers from Arm (I would include Arm ecosystem as well) can definitely help out on debug and reviews involving C11 APIs. (I see Thomas's reply on this topic).

> 
> 
> Hope my concerns are understandable, and of course input/feedback
> welcomed! -Harry
No worries at all. We are here to help and make this as easy as possible.

> 
> 
> PS: Apologies for the delay in reply - was OOO on Irish national holiday.
Same here, was on vacation for 3 days.

> 
> 
> > v3:
> > add libatomic dependency for 32-bit clang
> >
> > v2:
> > 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> > 2. fix typos.
> >
> > Honnappa Nagarahalli (2):
> >   service: avoid race condition for MT unsafe service
> >   service: identify service running on another core correctly
> >
> > Phil Yang (10):
> >   doc: add generic atomic deprecation section
> >   devtools: prevent use of rte atomic APIs in future patches
> >   eal/build: add libatomic dependency for 32-bit clang
> >   build: remove redundant code
> >   vhost: optimize broadcast rarp sync with c11 atomic
> >   ipsec: optimize with c11 atomic for sa outbound sqn update
> >   service: remove rte prefix from static functions
> >   service: remove redundant code
> >   service: optimize with c11 one-way barrier
> >   service: relax barriers with C11 atomic operations
> >
> >  devtools/checkpatches.sh                         |   9 ++
> >  doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
> >  drivers/event/octeontx/meson.build               |   5 -
> >  drivers/event/octeontx2/meson.build              |   5 -
> >  drivers/event/opdl/meson.build                   |   5 -
> >  lib/librte_eal/common/rte_service.c              | 175 ++++++++++++----------
> > -
> >  lib/librte_eal/meson.build                       |   6 +
> >  lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
> >  lib/librte_ipsec/sa.h                            |   2 +-
> >  lib/librte_rcu/meson.build                       |   5 -
> >  lib/librte_vhost/vhost.h                         |   2 +-
> >  lib/librte_vhost/vhost_user.c                    |   7 +-
> >  lib/librte_vhost/virtio_net.c                    |  16 ++-
> >  13 files changed, 181 insertions(+), 119 deletions(-)
> >
> > --
> > 2.7.4
  
Honnappa Nagarahalli March 20, 2020, 5:01 a.m. UTC | #4
<snip>

> Subject: Re: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> 
> 18/03/2020 15:01, Van Haaren, Harry:
> > Hi Phil & Honnappa,
> >
> > From: Phil Yang <phil.yang@arm.com>
> > >
> > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > These APIs are using the deprecated __sync built-ins and enforce
> > > full memory barriers on aarch64. However, full barriers are not
> > > necessary in many use cases. In order to address such use cases, C
> > > language offers
> > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > control by making use of the memory ordering parameter provided by the
> user.
> > > Various patches submitted in the past [2] and the patches in this
> > > series indicate significant performance gains on multiple aarch64
> > > CPUs and no performance loss on x86.
> > >
> > > But the existing rte_atomic API implementations cannot be changed as
> > > the APIs do not take the memory ordering parameter. The only choice
> > > available is replacing the usage of the rte_atomic APIs with C11
> > > atomic APIs. In order to make this change, the following steps are
> proposed:
> > >
> > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > rte_atomic APIs (a script is added to flag the usages).
> > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> >
> > On [1] above, I feel deprecating DPDKs atomic functions and failing
> > checkpatch is a bit sudden. Perhaps noting that in a future release
> > (20.11?) DPDK will move to a
> > C11 based atomics model is a more gradual step to achieving the goal,
> > and at that point add a checkpatch warning for additions of rte_atomic*?
> >
> > More on [2] in context below.
> >
> > The above is my point-of-view, of course I'd like more people from the
> > DPDK community to provide their input too.
> >
> >
> > > This patchset contains:
> > > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > > 2) changes to programmer guide describing writing efficient code for
> aarch64.
> > > 3) changes to various libraries to make use of c11 atomic APIs.
> > >
> > > We are planning to replicate this idea across all the other
> > > libraries, drivers, examples, test applications. In the next phase,
> > > we will add changes to the mbuf, the EAL interrupts and the event timer
> adapter libraries.
> >
> > About ~6/12 patches of this C11 set are targeting the Service Cores
> > area of DPDK. I have some concerns over increased complexity of C11
> implementation vs the (already complex) rte_atomic implementation today.
> > I see other patchsets enabling C11 across other DPDK components, so
> > maybe we should also discuss C11 enabling in a wider context that just
> service cores?
> >
> > I don't think it fair to expect all developers to be well versed in
> > C11 atomic semantics like understanding the complex interactions between
> the various C11 RELEASE, AQUIRE barriers requires.
> >
> > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > refactor of atomic-implementation, as it could lead to racey bugs that
> > are likely extremely difficult to track down. (The recent race-on-exit has
> proven the difficulty in reproducing, and that's with an atomics model I'm
> quite familiar with).
> >
> > Let me be very clear: I don't wish to block a C11 atomic
> > implementation, but I'd like to discuss how we (DPDK community) can
> > best port-to and maintain a complex multi-threaded service library with
> best-in-class performance for the workload.
> >
> > To put some discussions/solutions on the table:
> > - Shared Maintainership of a component?
> >      Split in functionality and C11 atomics implementation
> >      Obviously there would be collaboration required in such a case.
> > - Maybe shared maintainership is too much?
> >      A gentlemans/womans agreement of "helping out" with C11 atomics
> debug is enough?
> >
> >
> > Hope my concerns are understandable, and of course input/feedback
> > welcomed! -Harry
> 
> Thanks for raising the issue Harry.
> 
> I think we should have at least two official maintainers for C11 atomics in
> general.
Sure, I can volunteer.

> C11 conversion is a progressive effort being done, and should be merged step
> by step.
Agree, the changes need to be understood, it is not a search and replace effort. The changes will come-in in stages unless others join the effort.
The concern I have is about the new patches that get added. I think we need to stop the new patches from using rte_atomic APIs, otherwise we might be making these changes forever.

> If C11 maintainers fail to fix some issues on time, then we can hold the effort.
> Does it make sense?
I am fine with this approach. But, I think we need to have a deadline in mind to complete the work.

>
  
Jerin Jacob March 20, 2020, 12:20 p.m. UTC | #5
On Fri, Mar 20, 2020 at 10:31 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> > Subject: Re: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> >
> > 18/03/2020 15:01, Van Haaren, Harry:
> > > Hi Phil & Honnappa,
> > >
> > > From: Phil Yang <phil.yang@arm.com>
> > > >
> > > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > > These APIs are using the deprecated __sync built-ins and enforce
> > > > full memory barriers on aarch64. However, full barriers are not
> > > > necessary in many use cases. In order to address such use cases, C
> > > > language offers
> > > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > > control by making use of the memory ordering parameter provided by the
> > user.
> > > > Various patches submitted in the past [2] and the patches in this
> > > > series indicate significant performance gains on multiple aarch64
> > > > CPUs and no performance loss on x86.
> > > >
> > > > But the existing rte_atomic API implementations cannot be changed as
> > > > the APIs do not take the memory ordering parameter. The only choice
> > > > available is replacing the usage of the rte_atomic APIs with C11
> > > > atomic APIs. In order to make this change, the following steps are
> > proposed:
> > > >
> > > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > > rte_atomic APIs (a script is added to flag the usages).
> > > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> > >
> > > On [1] above, I feel deprecating DPDKs atomic functions and failing
> > > checkpatch is a bit sudden. Perhaps noting that in a future release
> > > (20.11?) DPDK will move to a
> > > C11 based atomics model is a more gradual step to achieving the goal,
> > > and at that point add a checkpatch warning for additions of rte_atomic*?
> > >
> > > More on [2] in context below.
> > >
> > > The above is my point-of-view, of course I'd like more people from the
> > > DPDK community to provide their input too.
> > >
> > >
> > > > This patchset contains:
> > > > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > > > 2) changes to programmer guide describing writing efficient code for
> > aarch64.
> > > > 3) changes to various libraries to make use of c11 atomic APIs.
> > > >
> > > > We are planning to replicate this idea across all the other
> > > > libraries, drivers, examples, test applications. In the next phase,
> > > > we will add changes to the mbuf, the EAL interrupts and the event timer
> > adapter libraries.
> > >
> > > About ~6/12 patches of this C11 set are targeting the Service Cores
> > > area of DPDK. I have some concerns over increased complexity of C11
> > implementation vs the (already complex) rte_atomic implementation today.
> > > I see other patchsets enabling C11 across other DPDK components, so
> > > maybe we should also discuss C11 enabling in a wider context that just
> > service cores?
> > >
> > > I don't think it fair to expect all developers to be well versed in
> > > C11 atomic semantics like understanding the complex interactions between
> > the various C11 RELEASE, AQUIRE barriers requires.
> > >
> > > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > > refactor of atomic-implementation, as it could lead to racey bugs that
> > > are likely extremely difficult to track down. (The recent race-on-exit has
> > proven the difficulty in reproducing, and that's with an atomics model I'm
> > quite familiar with).
> > >
> > > Let me be very clear: I don't wish to block a C11 atomic
> > > implementation, but I'd like to discuss how we (DPDK community) can
> > > best port-to and maintain a complex multi-threaded service library with
> > best-in-class performance for the workload.
> > >
> > > To put some discussions/solutions on the table:
> > > - Shared Maintainership of a component?
> > >      Split in functionality and C11 atomics implementation
> > >      Obviously there would be collaboration required in such a case.
> > > - Maybe shared maintainership is too much?
> > >      A gentlemans/womans agreement of "helping out" with C11 atomics
> > debug is enough?
> > >
> > >
> > > Hope my concerns are understandable, and of course input/feedback
> > > welcomed! -Harry
> >
> > Thanks for raising the issue Harry.
> >
> > I think we should have at least two official maintainers for C11 atomics in
> > general.
> Sure, I can volunteer.
>
> > C11 conversion is a progressive effort being done, and should be merged step
> > by step.
> Agree, the changes need to be understood, it is not a search and replace effort. The changes will come-in in stages unless others join the effort.
> The concern I have is about the new patches that get added. I think we need to stop the new patches from using rte_atomic APIs, otherwise we might be making these changes forever.

+1. We must define a time frame otherwise we might be making these
changes forever.

>
> > If C11 maintainers fail to fix some issues on time, then we can hold the effort.
> > Does it make sense?
> I am fine with this approach. But, I think we need to have a deadline in mind to complete the work.
>
> >
>
  
Honnappa Nagarahalli March 20, 2020, 6:32 p.m. UTC | #6
+ Erik as there are similar changes to timer library

<snip>

> 
> > > Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> > >
> > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > These APIs are using the deprecated __sync built-ins and enforce
> > > full memory barriers on aarch64. However, full barriers are not
> > > necessary in many use cases. In order to address such use cases, C
> > > language offers
> > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > control by making use of the memory ordering parameter provided by
> > > the
> > user.
> > > Various patches submitted in the past [2] and the patches in this
> > > series indicate significant performance gains on multiple aarch64
> > > CPUs and no performance loss on x86.
> > >
> > > But the existing rte_atomic API implementations cannot be changed as
> > > the APIs do not take the memory ordering parameter. The only choice
> > > available is replacing the usage of the rte_atomic APIs with C11
> > > atomic APIs. In order to make this change, the following steps are
> proposed:
> > >
> > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > rte_atomic APIs (a script is added to flag the usages).
> > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> >
> > On [1] above, I feel deprecating DPDKs atomic functions and failing
> > checkpatch is a bit sudden. Perhaps noting that in a future release
> > (20.11?) DPDK will move to a
> > C11 based atomics model is a more gradual step to achieving the goal,
> > and at that point add a checkpatch warning for additions of rte_atomic*?
> We have been working on changing existing usages of rte_atomic APIs in DPDK
> to use C11 atomics. Usually, the x.11 releases have significant amount of
> changes (not sure how many would use rte_atomic APIs). I would prefer that
> in 20.11 no additional code is added using rte_atomics APIs. However, I am
> open to suggestions on the exact time frame.
> Once we decide on the release, I think it makes sense to add a 'warning' in the
> checkpatch to indicate the deprecation timeline and add an 'error' after the
> release.
> 
> >
> > More on [2] in context below.
> >
> > The above is my point-of-view, of course I'd like more people from the
> > DPDK community to provide their input too.
> >
> >
> > > This patchset contains:
> > > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > > 2) changes to programmer guide describing writing efficient code for
> > aarch64.
> > > 3) changes to various libraries to make use of c11 atomic APIs.
> > >
> > > We are planning to replicate this idea across all the other
> > > libraries, drivers, examples, test applications. In the next phase,
> > > we will add changes to the mbuf, the EAL interrupts and the event
> > > timer adapter
> > libraries.
> >
> > About ~6/12 patches of this C11 set are targeting the Service Cores
> > area of DPDK. I have some concerns over increased complexity of C11
> > implementation vs the (already complex) rte_atomic implementation today.
> I agree that it C11 changes are complex, especially if one is starting out to
> understand what these APIs provide. From my experience, once few
> underlying concepts are understood, reviewing or making changes do not take
> too much time.
> 
> > I see other patchsets enabling C11 across other DPDK components, so
> > maybe we should also discuss C11 enabling in a wider context that just
> service cores?
> Yes, agree. We are in the process of making changes to other areas as well.
> 
> >
> > I don't think it fair to expect all developers to be well versed in
> > C11 atomic semantics like understanding the complex interactions
> > between the various
> > C11 RELEASE, AQUIRE barriers requires.
> C11 has been around from sometime now. To my surprise, OVS already uses
> C11 APIs extensively. VPP has been accepting C11 related changes from past
> couple of years. Having said that, I agree in general that not everyone is well
> versed.
> 
> >
> > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > refactor
> Right now, the patches are split into multiple commits. If required I can host a
> call to go over simple C11 API usages (sufficient to cover the usage in service
> core) and the changes in this patch. If you find that particular areas need more
> understanding I can work on providing additional information such as memory
> order ladder diagrams. Please let me know what you think.
When I started working with C11 APIs, I had referred to the following blogs.
https://preshing.com/20120913/acquire-and-release-semantics/
https://preshing.com/20130702/the-happens-before-relation/
https://preshing.com/20130823/the-synchronizes-with-relation/

These will be helpful to understand the changes.

> 
> > of atomic-implementation, as it could lead to racey bugs that are
> > likely extremely difficult to track down. (The recent race-on-exit has
> > proven the difficulty in reproducing, and that's with an atomics model
> > I'm quite familiar with).
> >
> > Let me be very clear: I don't wish to block a C11 atomic
> > implementation, but I'd like to discuss how we (DPDK community) can
> > best port-to and maintain a complex multi-threaded service library
> > with best-in-class performance for the workload.
> >
> > To put some discussions/solutions on the table:
> > - Shared Maintainership of a component?
> >      Split in functionality and C11 atomics implementation
> >      Obviously there would be collaboration required in such a case.
> > - Maybe shared maintainership is too much?
> >      A gentlemans/womans agreement of "helping out" with C11 atomics
> > debug is enough?
> I think shared maintainer ship could be too much as there are many changes.
> But, I and other engineers from Arm (I would include Arm ecosystem as well)
> can definitely help out on debug and reviews involving C11 APIs. (I see
> Thomas's reply on this topic).
> 
> >
> >
> > Hope my concerns are understandable, and of course input/feedback
> > welcomed! -Harry
> No worries at all. We are here to help and make this as easy as possible.
> 
> >
> >
> > PS: Apologies for the delay in reply - was OOO on Irish national holiday.
> Same here, was on vacation for 3 days.
> 
> >
> >
> > > v3:
> > > add libatomic dependency for 32-bit clang
> > >
> > > v2:
> > > 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> > > 2. fix typos.
> > >
> > > Honnappa Nagarahalli (2):
> > >   service: avoid race condition for MT unsafe service
> > >   service: identify service running on another core correctly
> > >
> > > Phil Yang (10):
> > >   doc: add generic atomic deprecation section
> > >   devtools: prevent use of rte atomic APIs in future patches
> > >   eal/build: add libatomic dependency for 32-bit clang
> > >   build: remove redundant code
> > >   vhost: optimize broadcast rarp sync with c11 atomic
> > >   ipsec: optimize with c11 atomic for sa outbound sqn update
> > >   service: remove rte prefix from static functions
> > >   service: remove redundant code
> > >   service: optimize with c11 one-way barrier
> > >   service: relax barriers with C11 atomic operations
> > >
> > >  devtools/checkpatches.sh                         |   9 ++
> > >  doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
> > >  drivers/event/octeontx/meson.build               |   5 -
> > >  drivers/event/octeontx2/meson.build              |   5 -
> > >  drivers/event/opdl/meson.build                   |   5 -
> > >  lib/librte_eal/common/rte_service.c              | 175 ++++++++++++----------
> > > -
> > >  lib/librte_eal/meson.build                       |   6 +
> > >  lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
> > >  lib/librte_ipsec/sa.h                            |   2 +-
> > >  lib/librte_rcu/meson.build                       |   5 -
> > >  lib/librte_vhost/vhost.h                         |   2 +-
> > >  lib/librte_vhost/vhost_user.c                    |   7 +-
> > >  lib/librte_vhost/virtio_net.c                    |  16 ++-
> > >  13 files changed, 181 insertions(+), 119 deletions(-)
> > >
> > > --
> > > 2.7.4
>
  
Van Haaren, Harry March 27, 2020, 2:47 p.m. UTC | #7
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Friday, March 20, 2020 6:32 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Carrillo, Erik G <erik.g.carrillo@intel.com>; nd
> <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd
> <nd@arm.com>
> Subject: RE: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> 
> + Erik as there are similar changes to timer library
> 
> <snip>
> 
> >
> > > > Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> > > >
> > > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > > These APIs are using the deprecated __sync built-ins and enforce
> > > > full memory barriers on aarch64. However, full barriers are not
> > > > necessary in many use cases. In order to address such use cases, C
> > > > language offers
> > > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > > control by making use of the memory ordering parameter provided by
> > > > the
> > > user.
> > > > Various patches submitted in the past [2] and the patches in this
> > > > series indicate significant performance gains on multiple aarch64
> > > > CPUs and no performance loss on x86.
> > > >
> > > > But the existing rte_atomic API implementations cannot be changed as
> > > > the APIs do not take the memory ordering parameter. The only choice
> > > > available is replacing the usage of the rte_atomic APIs with C11
> > > > atomic APIs. In order to make this change, the following steps are
> > proposed:
> > > >
> > > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > > rte_atomic APIs (a script is added to flag the usages).
> > > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic
> APIs.
> > >
> > > On [1] above, I feel deprecating DPDKs atomic functions and failing
> > > checkpatch is a bit sudden. Perhaps noting that in a future release
> > > (20.11?) DPDK will move to a
> > > C11 based atomics model is a more gradual step to achieving the goal,
> > > and at that point add a checkpatch warning for additions of rte_atomic*?
> > We have been working on changing existing usages of rte_atomic APIs in
> DPDK
> > to use C11 atomics. Usually, the x.11 releases have significant amount of
> > changes (not sure how many would use rte_atomic APIs). I would prefer that
> > in 20.11 no additional code is added using rte_atomics APIs. However, I am
> > open to suggestions on the exact time frame.
> > Once we decide on the release, I think it makes sense to add a 'warning'
> in the
> > checkpatch to indicate the deprecation timeline and add an 'error' after
> the
> > release.

The above sounds reasonable - mainly let's not block any code that exists
or is being developed today using rte_atomic_* APIs from making a release.


> > > More on [2] in context below.
> > >
> > > The above is my point-of-view, of course I'd like more people from the
> > > DPDK community to provide their input too.
> > >
> > >
> > > > This patchset contains:
> > > > 1) the checkpatch script changes to flag rte_atomic API usage in
> patches.
> > > > 2) changes to programmer guide describing writing efficient code for
> > > aarch64.
> > > > 3) changes to various libraries to make use of c11 atomic APIs.
> > > >
> > > > We are planning to replicate this idea across all the other
> > > > libraries, drivers, examples, test applications. In the next phase,
> > > > we will add changes to the mbuf, the EAL interrupts and the event
> > > > timer adapter
> > > libraries.
> > >
> > > About ~6/12 patches of this C11 set are targeting the Service Cores
> > > area of DPDK. I have some concerns over increased complexity of C11
> > > implementation vs the (already complex) rte_atomic implementation today.
> > I agree that it C11 changes are complex, especially if one is starting out
> to
> > understand what these APIs provide. From my experience, once few
> > underlying concepts are understood, reviewing or making changes do not
> take
> > too much time.
> >
> > > I see other patchsets enabling C11 across other DPDK components, so
> > > maybe we should also discuss C11 enabling in a wider context that just
> > service cores?
> > Yes, agree. We are in the process of making changes to other areas as
> well.
> >
> > >
> > > I don't think it fair to expect all developers to be well versed in
> > > C11 atomic semantics like understanding the complex interactions
> > > between the various
> > > C11 RELEASE, AQUIRE barriers requires.
> > C11 has been around from sometime now. To my surprise, OVS already uses
> > C11 APIs extensively. VPP has been accepting C11 related changes from past
> > couple of years. Having said that, I agree in general that not everyone is
> > well versed.

Fair point - like so many things, once familiar with it, it becomes easy :)


> > > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > > refactor
> > Right now, the patches are split into multiple commits. If required I can
> host a
> > call to go over simple C11 API usages (sufficient to cover the usage in
> service
> > core) and the changes in this patch. If you find that particular areas
> need more
> > understanding I can work on providing additional information such as
> memory
> > order ladder diagrams. Please let me know what you think.

Thanks for the offer - I will need to do my due diligence on reiview before
taking up any of your or other C11 folks time.

> When I started working with C11 APIs, I had referred to the following blogs.
> https://preshing.com/20120913/acquire-and-release-semantics/
> https://preshing.com/20130702/the-happens-before-relation/
> https://preshing.com/20130823/the-synchronizes-with-relation/
> 
> These will be helpful to understand the changes.

Thanks, indeed good articles. I found the following slide deck particularly
informative due to the fantastic diagrams (eg, slide 23):
https://mariadb.org/wp-content/uploads/2017/11/2017-11-Memory-barriers.pdf

That said, finding a nice diagram and understanding the implications of
actually using it is different! I hope to properly review the
service-cores patches next week.

> > > of atomic-implementation, as it could lead to racey bugs that are
> > > likely extremely difficult to track down. (The recent race-on-exit has
> > > proven the difficulty in reproducing, and that's with an atomics model
> > > I'm quite familiar with).
> > >
> > > Let me be very clear: I don't wish to block a C11 atomic
> > > implementation, but I'd like to discuss how we (DPDK community) can
> > > best port-to and maintain a complex multi-threaded service library
> > > with best-in-class performance for the workload.
> > >
> > > To put some discussions/solutions on the table:
> > > - Shared Maintainership of a component?
> > >      Split in functionality and C11 atomics implementation
> > >      Obviously there would be collaboration required in such a case.
> > > - Maybe shared maintainership is too much?
> > >      A gentlemans/womans agreement of "helping out" with C11 atomics
> > > debug is enough?
> > I think shared maintainer ship could be too much as there are many
> changes.
> > But, I and other engineers from Arm (I would include Arm ecosystem as
> well)
> > can definitely help out on debug and reviews involving C11 APIs. (I see
> > Thomas's reply on this topic).

Thanks for the offer - as above, ball on my side, I'll go review.

<snip>
  
Mattias Rönnblom April 3, 2020, 7:23 a.m. UTC | #8
On 2020-03-17 02:17, Phil Yang wrote:
> DPDK provides generic rte_atomic APIs to do several atomic operations.
> These APIs are using the deprecated __sync built-ins and enforce full
> memory barriers on aarch64. However, full barriers are not necessary
> in many use cases. In order to address such use cases, C language offers
> C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
> by making use of the memory ordering parameter provided by the user.
> Various patches submitted in the past [2] and the patches in this series
> indicate significant performance gains on multiple aarch64 CPUs and no
> performance loss on x86.
>
> But the existing rte_atomic API implementations cannot be changed as the
> APIs do not take the memory ordering parameter. The only choice available
> is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
> order to make this change, the following steps are proposed:

First of all I must say I much support the effort of introducing C11 
atomics or something equivalent into DPDK, across the board.


What's being proposed however, is not to use C11 atomics, but rather the 
GCC built-ins designed to allow an efficient C11 atomics implementation. 
The C11 atomic API is found in <stdatomic.h>. Also, the <rte_atomic.h> 
API is not using __sync. It doesn't dictate any particular 
implementation at all.


I don't think directly accessing GCC built-ins across the whole DPDK 
code base sounds like a good idea at all.


Beyond just being plain ugly, and setting a bad precedence, using 
built-ins directly also effectively prevents API extensions. Although 
C11 is new and shiny, I'm sure there will come a day when we want to 
extend this API, to make it easier for consumers and avoid code 
duplication. Some parts of the DPDK code base already today define their 
own __atomic_* functions. Bad idea to use the "__*" namespace, 
especially in a way that has a real risk of future collisions. It's also 
confusing for anyone reading the code, since they are led to believe 
it's a GCC built-in.


Direct calls to GCC built-ins also prevents the use of any other 
implementation than the GCC built-ins, if some ISA or ISA implementation 
would benefit from this. This should be avoided of course, so it's just 
a minor objection.


I think the right way to go about this is not to deprecate 
<rte_atomic.h>. Rather, <rte_atomic.h> should be reshaped into something 
that closely maps to the GCC built-ins for C11 (which seems more 
convenient than real C11 atomics). The parts of <rte_atomic.h> that 
doesn't fit the new model, should be deprecated.


To summarize, I'm not in favor of deprecating <rte_atomic.h>. If we 
should deprecate anything, it's directly accessing compiler built-ins.

> [1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
> APIs (a script is added to flag the usages).
> [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
>
> This patchset contains:
> 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> 2) changes to programmer guide describing writing efficient code for aarch64.
> 3) changes to various libraries to make use of c11 atomic APIs.
>
> We are planning to replicate this idea across all the other libraries,
> drivers, examples, test applications. In the next phase, we will add
> changes to the mbuf, the EAL interrupts and the event timer adapter libraries.
>
> v3:
> add libatomic dependency for 32-bit clang
>
> v2:
> 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> 2. fix typos.
>
> Honnappa Nagarahalli (2):
>    service: avoid race condition for MT unsafe service
>    service: identify service running on another core correctly
>
> Phil Yang (10):
>    doc: add generic atomic deprecation section
>    devtools: prevent use of rte atomic APIs in future patches
>    eal/build: add libatomic dependency for 32-bit clang
>    build: remove redundant code
>    vhost: optimize broadcast rarp sync with c11 atomic
>    ipsec: optimize with c11 atomic for sa outbound sqn update
>    service: remove rte prefix from static functions
>    service: remove redundant code
>    service: optimize with c11 one-way barrier
>    service: relax barriers with C11 atomic operations
>
>   devtools/checkpatches.sh                         |   9 ++
>   doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
>   drivers/event/octeontx/meson.build               |   5 -
>   drivers/event/octeontx2/meson.build              |   5 -
>   drivers/event/opdl/meson.build                   |   5 -
>   lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
>   lib/librte_eal/meson.build                       |   6 +
>   lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
>   lib/librte_ipsec/sa.h                            |   2 +-
>   lib/librte_rcu/meson.build                       |   5 -
>   lib/librte_vhost/vhost.h                         |   2 +-
>   lib/librte_vhost/vhost_user.c                    |   7 +-
>   lib/librte_vhost/virtio_net.c                    |  16 ++-
>   13 files changed, 181 insertions(+), 119 deletions(-)
>