[v13,0/5] use WFE for aarch64

Message ID 1573162528-16230-1-git-send-email-david.marchand@redhat.com

Message

David Marchand Nov. 7, 2019, 9:35 p.m. UTC
DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which
lets the CPU core enter a low-power state until it is woken by an update to
the memory location being polled, thus reducing cache and memory
transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a polling loop
with rte_pause, so there are no performance differences.
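
As a rough illustration of the generic fallback described above, here is a
minimal sketch of a 32-bit wait-until-equal helper. The names
sketch_pause() and sketch_wait_until_equal_32() are hypothetical stand-ins
and not the functions added by this series; the loop shape follows the V11
note about folding the "if" into the "while", and the assert mirrors the
V13 note.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): PAUSE hint on x86, no-op elsewhere. */
static inline void sketch_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

/*
 * Poll *addr until it equals 'expected'. 'memorder' is passed as a
 * parameter, in the style of the C11-like atomic builtins.
 */
static inline void
sketch_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	/* Only acquire and relaxed loads make sense for a waiting reader. */
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);

	while (__atomic_load_n(addr, memorder) != expected)
		sketch_pause();
}
```

A caller would typically wait with __ATOMIC_ACQUIRE when the data published
before the flag update must be visible after the wait returns.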

For Arm platforms, use of WFE can be configured using the
CONFIG_RTE_ARM_USE_WFE option. It is disabled by default.

Currently, use of WFE is supported only for aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (SEV) and are less performant.
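
To make the aarch64 behaviour more concrete, the sketch below shows one
plausible shape of a WFE-based wait. It assumes that a load-exclusive arms
the exclusive monitor for the polled location and that a write by another
core clears the monitor and generates the wake-up event, so the writer does
not need to issue SEV. The function names are hypothetical and the code is
not taken from the series; on other architectures it degrades to a plain
polling loop so that it still compiles.

```c
#include <stdint.h>

/* Load *addr with acquire semantics; on aarch64 use a load-exclusive. */
static inline uint32_t sketch_load_exclusive_32(volatile uint32_t *addr)
{
	uint32_t value;
#if defined(__aarch64__)
	/* LDAXR: load-acquire exclusive, which also arms the exclusive monitor. */
	__asm__ volatile("ldaxr %w[val], [%x[ptr]]"
			 : [val] "=&r" (value)
			 : [ptr] "r" (addr)
			 : "memory");
#else
	value = __atomic_load_n(addr, __ATOMIC_ACQUIRE);
#endif
	return value;
}

/* Wait until *addr == expected, sleeping in WFE between checks on aarch64. */
static inline void
sketch_wfe_wait_32(volatile uint32_t *addr, uint32_t expected)
{
	/* Fast path: no need to sleep if the value already matches. */
	if (sketch_load_exclusive_32(addr) == expected)
		return;
#if defined(__aarch64__)
	/* SEVL sets the local event register so the first WFE returns at once. */
	__asm__ volatile("sevl" : : : "memory");
#endif
	do {
#if defined(__aarch64__)
		/* Sleep until an event arrives, e.g. the monitored line is written. */
		__asm__ volatile("wfe" : : : "memory");
#endif
	} while (sketch_load_exclusive_32(addr) != expected);
}
```

The early check before the loop corresponds to the "load and compare in the
beginning of the APIs" item in the V2 notes.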

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_ARM_USE_WFE should be enabled depending on the performance on the
target platforms.

V13:
- added release notes update,
- reworked arm implementation to avoid exporting inlines,
- added assert in generic implementation,

V12:
- remove the 'rte_' prefix from the arm specific functions (David Marchand)
- use the __atomic_load_ex_xx functions in arm specific implementations of
  APIs (David Marchand)
- remove the experimental warnings (David Marchand)
- tweak the macros working scope (David Marchand)

V11:
- add rte_ prefix to the __atomic_load_ex_x functions (Ananyev Konstantin)
- define the above rte_atomic_load_ex_x functions even if not
  RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED for future non-wfe usages (Ananyev
  Konstantin)
- use the above functions for arm specific rte_wait_until_equal_x functions
  (Ananyev Konstantin)
- simplify the generic implementation by folding the "if" into the "while"
  (Ananyev Konstantin)

V10:
- move arm specific stuff to arch/arm/rte_pause_64.h (Ananyev Konstantin)

V9:
- fix a broken web link (David Marchand)
- define rte_wfe and rte_sev() (Ananyev Konstantin)
- explicitly define three function APIs instead of macros (Ananyev Konstantin)
- incorporate common rte_wfe and rte_sev into the generic rte_spinlock (David
  Marchand)
- define arch neutral RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED (Ananyev Konstantin)
- define rte_load_ex_16/32/64 functions to use load-exclusive instruction for
  aarch64, which is required for wake up of WFE
- drop the rte_spinlock patch from this series, as it calls this
  experimental API and is included by many components, each of which would
  require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave
  it for the future, after the experimental tag is removed.

V8:
- simplify dmb definition to use io barriers (David Marchand)
- define wfe() and sev() macros and use them inside normal C code (Ananyev
  Konstantin)
- pass memorder as a parameter instead of encoding it in the function name:
  fewer functions, similar to C11 atomic intrinsics (Ananyev Konstantin)
- remove mandating RTE_FORCE_INTRINSICS in arm spinlock implementation (David
  Marchand)
- undef __WAIT_UNTIL_EQUAL after use (David Marchand)
- add experimental tag and warning (David Marchand)
- add the limitation of using WFE instruction in the commit log (David
  Marchand) 
- tweak the use of RTE_FORCE_INTRINSICS (still mandatory for aarch64) and
  RTE_ARM_USE_WFE for spinlock (David Marchand)
- drop the rte_ring patch from this series, as rte_ring.h calls this API
  and is included by many components, each of which would require
  ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the
  future, after the experimental tag is removed.

V7:
- fix the checkpatch LONG_LINE_COMMENT issue

V6:
- squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
- move the new configuration to the end of EAL
- add doxygen comments to reflect the relaxed and acquire semantics
- correct the meson configuration 

V5:
- add doxygen comments for the new APIs
- spinlock early exit without wfe if the spinlock is not taken by others
- add two patches on top for opdl and thunderx

V4:
- rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for the assembly skeleton to reduce the duplication of code
- add one patch for nxp fslmc to address a compiling error

V3:
- Convert RFCs to patches

V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline 

V1:
- Add the new APIs and use them for ring and locks

Gavin Hu (5):
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  net/thunderx: use new API to save cycles on aarch64
  event/opdl: use new API to save cycles on aarch64

 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 doc/guides/rel_notes/release_19_11.rst             |   5 +
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  |   9 +-
 drivers/event/opdl/Makefile                        |   1 +
 drivers/event/opdl/meson.build                     |   1 +
 drivers/event/opdl/opdl_ring.c                     |   5 +-
 drivers/net/thunderx/Makefile                      |   1 +
 drivers/net/thunderx/meson.build                   |   1 +
 drivers/net/thunderx/nicvf_rxtx.c                  |   3 +-
 .../common/include/arch/arm/rte_pause_64.h         | 133 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 105 ++++++++++++++++
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 13 files changed, 261 insertions(+), 12 deletions(-)
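
For readers unfamiliar with how a lock would consume the wait-until-equal
API, as the ticketlock patch listed above does, the following is a
hypothetical sketch rather than the code from that patch. The struct layout
and all sketch_* names are assumptions made for illustration, and the wait
helper is shown in its generic polling form.

```c
#include <stdint.h>

/* Hypothetical ticket lock; field names are illustrative only. */
struct sketch_ticketlock {
	uint16_t current; /* ticket currently being served */
	uint16_t next;    /* next ticket to hand out */
};

/* Stand-in for a 16-bit wait-until-equal helper (generic polling form). */
static inline void
sketch_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		;
}

static inline void sketch_ticketlock_lock(struct sketch_ticketlock *tl)
{
	/* Take a ticket, then wait until it is being served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	sketch_wait_until_equal_16(&tl->current, me, __ATOMIC_ACQUIRE);
}

static inline void sketch_ticketlock_unlock(struct sketch_ticketlock *tl)
{
	/* Only the lock holder writes 'current', so a relaxed load suffices. */
	uint16_t cur = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(cur + 1), __ATOMIC_RELEASE);
}
```

The point of routing the wait through a single helper is that an
architecture-specific version (for example one based on WFE) can replace
the polling loop without touching the lock code.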

Comments

David Marchand Jan. 17, 2020, 11:15 a.m. UTC | #1
Series applied.
Thanks.