[v6,5/7] eal: provide option to use compiler memcpy instead of RTE

Message ID 20240920102716.738940-6-mattias.ronnblom@ericsson.com (mailing list archive)
State New
Delegated to: Thomas Monjalon
Headers
Series Optionally have rte_memcpy delegate to compiler memcpy |

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Mattias Rönnblom Sept. 20, 2024, 10:27 a.m. UTC
Provide build option to have functions in <rte_memcpy.h> delegate to
the standard compiler/libc memcpy(), instead of using the various
custom DPDK, handcrafted, per-architecture rte_memcpy()
implementations.

A new meson build option 'use_cc_memcpy' is added. By default, the
traditional, custom DPDK rte_memcpy() implementation is used.

The performance benefits of the custom DPDK rte_memcpy()
implementations have been diminishing with every compiler release, and
with current toolchains the use of a custom memcpy() implementation
may even be a liability.

An additional benefit of this change is that compilers and static
analysis tools have an easier time detecting incorrect usage of
rte_memcpy() (e.g., buffer overruns, or overlapping source and
destination buffers).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

---

PATCH v6:
 * Rebase for 24.11.

PATCH v5:
 o Take a more cautious approach, setting use_cc_memcpy to disabled by
   default.
 o Fix ARM build issue in case RTE_ARCH_ARM64_MEMCPY was set.
 o Use separate macros to indicate that the rte_memcpy() is implemented
   by the compiler, and that use_cc_memcpy is set, to avoid accidental
   <rte_build_config.h> #undefs.
 o Remove redundant rte_config.h includes.

PATCH:
 o Add entry in release notes.
 o Update meson help text.

RFC v3:
 o Fix missing #endif on loongarch.
 o PPC and RISCV now implemented, meaning all architectures are supported.
 o Unnecessary <rte_vect.h> include is removed from <rte_memcpy.h>.

RFC v2:
 * Fix bug where rte_memcpy.h was not installed on x86.
 * Made attempt to make Loongarch compile.
---
 config/meson.build                     |  1 +
 doc/guides/rel_notes/release_24_11.rst | 20 +++++++++
 lib/eal/arm/include/rte_memcpy.h       |  9 ++++
 lib/eal/include/generic/rte_memcpy.h   | 61 +++++++++++++++++++++++---
 lib/eal/loongarch/include/rte_memcpy.h | 54 +----------------------
 lib/eal/ppc/include/rte_memcpy.h       |  9 ++++
 lib/eal/riscv/include/rte_memcpy.h     | 54 +----------------------
 lib/eal/x86/include/meson.build        |  1 +
 lib/eal/x86/include/rte_memcpy.h       |  9 ++++
 meson_options.txt                      |  2 +
 10 files changed, 109 insertions(+), 111 deletions(-)
  

Comments

David Marchand Oct. 4, 2024, 7:52 a.m. UTC | #1
On Fri, Sep 20, 2024 at 12:36 PM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> Provide build option to have functions in <rte_memcpy.h> delegate to
> the standard compiler/libc memcpy(), instead of using the various
> custom DPDK, handcrafted, per-architecture rte_memcpy()
> implementations.
>
> A new meson build option 'use_cc_memcpy' is added. By default, the
> traditional, custom DPDK rte_memcpy() implementation is used.
>
> The performance benefits of the custom DPDK rte_memcpy()
> implementations have been diminishing with every compiler release, and
> with current toolchains the use of a custom memcpy() implementation
> may even be a liability.
>
> An additional benefit of this change is that compilers and static
> analysis tools have an easier time detecting incorrect usage of
> rte_memcpy() (e.g., buffer overruns, or overlapping source and
> destination buffers).
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>

I like this patch and the direction we are taking: stop reinvent
memcpy and rely on compiler to optimize it.

I have some comments on the implementation.

- When I splitted headers in the early days of dpdk, the intention
with arch-specific headers in EAL was to have them include the generic
one, in all cases.
It seems that, over time, x86 rte_memcpy.h (at least) deviated from
this and stopped including generic/rte_memcpy.h...

So in this current patch, I expect every arch specific headers first
include generic/rte_memcpy.h, regardless of any arch-specific define
coming from the configuration.

An additional note on this, ARM32 and ARM64 have their own
implementation in rte_memcpy_32.h resp. rte_memcpy_64.h, and I would
check RTE_USE_CC_MEMCPY in each of them rather than in the top as
ARM32 and ARM64 are like two different arches.


- Now, looking at what was available for arches so far in DPDK:
* ARM was relying by default on compiler implementation, with specific
implementations for ARM32 and ARM64 available (see for more details
below) => possible values (default first) RTE_USE_CC_MEMCPY = true /
false
* loongarch was relying on compiler implementation, with no specific
implementations, => RTE_USE_CC_MEMCPY = true
* ppc was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
* risc was relying on compiler implementation, with no specific
implementations, => RTE_USE_CC_MEMCPY = true
* x86 was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false

We can't get a unified default value for a meson option and keep
compat for all arches (except maybe introduce a "auto" value).

Plus, disabling RTE_USE_CC_MEMCPY on loongarch and risc makes no
sense, as there was never a specific implementation.

My suggestion is to drop the meson option and instead just set
RTE_USE_CC_MEMCPY in config/$arch/meson.build.
Testers / interested users may edit config/$arch/meson.build on their own.


- Additionnally, ARM people have introduced arch-specific
implementation config options for memcpy in ARM32 resp. ARM64:
RTE_ARCH_ARM_NEON_MEMCPY resp. RTE_ARCH_ARM64_MEMCPY.
RTE_USE_CC_MEMCPY can replace those two options (we may keep some
compat in case someone relied on those defines for arm).
That removes the need for a RTE_CC_MEMCPY define.

More comments below:

[snip]

> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
> index 0ff70d9057..8be000294d 100644
> --- a/doc/guides/rel_notes/release_24_11.rst
> +++ b/doc/guides/rel_notes/release_24_11.rst
> @@ -55,6 +55,26 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
>
> +* **Compiler memcpy replaces custom DPDK implementation.**
> +
> +  The memory copy functions of ``<rte_memcpy.h>`` now optionally
> +  delegates to the standard memcpy() function, implemented by the
> +  compiler and the C runtime (e.g., libc).
> +
> +  In this release of DPDK, the handcrafted, per-architecture memory
> +  copy implementations are still the default. Compiler memcpy is
> +  enabled by setting the new ``use_cc_memcpy`` build option to true.
> +
> +  The performance benefits of the custom DPDK rte_memcpy()
> +  implementations have been diminishing with every new compiler
> +  release, and with current toolchains the use of a custom memcpy()
> +  implementation may even result in worse performance than the
> +  standard memcpy().
> +
> +  An additional benefit of using compiler memcpy is that compilers and
> +  static analysis tools have an easier time detecting incorrect usage
> +  of rte_memcpy() (e.g., buffer overruns, or overlapping source and
> +  destination buffers).

As explained in the RN comments, an entry should use the form:

   * **Add a title in the past tense with a full stop.**

     Add a short 1-2 sentence description in the past tense.
     The description should be enough to allow someone scanning
     the release notes to understand the new feature.

It seems this note is a copy/paste of the commit log, please adjust
the title and make the description shorter.

>
>  Removed Items
>  -------------

[snip]

> diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
> index e7f0f8eaa9..cfb0175bd2 100644
> --- a/lib/eal/include/generic/rte_memcpy.h
> +++ b/lib/eal/include/generic/rte_memcpy.h
> @@ -5,12 +5,19 @@
>  #ifndef _RTE_MEMCPY_H_
>  #define _RTE_MEMCPY_H_
>
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
>  /**
>   * @file
>   *
>   * Functions for vectorised implementation of memcpy().
>   */
>
> +#include <stdint.h>
> +#include <string.h>

I don't think those includes should go in a extern "C" { block.

> +
>  /**
>   * Copy 16 bytes from one location to another using optimised
>   * instructions. The locations should not overlap.
> @@ -35,8 +42,6 @@ rte_mov16(uint8_t *dst, const uint8_t *src);
>  static inline void
>  rte_mov32(uint8_t *dst, const uint8_t *src);
>
> -#ifdef __DOXYGEN__
> -

This strange check was added as not all architectures provide
rte_mov48 (/me slaps Adrien and Thomas).
I think the CI reported no issue because of a problem in the next
patch where all that is tested is RTE_USE_CC_MEMCPY = true
combination.

Still, the overall goal of this work is to drop the whole rte_memcpy
thing in the future, so I think we can live with this #ifdef
__DOXYGEN__ non sense hiding the absence of rte_mov48 in x86...


>  /**
>   * Copy 48 bytes from one location to another using optimised
>   * instructions. The locations should not overlap.
> @@ -49,8 +54,6 @@ rte_mov32(uint8_t *dst, const uint8_t *src);
>  static inline void
>  rte_mov48(uint8_t *dst, const uint8_t *src);
>
> -#endif /* __DOXYGEN__ */
> -
>  /**
>   * Copy 64 bytes from one location to another using optimised
>   * instructions. The locations should not overlap.
> @@ -87,8 +90,6 @@ rte_mov128(uint8_t *dst, const uint8_t *src);
>  static inline void
>  rte_mov256(uint8_t *dst, const uint8_t *src);
>
> -#ifdef __DOXYGEN__
> -
>  /**
>   * Copy bytes from one location to another. The locations must not overlap.
>   *
> @@ -111,6 +112,52 @@ rte_mov256(uint8_t *dst, const uint8_t *src);
>  static void *
>  rte_memcpy(void *dst, const void *src, size_t n);
>
> -#endif /* __DOXYGEN__ */

Removing this DOXYGEN here should be ok.
CI will tell us.


> diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
> index 52d2f8e969..09c2fe2485 100644
> --- a/lib/eal/x86/include/meson.build
> +++ b/lib/eal/x86/include/meson.build
> @@ -16,6 +16,7 @@ arch_headers = files(
>          'rte_spinlock.h',
>          'rte_vect.h',
>  )
> +

Unrelated change.


>  arch_indirect_headers = files(
>          'rte_atomic_32.h',
>          'rte_atomic_64.h',
  
Mattias Rönnblom Oct. 4, 2024, 9:21 a.m. UTC | #2
On 2024-10-04 09:52, David Marchand wrote:
> On Fri, Sep 20, 2024 at 12:36 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> Provide build option to have functions in <rte_memcpy.h> delegate to
>> the standard compiler/libc memcpy(), instead of using the various
>> custom DPDK, handcrafted, per-architecture rte_memcpy()
>> implementations.
>>
>> A new meson build option 'use_cc_memcpy' is added. By default, the
>> traditional, custom DPDK rte_memcpy() implementation is used.
>>
>> The performance benefits of the custom DPDK rte_memcpy()
>> implementations have been diminishing with every compiler release, and
>> with current toolchains the use of a custom memcpy() implementation
>> may even be a liability.
>>
>> An additional benefit of this change is that compilers and static
>> analysis tools have an easier time detecting incorrect usage of
>> rte_memcpy() (e.g., buffer overruns, or overlapping source and
>> destination buffers).
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> I like this patch and the direction we are taking: stop reinvent
> memcpy and rely on compiler to optimize it.
> 
> I have some comments on the implementation.
> 
> - When I splitted headers in the early days of dpdk, the intention
> with arch-specific headers in EAL was to have them include the generic
> one, in all cases.
> It seems that, over time, x86 rte_memcpy.h (at least) deviated from
> this and stopped including generic/rte_memcpy.h...
> 
> So in this current patch, I expect every arch specific headers first
> include generic/rte_memcpy.h, regardless of any arch-specific define
> coming from the configuration.
> 
> An additional note on this, ARM32 and ARM64 have their own
> implementation in rte_memcpy_32.h resp. rte_memcpy_64.h, and I would
> check RTE_USE_CC_MEMCPY in each of them rather than in the top as
> ARM32 and ARM64 are like two different arches.
> 
> 
> - Now, looking at what was available for arches so far in DPDK:
> * ARM was relying by default on compiler implementation, with specific
> implementations for ARM32 and ARM64 available (see for more details
> below) => possible values (default first) RTE_USE_CC_MEMCPY = true /
> false
> * loongarch was relying on compiler implementation, with no specific
> implementations, => RTE_USE_CC_MEMCPY = true
> * ppc was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
> * risc was relying on compiler implementation, with no specific
> implementations, => RTE_USE_CC_MEMCPY = true
> * x86 was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
> 
> We can't get a unified default value for a meson option and keep
> compat for all arches (except maybe introduce a "auto" value).
> 
> Plus, disabling RTE_USE_CC_MEMCPY on loongarch and risc makes no
> sense, as there was never a specific implementation.
> 
> My suggestion is to drop the meson option and instead just set
> RTE_USE_CC_MEMCPY in config/$arch/meson.build.
> Testers / interested users may edit config/$arch/meson.build on their own.
> 

So we've gone from...

"Eliminate DPDK custom per-arch memcpy altogether"
to
"Keep custom memcpy, but make cc memcpy the default"
to
"Keep custom memcpy as the default, but make cc memcpy a build option"
to
"Keep custom memcpy as the default, and have the user modify some 
obscure build file to use cc memcpy"

I seems like the natural next step is just

"Keep the custom memcpy. Period."

If we intend to keep the custom DPDK memcpy implementations 
indefinitely, we should just provide an option to use CC memcpy on x86 
as well, just like on ARM.

That would go against the original intention of this patch set, which 
was to reduce DPDK complexity (and hopefully improve performance as 
well, on average).

> 
> - Additionnally, ARM people have introduced arch-specific
> implementation config options for memcpy in ARM32 resp. ARM64:
> RTE_ARCH_ARM_NEON_MEMCPY resp. RTE_ARCH_ARM64_MEMCPY.
> RTE_USE_CC_MEMCPY can replace those two options (we may keep some
> compat in case someone relied on those defines for arm).
> That removes the need for a RTE_CC_MEMCPY define.
> 
> More comments below:
> 
> [snip]
> 
>> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
>> index 0ff70d9057..8be000294d 100644
>> --- a/doc/guides/rel_notes/release_24_11.rst
>> +++ b/doc/guides/rel_notes/release_24_11.rst
>> @@ -55,6 +55,26 @@ New Features
>>        Also, make sure to start the actual text at the margin.
>>        =======================================================
>>
>> +* **Compiler memcpy replaces custom DPDK implementation.**
>> +
>> +  The memory copy functions of ``<rte_memcpy.h>`` now optionally
>> +  delegates to the standard memcpy() function, implemented by the
>> +  compiler and the C runtime (e.g., libc).
>> +
>> +  In this release of DPDK, the handcrafted, per-architecture memory
>> +  copy implementations are still the default. Compiler memcpy is
>> +  enabled by setting the new ``use_cc_memcpy`` build option to true.
>> +
>> +  The performance benefits of the custom DPDK rte_memcpy()
>> +  implementations have been diminishing with every new compiler
>> +  release, and with current toolchains the use of a custom memcpy()
>> +  implementation may even result in worse performance than the
>> +  standard memcpy().
>> +
>> +  An additional benefit of using compiler memcpy is that compilers and
>> +  static analysis tools have an easier time detecting incorrect usage
>> +  of rte_memcpy() (e.g., buffer overruns, or overlapping source and
>> +  destination buffers).
> 
> As explained in the RN comments, an entry should use the form:
> 
>     * **Add a title in the past tense with a full stop.**
> 
>       Add a short 1-2 sentence description in the past tense.
>       The description should be enough to allow someone scanning
>       the release notes to understand the new feature.
> 
> It seems this note is a copy/paste of the commit log, please adjust
> the title and make the description shorter.
> 
>>
>>   Removed Items
>>   -------------
> 
> [snip]
> 
>> diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
>> index e7f0f8eaa9..cfb0175bd2 100644
>> --- a/lib/eal/include/generic/rte_memcpy.h
>> +++ b/lib/eal/include/generic/rte_memcpy.h
>> @@ -5,12 +5,19 @@
>>   #ifndef _RTE_MEMCPY_H_
>>   #define _RTE_MEMCPY_H_
>>
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>>   /**
>>    * @file
>>    *
>>    * Functions for vectorised implementation of memcpy().
>>    */
>>
>> +#include <stdint.h>
>> +#include <string.h>
> 
> I don't think those includes should go in a extern "C" { block.
> 
>> +
>>   /**
>>    * Copy 16 bytes from one location to another using optimised
>>    * instructions. The locations should not overlap.
>> @@ -35,8 +42,6 @@ rte_mov16(uint8_t *dst, const uint8_t *src);
>>   static inline void
>>   rte_mov32(uint8_t *dst, const uint8_t *src);
>>
>> -#ifdef __DOXYGEN__
>> -
> 
> This strange check was added as not all architectures provide
> rte_mov48 (/me slaps Adrien and Thomas).
> I think the CI reported no issue because of a problem in the next
> patch where all that is tested is RTE_USE_CC_MEMCPY = true
> combination.
> 
> Still, the overall goal of this work is to drop the whole rte_memcpy
> thing in the future, so I think we can live with this #ifdef
> __DOXYGEN__ non sense hiding the absence of rte_mov48 in x86...
> 
> 
>>   /**
>>    * Copy 48 bytes from one location to another using optimised
>>    * instructions. The locations should not overlap.
>> @@ -49,8 +54,6 @@ rte_mov32(uint8_t *dst, const uint8_t *src);
>>   static inline void
>>   rte_mov48(uint8_t *dst, const uint8_t *src);
>>
>> -#endif /* __DOXYGEN__ */
>> -
>>   /**
>>    * Copy 64 bytes from one location to another using optimised
>>    * instructions. The locations should not overlap.
>> @@ -87,8 +90,6 @@ rte_mov128(uint8_t *dst, const uint8_t *src);
>>   static inline void
>>   rte_mov256(uint8_t *dst, const uint8_t *src);
>>
>> -#ifdef __DOXYGEN__
>> -
>>   /**
>>    * Copy bytes from one location to another. The locations must not overlap.
>>    *
>> @@ -111,6 +112,52 @@ rte_mov256(uint8_t *dst, const uint8_t *src);
>>   static void *
>>   rte_memcpy(void *dst, const void *src, size_t n);
>>
>> -#endif /* __DOXYGEN__ */
> 
> Removing this DOXYGEN here should be ok.
> CI will tell us.
> 
> 
>> diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
>> index 52d2f8e969..09c2fe2485 100644
>> --- a/lib/eal/x86/include/meson.build
>> +++ b/lib/eal/x86/include/meson.build
>> @@ -16,6 +16,7 @@ arch_headers = files(
>>           'rte_spinlock.h',
>>           'rte_vect.h',
>>   )
>> +
> 
> Unrelated change.
> 
> 
>>   arch_indirect_headers = files(
>>           'rte_atomic_32.h',
>>           'rte_atomic_64.h',
> 
>
  
Mattias Rönnblom Oct. 4, 2024, 9:27 a.m. UTC | #3
On 2024-10-04 09:52, David Marchand wrote:
> On Fri, Sep 20, 2024 at 12:36 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> Provide build option to have functions in <rte_memcpy.h> delegate to
>> the standard compiler/libc memcpy(), instead of using the various
>> custom DPDK, handcrafted, per-architecture rte_memcpy()
>> implementations.
>>
>> A new meson build option 'use_cc_memcpy' is added. By default, the
>> traditional, custom DPDK rte_memcpy() implementation is used.
>>
>> The performance benefits of the custom DPDK rte_memcpy()
>> implementations have been diminishing with every compiler release, and
>> with current toolchains the use of a custom memcpy() implementation
>> may even be a liability.
>>
>> An additional benefit of this change is that compilers and static
>> analysis tools have an easier time detecting incorrect usage of
>> rte_memcpy() (e.g., buffer overruns, or overlapping source and
>> destination buffers).
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> I like this patch and the direction we are taking: stop reinvent
> memcpy and rely on compiler to optimize it.
> 
> I have some comments on the implementation.
> 
> - When I splitted headers in the early days of dpdk, the intention
> with arch-specific headers in EAL was to have them include the generic
> one, in all cases.
> It seems that, over time, x86 rte_memcpy.h (at least) deviated from
> this and stopped including generic/rte_memcpy.h...
> 
> So in this current patch, I expect every arch specific headers first
> include generic/rte_memcpy.h, regardless of any arch-specific define
> coming from the configuration.
> 
> An additional note on this, ARM32 and ARM64 have their own
> implementation in rte_memcpy_32.h resp. rte_memcpy_64.h, and I would
> check RTE_USE_CC_MEMCPY in each of them rather than in the top as
> ARM32 and ARM64 are like two different arches.
> 
> 
> - Now, looking at what was available for arches so far in DPDK:
> * ARM was relying by default on compiler implementation, with specific
> implementations for ARM32 and ARM64 available (see for more details
> below) => possible values (default first) RTE_USE_CC_MEMCPY = true /
> false
> * loongarch was relying on compiler implementation, with no specific
> implementations, => RTE_USE_CC_MEMCPY = true
> * ppc was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
> * risc was relying on compiler implementation, with no specific
> implementations, => RTE_USE_CC_MEMCPY = true
> * x86 was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
> 
> We can't get a unified default value for a meson option and keep
> compat for all arches (except maybe introduce a "auto" value).
> 

What if you just renamed RTE_USE_CC_MEMCPY to
RTE_ALWAYS_USE_CC_MEMCPY
RTE_FORCE_CC_MEMCPY

Then the naming would better match a scenario where cc memcpy may be the 
only option.

> Plus, disabling RTE_USE_CC_MEMCPY on loongarch and risc makes no
> sense, as there was never a specific implementation.
> 
> My suggestion is to drop the meson option and instead just set
> RTE_USE_CC_MEMCPY in config/$arch/meson.build.
> Testers / interested users may edit config/$arch/meson.build on their own.
> 
> 
> - Additionnally, ARM people have introduced arch-specific
> implementation config options for memcpy in ARM32 resp. ARM64:
> RTE_ARCH_ARM_NEON_MEMCPY resp. RTE_ARCH_ARM64_MEMCPY.
> RTE_USE_CC_MEMCPY can replace those two options (we may keep some
> compat in case someone relied on those defines for arm).
> That removes the need for a RTE_CC_MEMCPY define.
> 
> More comments below:
> 
> [snip]
> 
>> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
>> index 0ff70d9057..8be000294d 100644
>> --- a/doc/guides/rel_notes/release_24_11.rst
>> +++ b/doc/guides/rel_notes/release_24_11.rst
>> @@ -55,6 +55,26 @@ New Features
>>        Also, make sure to start the actual text at the margin.
>>        =======================================================
>>
>> +* **Compiler memcpy replaces custom DPDK implementation.**
>> +
>> +  The memory copy functions of ``<rte_memcpy.h>`` now optionally
>> +  delegates to the standard memcpy() function, implemented by the
>> +  compiler and the C runtime (e.g., libc).
>> +
>> +  In this release of DPDK, the handcrafted, per-architecture memory
>> +  copy implementations are still the default. Compiler memcpy is
>> +  enabled by setting the new ``use_cc_memcpy`` build option to true.
>> +
>> +  The performance benefits of the custom DPDK rte_memcpy()
>> +  implementations have been diminishing with every new compiler
>> +  release, and with current toolchains the use of a custom memcpy()
>> +  implementation may even result in worse performance than the
>> +  standard memcpy().
>> +
>> +  An additional benefit of using compiler memcpy is that compilers and
>> +  static analysis tools have an easier time detecting incorrect usage
>> +  of rte_memcpy() (e.g., buffer overruns, or overlapping source and
>> +  destination buffers).
> 
> As explained in the RN comments, an entry should use the form:
> 
>     * **Add a title in the past tense with a full stop.**
> 
>       Add a short 1-2 sentence description in the past tense.
>       The description should be enough to allow someone scanning
>       the release notes to understand the new feature.
> 
> It seems this note is a copy/paste of the commit log, please adjust
> the title and make the description shorter.
> 
>>
>>   Removed Items
>>   -------------
> 
> [snip]
> 
>> diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
>> index e7f0f8eaa9..cfb0175bd2 100644
>> --- a/lib/eal/include/generic/rte_memcpy.h
>> +++ b/lib/eal/include/generic/rte_memcpy.h
>> @@ -5,12 +5,19 @@
>>   #ifndef _RTE_MEMCPY_H_
>>   #define _RTE_MEMCPY_H_
>>
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>>   /**
>>    * @file
>>    *
>>    * Functions for vectorised implementation of memcpy().
>>    */
>>
>> +#include <stdint.h>
>> +#include <string.h>
> 
> I don't think those includes should go in a extern "C" { block.
> 
>> +
>>   /**
>>    * Copy 16 bytes from one location to another using optimised
>>    * instructions. The locations should not overlap.
>> @@ -35,8 +42,6 @@ rte_mov16(uint8_t *dst, const uint8_t *src);
>>   static inline void
>>   rte_mov32(uint8_t *dst, const uint8_t *src);
>>
>> -#ifdef __DOXYGEN__
>> -
> 
> This strange check was added as not all architectures provide
> rte_mov48 (/me slaps Adrien and Thomas).
> I think the CI reported no issue because of a problem in the next
> patch where all that is tested is RTE_USE_CC_MEMCPY = true
> combination.
> 
> Still, the overall goal of this work is to drop the whole rte_memcpy
> thing in the future, so I think we can live with this #ifdef
> __DOXYGEN__ non sense hiding the absence of rte_mov48 in x86...
> 
> 
>>   /**
>>    * Copy 48 bytes from one location to another using optimised
>>    * instructions. The locations should not overlap.
>> @@ -49,8 +54,6 @@ rte_mov32(uint8_t *dst, const uint8_t *src);
>>   static inline void
>>   rte_mov48(uint8_t *dst, const uint8_t *src);
>>
>> -#endif /* __DOXYGEN__ */
>> -
>>   /**
>>    * Copy 64 bytes from one location to another using optimised
>>    * instructions. The locations should not overlap.
>> @@ -87,8 +90,6 @@ rte_mov128(uint8_t *dst, const uint8_t *src);
>>   static inline void
>>   rte_mov256(uint8_t *dst, const uint8_t *src);
>>
>> -#ifdef __DOXYGEN__
>> -
>>   /**
>>    * Copy bytes from one location to another. The locations must not overlap.
>>    *
>> @@ -111,6 +112,52 @@ rte_mov256(uint8_t *dst, const uint8_t *src);
>>   static void *
>>   rte_memcpy(void *dst, const void *src, size_t n);
>>
>> -#endif /* __DOXYGEN__ */
> 
> Removing this DOXYGEN here should be ok.
> CI will tell us.
> 
> 
>> diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
>> index 52d2f8e969..09c2fe2485 100644
>> --- a/lib/eal/x86/include/meson.build
>> +++ b/lib/eal/x86/include/meson.build
>> @@ -16,6 +16,7 @@ arch_headers = files(
>>           'rte_spinlock.h',
>>           'rte_vect.h',
>>   )
>> +
> 
> Unrelated change.
> 
> 
>>   arch_indirect_headers = files(
>>           'rte_atomic_32.h',
>>           'rte_atomic_64.h',
> 
>
  
David Marchand Oct. 4, 2024, 9:54 a.m. UTC | #4
On Fri, Oct 4, 2024 at 11:21 AM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> > - Now, looking at what was available for arches so far in DPDK:
> > * ARM was relying by default on compiler implementation, with specific
> > implementations for ARM32 and ARM64 available (see for more details
> > below) => possible values (default first) RTE_USE_CC_MEMCPY = true /
> > false
> > * loongarch was relying on compiler implementation, with no specific
> > implementations, => RTE_USE_CC_MEMCPY = true
> > * ppc was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
> > * risc was relying on compiler implementation, with no specific
> > implementations, => RTE_USE_CC_MEMCPY = true
> > * x86 was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
> >
> > We can't get a unified default value for a meson option and keep
> > compat for all arches (except maybe introduce a "auto" value).
> >
> > Plus, disabling RTE_USE_CC_MEMCPY on loongarch and risc makes no
> > sense, as there was never a specific implementation.
> >
> > My suggestion is to drop the meson option and instead just set
> > RTE_USE_CC_MEMCPY in config/$arch/meson.build.
> > Testers / interested users may edit config/$arch/meson.build on their own.
> >
>
> So we've gone from...
>
> "Eliminate DPDK custom per-arch memcpy altogether"
> to
> "Keep custom memcpy, but make cc memcpy the default"
> to
> "Keep custom memcpy as the default, but make cc memcpy a build option"
> to
> "Keep custom memcpy as the default, and have the user modify some
> obscure build file to use cc memcpy"
>
> I seems like the natural next step is just
>
> "Keep the custom memcpy. Period."

Well, the current implementation has holes, that I tried to list so we
can move forward.

About adding a meson option, we try to have as less of them as
possible to reduce complexity.
And this is why an obscure option is probably the best so that
performance tests can be run with the compiler, without breaking
existing users.


If what I replied is irrelevant to others, well, I will let others
review _*in* *depth*_ and Thomas can merge the series.

Thanks.
  
Thomas Monjalon Oct. 4, 2024, 12:07 p.m. UTC | #5
04/10/2024 11:54, David Marchand:
> On Fri, Oct 4, 2024 at 11:21 AM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> > > - Now, looking at what was available for arches so far in DPDK:
> > > * ARM was relying by default on compiler implementation, with specific
> > > implementations for ARM32 and ARM64 available (see for more details
> > > below) => possible values (default first) RTE_USE_CC_MEMCPY = true /
> > > false
> > > * loongarch was relying on compiler implementation, with no specific
> > > implementations, => RTE_USE_CC_MEMCPY = true
> > > * ppc was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
> > > * risc was relying on compiler implementation, with no specific
> > > implementations, => RTE_USE_CC_MEMCPY = true
> > > * x86 was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
> > >
> > > We can't get a unified default value for a meson option and keep
> > > compat for all arches (except maybe introduce a "auto" value).
> > >
> > > Plus, disabling RTE_USE_CC_MEMCPY on loongarch and risc makes no
> > > sense, as there was never a specific implementation.
> > >
> > > My suggestion is to drop the meson option and instead just set
> > > RTE_USE_CC_MEMCPY in config/$arch/meson.build.
> > > Testers / interested users may edit config/$arch/meson.build on their own.
> > >
> >
> > So we've gone from...
> >
> > "Eliminate DPDK custom per-arch memcpy altogether"
> > to
> > "Keep custom memcpy, but make cc memcpy the default"
> > to
> > "Keep custom memcpy as the default, but make cc memcpy a build option"
> > to
> > "Keep custom memcpy as the default, and have the user modify some
> > obscure build file to use cc memcpy"
> >
> > I seems like the natural next step is just
> >
> > "Keep the custom memcpy. Period."
> 
> Well, the current implementation has holes, that I tried to list so we
> can move forward.
> 
> About adding a meson option, we try to have as less of them as
> possible to reduce complexity.
> And this is why an obscure option is probably the best so that
> performance tests can be run with the compiler, without breaking
> existing users.
> 
> 
> If what I replied is irrelevant to others, well, I will let others
> review _*in* *depth*_ and Thomas can merge the series.

No I won't merge it as is.

Mattias, please take a breath and think again about the comments David did.

We support the move to CC memcpy.
The CC memcpy was already used for some arches.
We agree having it as an option for all arches is a good step forward.
Having a compilation define to do some benchmark on any arch is good.
When it will become good enough, we will enable it on more arches by default.

I think it looks very reasonable, we just ask you to prepare the future
without breaking anything for now.
Then it will be very simple patches to decide enabling it more.
And because we want CC memcpy to become the only option,
it does not make sense to introduce a new option for the user.
  

Patch

diff --git a/config/meson.build b/config/meson.build
index 8c8b019c25..456056628e 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -353,6 +353,7 @@  endforeach
 # set other values pulled from the build options
 dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
 dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
+dpdk_conf.set('RTE_USE_CC_MEMCPY', get_option('use_cc_memcpy'))
 dpdk_conf.set('RTE_ENABLE_STDATOMIC', get_option('enable_stdatomic'))
 dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
 dpdk_conf.set('RTE_PKTMBUF_HEADROOM', get_option('pkt_mbuf_headroom'))
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 0ff70d9057..8be000294d 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -55,6 +55,26 @@  New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Compiler memcpy replaces custom DPDK implementation.**
+
+  The memory copy functions of ``<rte_memcpy.h>`` now optionally
+  delegates to the standard memcpy() function, implemented by the
+  compiler and the C runtime (e.g., libc).
+
+  In this release of DPDK, the handcrafted, per-architecture memory
+  copy implementations are still the default. Compiler memcpy is
+  enabled by setting the new ``use_cc_memcpy`` build option to true.
+
+  The performance benefits of the custom DPDK rte_memcpy()
+  implementations have been diminishing with every new compiler
+  release, and with current toolchains the use of a custom memcpy()
+  implementation may even result in worse performance than the
+  standard memcpy().
+
+  An additional benefit of using compiler memcpy is that compilers and
+  static analysis tools have an easier time detecting incorrect usage
+  of rte_memcpy() (e.g., buffer overruns, or overlapping source and
+  destination buffers).
 
 Removed Items
 -------------
diff --git a/lib/eal/arm/include/rte_memcpy.h b/lib/eal/arm/include/rte_memcpy.h
index 47dea9a8cc..5d2ea7dbfa 100644
--- a/lib/eal/arm/include/rte_memcpy.h
+++ b/lib/eal/arm/include/rte_memcpy.h
@@ -5,10 +5,19 @@ 
 #ifndef _RTE_MEMCPY_ARM_H_
 #define _RTE_MEMCPY_ARM_H_
 
+#if defined(RTE_USE_CC_MEMCPY) || !defined(RTE_ARCH_ARM64_MEMCPY)
+
+#define RTE_CC_MEMCPY
+#include <generic/rte_memcpy.h>
+
+#else
+
 #ifdef RTE_ARCH_64
 #include <rte_memcpy_64.h>
 #else
 #include <rte_memcpy_32.h>
 #endif
 
+#endif /* RTE_USE_CC_MEMCPY */
+
 #endif /* _RTE_MEMCPY_ARM_H_ */
diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
index e7f0f8eaa9..cfb0175bd2 100644
--- a/lib/eal/include/generic/rte_memcpy.h
+++ b/lib/eal/include/generic/rte_memcpy.h
@@ -5,12 +5,19 @@ 
 #ifndef _RTE_MEMCPY_H_
 #define _RTE_MEMCPY_H_
 
+#ifdef __cplusplus
+extern "C" {
+#endif
+
 /**
  * @file
  *
  * Functions for vectorised implementation of memcpy().
  */
 
+#include <stdint.h>
+#include <string.h>
+
 /**
  * Copy 16 bytes from one location to another using optimised
  * instructions. The locations should not overlap.
@@ -35,8 +42,6 @@  rte_mov16(uint8_t *dst, const uint8_t *src);
 static inline void
 rte_mov32(uint8_t *dst, const uint8_t *src);
 
-#ifdef __DOXYGEN__
-
 /**
  * Copy 48 bytes from one location to another using optimised
  * instructions. The locations should not overlap.
@@ -49,8 +54,6 @@  rte_mov32(uint8_t *dst, const uint8_t *src);
 static inline void
 rte_mov48(uint8_t *dst, const uint8_t *src);
 
-#endif /* __DOXYGEN__ */
-
 /**
  * Copy 64 bytes from one location to another using optimised
  * instructions. The locations should not overlap.
@@ -87,8 +90,6 @@  rte_mov128(uint8_t *dst, const uint8_t *src);
 static inline void
 rte_mov256(uint8_t *dst, const uint8_t *src);
 
-#ifdef __DOXYGEN__
-
 /**
  * Copy bytes from one location to another. The locations must not overlap.
  *
@@ -111,6 +112,52 @@  rte_mov256(uint8_t *dst, const uint8_t *src);
 static void *
 rte_memcpy(void *dst, const void *src, size_t n);
 
-#endif /* __DOXYGEN__ */
+#ifdef RTE_CC_MEMCPY
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	memcpy(dst, src, 16);
+}
+
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	memcpy(dst, src, 32);
+}
+
+static inline void
+rte_mov48(uint8_t *dst, const uint8_t *src)
+{
+	memcpy(dst, src, 48);
+}
+
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	memcpy(dst, src, 64);
+}
+
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	memcpy(dst, src, 128);
+}
+
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	memcpy(dst, src, 256);
+}
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+	return memcpy(dst, src, n);
+}
+#endif /* RTE_CC_MEMCPY */
+
+#ifdef __cplusplus
+}
+#endif
 
 #endif /* _RTE_MEMCPY_H_ */
diff --git a/lib/eal/loongarch/include/rte_memcpy.h b/lib/eal/loongarch/include/rte_memcpy.h
index 22578d40f4..4e6027caee 100644
--- a/lib/eal/loongarch/include/rte_memcpy.h
+++ b/lib/eal/loongarch/include/rte_memcpy.h
@@ -5,57 +5,7 @@ 
 #ifndef RTE_MEMCPY_LOONGARCH_H
 #define RTE_MEMCPY_LOONGARCH_H
 
-#include <stdint.h>
-#include <string.h>
-
-#include "rte_common.h"
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-#include "generic/rte_memcpy.h"
-
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 16);
-}
-
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 32);
-}
-
-static inline void
-rte_mov48(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 48);
-}
-
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 64);
-}
-
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 128);
-}
-
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 256);
-}
-
-#define rte_memcpy(d, s, n)	memcpy((d), (s), (n))
-
-#ifdef __cplusplus
-}
-#endif
+#define RTE_CC_MEMCPY
+#include <generic/rte_memcpy.h>
 
 #endif /* RTE_MEMCPY_LOONGARCH_H */
diff --git a/lib/eal/ppc/include/rte_memcpy.h b/lib/eal/ppc/include/rte_memcpy.h
index 6f388c0234..162c1483f5 100644
--- a/lib/eal/ppc/include/rte_memcpy.h
+++ b/lib/eal/ppc/include/rte_memcpy.h
@@ -6,6 +6,13 @@ 
 #ifndef _RTE_MEMCPY_PPC_64_H_
 #define _RTE_MEMCPY_PPC_64_H_
 
+#ifdef RTE_USE_CC_MEMCPY
+
+#define RTE_CC_MEMCPY
+#include <generic/rte_memcpy.h>
+
+#else
+
 #include <stdint.h>
 #include <string.h>
 
@@ -215,4 +222,6 @@  rte_memcpy_func(void *dst, const void *src, size_t n)
 }
 #endif
 
+#endif /* RTE_USE_CC_MEMCPY */
+
 #endif /* _RTE_MEMCPY_PPC_64_H_ */
diff --git a/lib/eal/riscv/include/rte_memcpy.h b/lib/eal/riscv/include/rte_memcpy.h
index e34f19396e..7f6c07d090 100644
--- a/lib/eal/riscv/include/rte_memcpy.h
+++ b/lib/eal/riscv/include/rte_memcpy.h
@@ -7,57 +7,7 @@ 
 #ifndef RTE_MEMCPY_RISCV_H
 #define RTE_MEMCPY_RISCV_H
 
-#include <stdint.h>
-#include <string.h>
-
-#include "rte_common.h"
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-#include "generic/rte_memcpy.h"
-
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 16);
-}
-
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 32);
-}
-
-static inline void
-rte_mov48(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 48);
-}
-
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 64);
-}
-
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 128);
-}
-
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	memcpy(dst, src, 256);
-}
-
-#define rte_memcpy(d, s, n)	memcpy((d), (s), (n))
-
-#ifdef __cplusplus
-}
-#endif
+#define RTE_CC_MEMCPY
+#include <generic/rte_memcpy.h>
 
 #endif /* RTE_MEMCPY_RISCV_H */
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..09c2fe2485 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -16,6 +16,7 @@  arch_headers = files(
         'rte_spinlock.h',
         'rte_vect.h',
 )
+
 arch_indirect_headers = files(
         'rte_atomic_32.h',
         'rte_atomic_64.h',
diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index 42058e4a3f..2d9f5954f1 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -11,6 +11,13 @@ 
  * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
  */
 
+#ifdef RTE_USE_CC_MEMCPY
+
+#define RTE_CC_MEMCPY
+#include <generic/rte_memcpy.h>
+
+#else
+
 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>
@@ -767,4 +774,6 @@  rte_memcpy(void *dst, const void *src, size_t n)
 }
 #endif
 
+#endif /* RTE_USE_CC_MEMCPY */
+
 #endif /* _RTE_MEMCPY_X86_64_H_ */
diff --git a/meson_options.txt b/meson_options.txt
index e49b2fc089..69a01f6578 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -60,3 +60,5 @@  option('tests', type: 'boolean', value: true, description:
        'build unit tests')
 option('use_hpet', type: 'boolean', value: false, description:
        'use HPET timer in EAL')
+option('use_cc_memcpy', type: 'boolean', value: false, description:
+       'Have the functions of <rte_memcpy.h> delegate to compiler/libc memcpy() instead of using custom implementation.')