[v4] eal: non-temporal memcpy

Message ID 20221010064600.16495-1-mb@smartsharesystems.com (mailing list archive)
State Changes Requested, archived
Delegated to: Thomas Monjalon
Series: [v4] eal: non-temporal memcpy

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/iol-mellanox-Performance success Performance Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-intel-Performance success Performance Testing PASS
ci/github-robot: build fail github build: failed
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/intel-Testing success Testing PASS

Commit Message

Morten Brørup Oct. 10, 2022, 6:46 a.m. UTC
  This patch provides a function for memory copy using non-temporal store,
load or both, controlled by flags passed to the function.

Applications sometimes copy data to another memory location, which is only
used much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.
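
For illustration, such a capture path could use the new function along
these lines (sketch only; capture_buf, m and len are hypothetical):

    /* Copy packet data into the capture buffer without polluting the
     * data cache; the copied data is not accessed again any time soon.
     */
    rte_memcpy_ex(capture_buf, rte_pktmbuf_mtod(m, const void *), len,
                  RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);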

The purpose of the function is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput can be improved by further optimization, I do not
have time to do it now.

The functional tests and performance tests for memory copy have been
expanded to include non-temporal copying.

A non-temporal version of the mbuf library's function to create a full
copy of a given packet mbuf is provided.

The packet capture and packet dump libraries have been updated to use
non-temporal memory copy of the packets.

Implementation notes:

Implementations for non-x86 architectures can be provided by anyone at a
later time. I am not going to do it.

x86 non-temporal load instructions must be 16 byte aligned [1], and
non-temporal store instructions must be 4, 8 or 16 byte aligned [2].

ARM non-temporal load and store instructions seem to require 4 byte
alignment [3].

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
index.html#text=_mm_stream_load
[2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
index.html#text=_mm_stream_si
[3] https://developer.arm.com/documentation/100076/0100/
A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
LDNP--SIMD-and-FP-
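
For example, a caller that knows both pointers and the length are 16 byte
aligned could hint this via the flags introduced by this patch (sketch):

    rte_memcpy_ex(dst, src, len, RTE_MEMOPS_F_DST_NT |
                  RTE_MEMOPS_F_SRC16A | RTE_MEMOPS_F_DST16A |
                  RTE_MEMOPS_F_LEN16A);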

This patch is a major rewrite from the RFC v3, so no version log comparing
to the RFC is provided.

v4
* Also ignore the warning for clang in the workaround for
  _mm_stream_load_si128() missing const in the parameter.
* Add missing C linkage specifier in rte_memcpy.h.

v3
* _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
  use it on 64-bit x86 architecture.
* CLANG warns that _mm_stream_load_si128_const() and
  rte_memcpy_nt_15_or_less_s16a() are not public,
  so remove __rte_internal from them. It also affects the documentation
  for the functions, so the fix can't be limited to CLANG.
* Use __rte_experimental instead of __rte_internal.
* Replace <n> with nnn in function documentation; it doesn't look like
  HTML.
* Slightly modify the workaround for _mm_stream_load_si128() missing const
  in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
  #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
  #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
* Fixed one coding style issue missed in v2.

v2
* The last 16 byte block of data, incl. any trailing bytes, was not
  copied from the source memory area in rte_memcpy_nt_buf().
* Fix many coding style issues.
* Add some missing header files.
* Fix build time warning for non-x86 architectures by using a different
  method to mark the flags parameter unused.
* CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
  so omit it when using CLANG.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_memcpy.c               |   65 +-
 app/test/test_memcpy_perf.c          |  187 ++--
 lib/eal/include/generic/rte_memcpy.h |  127 +++
 lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
 lib/mbuf/rte_mbuf.c                  |   77 ++
 lib/mbuf/rte_mbuf.h                  |   32 +
 lib/mbuf/version.map                 |    1 +
 lib/pcapng/rte_pcapng.c              |    3 +-
 lib/pdump/rte_pdump.c                |    6 +-
 9 files changed, 1645 insertions(+), 91 deletions(-)
  

Comments

Mattias Rönnblom Oct. 16, 2022, 2:27 p.m. UTC | #1
On 2022-10-10 08:46, Morten Brørup wrote:
> This patch provides a function for memory copy using non-temporal store,
> load or both, controlled by flags passed to the function.
> 
> Applications sometimes copy data to another memory location, which is only
> used much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
> 
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
> 
> The purpose of the function is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput can be improved by further optimization, I do not
> have time to do it now.
> 

The above section is a little repetitive, and only indirectly explains 
what NT loads/stores are.

"This patch provides a new function rte_memcpy_ex() for copying data 
between non-overlapping memory regions. The primary aim of 
rte_memcpy_ex() is to provide a drop-in replacement for rte_memcpy() (and
memcpy()), where the user may opt for loads and/or stores with
non-temporal hints to be used while copying the data.

By using a non-temporal hint, the program informs the system that it
does not intend to access the data again any time soon.

This in turn allows the CPU to bypass the caches, or by other means
prevent this unlikely-to-be-used-soon data from evicting cache lines or
forcing future evictions of more useful cache lines."

You should also say something about the memory ordering issue.

> The functional tests and performance tests for memory copy have been
> expanded to include non-temporal copying.
> 
> A non-temporal version of the mbuf library's function to create a full
> copy of a given packet mbuf is provided.
> 
> The packet capture and packet dump libraries have been updated to use
> non-temporal memory copy of the packets.
> 
> Implementation notes:
> 
> Implementations for non-x86 architectures can be provided by anyone at a
> later time. I am not going to do it.
> 
> x86 non-temporal load instructions must be 16 byte aligned [1], and
> non-temporal store instructions must be 4, 8 or 16 byte aligned [2].
> 
> ARM non-temporal load and store instructions seem to require 4 byte
> alignment [3].
> 

Would this patch be better off as a series? And maybe leave some of this 
information to a cover letter?

> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_load
> [2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_si
> [3] https://developer.arm.com/documentation/100076/0100/
> A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
> LDNP--SIMD-and-FP-
> 
> This patch is a major rewrite from the RFC v3, so no version log comparing
> to the RFC is provided.
> 
> v4
> * Also ignore the warning for clang in the workaround for
>    _mm_stream_load_si128() missing const in the parameter.
> * Add missing C linkage specifier in rte_memcpy.h.
> 
> v3
> * _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
>    use it on 64-bit x86 architecture.
> * CLANG warns that _mm_stream_load_si128_const() and
>    rte_memcpy_nt_15_or_less_s16a() are not public,
>    so remove __rte_internal from them. It also affects the documentation
>    for the functions, so the fix can't be limited to CLANG.
> * Use __rte_experimental instead of __rte_internal.
> * Replace <n> with nnn in function documentation; it doesn't look like
>    HTML.
> * Slightly modify the workaround for _mm_stream_load_si128() missing const
>    in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
>    #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
>    #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
> * Fixed one coding style issue missed in v2.
> 
> v2
> * The last 16 byte block of data, incl. any trailing bytes, was not
>    copied from the source memory area in rte_memcpy_nt_buf().
> * Fix many coding style issues.
> * Add some missing header files.
> * Fix build time warning for non-x86 architectures by using a different
>    method to mark the flags parameter unused.
> * CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
>    so omit it when using CLANG.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>   app/test/test_memcpy.c               |   65 +-
>   app/test/test_memcpy_perf.c          |  187 ++--
>   lib/eal/include/generic/rte_memcpy.h |  127 +++
>   lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
>   lib/mbuf/rte_mbuf.c                  |   77 ++
>   lib/mbuf/rte_mbuf.h                  |   32 +
>   lib/mbuf/version.map                 |    1 +
>   lib/pcapng/rte_pcapng.c              |    3 +-
>   lib/pdump/rte_pdump.c                |    6 +-
>   9 files changed, 1645 insertions(+), 91 deletions(-)
> 
> diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
> index 1ab86f4967..12410ce413 100644
> --- a/app/test/test_memcpy.c
> +++ b/app/test/test_memcpy.c
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #include <stdint.h>
> @@ -36,6 +37,19 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
>   /* Data is aligned on this many bytes (power of 2) */
>   #define ALIGNMENT_UNIT          32
>   
> +const uint64_t nt_mode_flags[4] = {

Delete "4".

> +	0,
> +	RTE_MEMOPS_F_SRC_NT,
> +	RTE_MEMOPS_F_DST_NT,
> +	RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT
> +};
> +const char * const nt_mode_str[4] = {

Delete "4".

> +	"none",
> +	"src",
> +	"dst",
> +	"src+dst"
> +};
> +
>   
>   /*
>    * Create two buffers, and initialise one with random values. These are copied
> @@ -44,12 +58,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
>    * changed.
>    */
>   static int
> -test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
> +test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size, unsigned int nt_mode)
>   {
>   	unsigned int i;
>   	uint8_t dest[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
>   	uint8_t src[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
>   	void * ret;
> +	const uint64_t flags = nt_mode_flags[nt_mode];
>   
>   	/* Setup buffers */
>   	for (i = 0; i < SMALL_BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
> @@ -58,18 +73,23 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	}
>   
>   	/* Do the copy */
> -	ret = rte_memcpy(dest + off_dst, src + off_src, size);
> -	if (ret != (dest + off_dst)) {
> -		printf("rte_memcpy() returned %p, not %p\n",
> -		       ret, dest + off_dst);
> +	if (nt_mode) {
> +		rte_memcpy_ex(dest + off_dst, src + off_src, size, flags);
> +	} else {
> +		ret = rte_memcpy(dest + off_dst, src + off_src, size);
> +		if (ret != (dest + off_dst)) {
> +			printf("rte_memcpy() returned %p, not %p\n",
> +			       ret, dest + off_dst);
> +		}
>   	}
>   
>   	/* Check nothing before offset is affected */
>   	for (i = 0; i < off_dst; i++) {
>   		if (dest[i] != 0) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
>   			       "[modified before start of dst].\n",
> -			       (unsigned)size, off_src, off_dst);
> +			       nt_mode ? "_ex" : "",

Introduce nt_mode_name() helper, which returns a string.
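
E.g. something like (untested sketch):

    static const char *
    nt_mode_name(unsigned int nt_mode)
    {
            return nt_mode ? "_ex" : "";
    }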

> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
>   			return -1;
>   		}
>   	}
> @@ -77,9 +97,11 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	/* Check everything was copied */
>   	for (i = 0; i < size; i++) {
>   		if (dest[i + off_dst] != src[i + off_src]) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> -			       "[didn't copy byte %u].\n",
> -			       (unsigned)size, off_src, off_dst, i);
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
> +			       "[didn't copy byte %u: 0x%02x!=0x%02x].\n",
> +			       nt_mode ? "_ex" : "",
> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode], i,
> +			       dest[i + off_dst], src[i + off_src]);
>   			return -1;
>   		}
>   	}
> @@ -87,9 +109,10 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	/* Check nothing after copy was affected */
>   	for (i = size; i < SMALL_BUFFER_SIZE; i++) {
>   		if (dest[i + off_dst] != 0) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
>   			       "[copied too many].\n",
> -			       (unsigned)size, off_src, off_dst);
> +			       nt_mode ? "_ex" : "",
> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);

For the size_t argument, use the 'z' length modifier, instead of a cast.
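
E.g.:

    printf("rte_memcpy%s() failed for %zu bytes (offsets=%u,%u nt=%s): "
           "[copied too many].\n",
           nt_mode ? "_ex" : "", size, off_src, off_dst, nt_mode_str[nt_mode]);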

>   			return -1;
>   		}
>   	}
> @@ -102,16 +125,18 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   static int
>   func_test(void)
>   {
> -	unsigned int off_src, off_dst, i;
> +	unsigned int off_src, off_dst, i, nt_mode;
>   	int ret;
>   
> -	for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
> -		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
> -			for (i = 0; i < RTE_DIM(buf_sizes); i++) {
> -				ret = test_single_memcpy(off_src, off_dst,
> -				                         buf_sizes[i]);
> -				if (ret != 0)
> -					return -1;
> +	for (nt_mode = 0; nt_mode < 4; nt_mode++) {
> +		for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
> +			for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
> +				for (i = 0; i < RTE_DIM(buf_sizes); i++) {
> +					ret = test_single_memcpy(off_src, off_dst,
> +								 buf_sizes[i], nt_mode);
> +					if (ret != 0)
> +						return -1;
> +				}
>   			}
>   		}
>   	}
> diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
> index 3727c160e6..6bb52cba88 100644
> --- a/app/test/test_memcpy_perf.c
> +++ b/app/test/test_memcpy_perf.c
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #include <stdint.h>
> @@ -15,6 +16,7 @@
>   #include <rte_malloc.h>
>   
>   #include <rte_memcpy.h>
> +#include <rte_atomic.h>
>   
>   #include "test.h"
>   
> @@ -27,9 +29,9 @@
>   /* List of buffer sizes to test */
>   #if TEST_VALUE_RANGE == 0
>   static size_t buf_sizes[] = {
> -	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
> -	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
> -	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
> +	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 40, 48, 60, 63, 64, 65, 80, 92, 124,
> +	127, 128, 129, 140, 152, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447,
> +	448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
>   	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
>   };
>   /* MUST be as large as largest packet size above */
> @@ -72,7 +74,7 @@ static uint8_t *small_buf_read, *small_buf_write;
>   static int
>   init_buffers(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   
>   	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
>   	if (large_buf_read == NULL)
> @@ -151,7 +153,7 @@ static void
>   do_uncached_write(uint8_t *dst, int is_dst_cached,
>   				  const uint8_t *src, int is_src_cached, size_t size)
>   {
> -	unsigned i, j;
> +	unsigned int i, j;
>   	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
>   
>   	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
> @@ -167,66 +169,112 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
>    * Run a single memcpy performance test. This is a macro to ensure that if
>    * the "size" parameter is a constant it won't be converted to a variable.
>    */
> -#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,                   \
> -                         src, is_src_cached, src_uoffset, size)             \
> -do {                                                                        \
> -    unsigned int iter, t;                                                   \
> -    size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
> -    uint64_t start_time, total_time = 0;                                    \
> -    uint64_t total_time2 = 0;                                               \
> -    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
> -        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
> -                         src_addrs, is_src_cached, src_uoffset);            \
> -        start_time = rte_rdtsc();                                           \
> -        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
> -            rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
> -        total_time += rte_rdtsc() - start_time;                             \
> -    }                                                                       \
> -    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
> -        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
> -                         src_addrs, is_src_cached, src_uoffset);            \
> -        start_time = rte_rdtsc();                                           \
> -        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
> -            memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
> -        total_time2 += rte_rdtsc() - start_time;                            \
> -    }                                                                       \
> -    printf("%3.0f -", (double)total_time  / TEST_ITERATIONS);                 \
> -    printf("%3.0f",   (double)total_time2 / TEST_ITERATIONS);                 \
> -    printf("(%6.2f%%) ", ((double)total_time - total_time2)*100/total_time2); \
> +#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,					  \
> +			 src, is_src_cached, src_uoffset, size)					  \
> +do {												  \
> +	unsigned int iter, t;									  \
> +	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];				  \
> +	uint64_t start_time;									  \
> +	uint64_t total_time_rte = 0, total_time_std = 0;					  \
> +	uint64_t total_time_ntd = 0, total_time_nts = 0, total_time_nt = 0;			  \
> +	const uint64_t flags = ((dst_uoffset == 0) ?						  \
> +				(ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT) : 0) |		  \
> +			       ((src_uoffset == 0) ?						  \
> +				(ALIGNMENT_UNIT << RTE_MEMOPS_F_SRCA_SHIFT) : 0);		  \
> +	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
> +		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
> +				 src_addrs, is_src_cached, src_uoffset);			  \
> +		start_time = rte_rdtsc();							  \
> +		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
> +			rte_memcpy(dst + dst_addrs[t], src + src_addrs[t], size);		  \
> +		total_time_rte += rte_rdtsc() - start_time;					  \
> +	}											  \
> +	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
> +		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
> +				 src_addrs, is_src_cached, src_uoffset);			  \
> +		start_time = rte_rdtsc();							  \
> +		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
> +			memcpy(dst + dst_addrs[t], src + src_addrs[t], size);			  \
> +		total_time_std += rte_rdtsc() - start_time;					  \
> +	}											  \
> +	if (!(is_dst_cached && is_src_cached)) {						  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_DST_NT);			  \
> +			total_time_ntd += rte_rdtsc() - start_time;				  \
> +		}										  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_SRC_NT);			  \
> +			total_time_nts += rte_rdtsc() - start_time;				  \
> +		}										  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT); \
> +			total_time_nt += rte_rdtsc() - start_time;				  \
> +		}										  \
> +	}											  \
> +	printf(" %4.0f-", (double)total_time_rte / TEST_ITERATIONS);				  \
> +	printf("%4.0f",   (double)total_time_std / TEST_ITERATIONS);				  \
> +	printf("(%+4.0f%%)", ((double)total_time_rte - total_time_std) * 100 / total_time_std);   \
> +	if (!(is_dst_cached && is_src_cached)) {						  \
> +		printf(" %4.0f", (double)total_time_ntd / TEST_ITERATIONS);			  \
> +		printf(" %4.0f", (double)total_time_nts / TEST_ITERATIONS);			  \
> +		printf(" %4.0f", (double)total_time_nt / TEST_ITERATIONS);			  \
> +		if (total_time_nt / total_time_std > 9)						  \
> +			printf("(*%4.1f)", (double)total_time_nt / total_time_std);		  \
> +		else										  \
> +			printf("(%+4.0f%%)",							  \
> +			       ((double)total_time_nt - total_time_std) * 100 / total_time_std);  \
> +	}											  \
>   } while (0)
>   
>   /* Run aligned memcpy tests for each cached/uncached permutation */
> -#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
> -do {                                                                     \
> -    if (__builtin_constant_p(n))                                         \
> -        printf("\nC%6u", (unsigned)n);                                   \
> -    else                                                                 \
> -        printf("\n%7u", (unsigned)n);                                    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);    \
> +#define ALL_PERF_TESTS_FOR_SIZE(n)						\
> +do {										\
> +	if (__builtin_constant_p(n))						\
> +		printf("\nC%6u", (unsigned int)n);				\
> +	else									\
> +		printf("\n%7u", (unsigned int)n);				\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);	\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);	\
>   } while (0)
>   
>   /* Run unaligned memcpy tests for each cached/uncached permutation */
> -#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
> -do {                                                                     \
> -    if (__builtin_constant_p(n))                                         \
> -        printf("\nC%6u", (unsigned)n);                                   \
> -    else                                                                 \
> -        printf("\n%7u", (unsigned)n);                                    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);    \
> +#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)					\
> +do {										\
> +	if (__builtin_constant_p(n))						\
> +		printf("\nC%6u", (unsigned int)n);				\
> +	else									\
> +		printf("\n%7u", (unsigned int)n);				\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);	\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);	\
>   } while (0)
>   
>   /* Run memcpy tests for constant length */
> -#define ALL_PERF_TEST_FOR_CONSTANT                                      \
> -do {                                                                    \
> -    TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);         \
> -    TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);      \
> -    TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);    \
> +#define ALL_PERF_TEST_FOR_CONSTANT						\
> +do {										\
> +	TEST_CONSTANT(4U); TEST_CONSTANT(6U); TEST_CONSTANT(8U);		\
> +	TEST_CONSTANT(16U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);		\
> +	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);		\
> +	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);	\
> +	TEST_CONSTANT(2048U);							\
>   } while (0)
>   
>   /* Run all memcpy tests for aligned constant cases */
> @@ -251,7 +299,7 @@ perf_test_constant_unaligned(void)
>   static inline void
>   perf_test_variable_aligned(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
>   		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
>   	}
> @@ -261,7 +309,7 @@ perf_test_variable_aligned(void)
>   static inline void
>   perf_test_variable_unaligned(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
>   		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
>   	}
> @@ -282,7 +330,7 @@ perf_test(void)
>   
>   #if TEST_VALUE_RANGE != 0
>   	/* Set up buf_sizes array, if required */
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < TEST_VALUE_RANGE; i++)
>   		buf_sizes[i] = i;
>   #endif
> @@ -290,13 +338,14 @@ perf_test(void)
>   	/* See function comment */
>   	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
>   
> -	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
> -		   "======= ================= ================= ================= =================\n"
> -		   "   Size   Cache to cache     Cache to mem      Mem to cache        Mem to mem\n"
> -		   "(bytes)          (ticks)          (ticks)           (ticks)           (ticks)\n"
> -		   "------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n** rte_memcpy(RTE)/memcpy(STD)/rte_memcpy_ex(NTD/NTS/NT) - memcpy perf. tests (C = compile-time constant) **\n"
> +		   "======= ================ ====================================== ====================================== ======================================\n"
> +		   "   Size  Cache to cache               Cache to mem                           Mem to cache                            Mem to mem\n"
> +		   "(bytes)         (ticks)                    (ticks)                                (ticks)                               (ticks)\n"
> +		   "         RTE- STD(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)\n"
> +		   "------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   
> -	printf("\n================================= %2dB aligned =================================",
> +	printf("\n================================================================ %2dB aligned ===============================================================",
>   		ALIGNMENT_UNIT);
>   	/* Do aligned tests where size is a variable */
>   	timespec_get(&tv_begin, TIME_UTC);
> @@ -304,28 +353,28 @@ perf_test(void)
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_aligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   	/* Do aligned tests where size is a compile-time constant */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_constant_aligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_aligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n================================== Unaligned ==================================");
> +	printf("\n================================================================= Unaligned =================================================================");
>   	/* Do unaligned tests where size is a variable */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_variable_unaligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_unaligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   	/* Do unaligned tests where size is a compile-time constant */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_constant_unaligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_unaligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n======= ================= ================= ================= =================\n\n");
> +	printf("\n======= ================ ====================================== ====================================== ======================================\n\n");
>   
>   	printf("Test Execution Time (seconds):\n");
>   	printf("Aligned variable copy size   = %8.3f\n", time_aligned);
> diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
> index e7f0f8eaa9..b087f09c35 100644
> --- a/lib/eal/include/generic/rte_memcpy.h
> +++ b/lib/eal/include/generic/rte_memcpy.h
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #ifndef _RTE_MEMCPY_H_
> @@ -11,6 +12,13 @@
>    * Functions for vectorised implementation of memcpy().
>    */
>   
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
>   /**
>    * Copy 16 bytes from one location to another using optimised
>    * instructions. The locations should not overlap.
> @@ -113,4 +121,123 @@ rte_memcpy(void *dst, const void *src, size_t n);
>   
>   #endif /* __DOXYGEN__ */
>   
> +/*
> + * Advanced/Non-Temporal Memory Operations Flags.
> + */
> +
> +/** Length alignment hint mask. */
> +#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
> +/** Length alignment hint shift. */
> +#define RTE_MEMOPS_F_LENA_SHIFT 0
> +/** Hint: Length is 2 byte aligned. */
> +#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
> +/** Hint: Length is 4 byte aligned. */
> +#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
> +/** Hint: Length is 8 byte aligned. */
> +#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
> +/** Hint: Length is 16 byte aligned. */
> +#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
> +/** Hint: Length is 32 byte aligned. */
> +#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
> +/** Hint: Length is 64 byte aligned. */
> +#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
> +/** Hint: Length is 128 byte aligned. */
> +#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)
> +
> +/** Prefer non-temporal access to source memory area.
> + */
> +#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
> +/** Source address alignment hint mask. */
> +#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
> +/** Source address alignment hint shift. */
> +#define RTE_MEMOPS_F_SRCA_SHIFT 8
> +/** Hint: Source address is 2 byte aligned. */
> +#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
> +/** Hint: Source address is 4 byte aligned. */
> +#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
> +/** Hint: Source address is 8 byte aligned. */
> +#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
> +/** Hint: Source address is 16 byte aligned. */
> +#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
> +/** Hint: Source address is 32 byte aligned. */
> +#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
> +/** Hint: Source address is 64 byte aligned. */
> +#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
> +/** Hint: Source address is 128 byte aligned. */
> +#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)
> +
> +/** Prefer non-temporal access to destination memory area.
> + *
> + * On x86 architecture:
> + * Remember to call rte_wmb() after a sequence of copy operations.
> + */

NT memcpy should have memcpy() semantics by default, and there should be
a flag if you don't want an sfence after any NT stores, or an lfence
before any NT loads, on x86. That is, assuming the x86 memcpy_ex w/ NT
hints will always be using NT stores, as opposed to regular stores +
clflushopt. For the latter case, or in x86 cases where the NT store
variants aren't supported, the fencing isn't needed, even on x86.

I don't know what the "ignore ordering" flag should be called.

RTE_MEMOPS_F_NO_MB
RTE_MEMOPS_F_UNORDERED
RTE_MEMOPS_F_NO_WMB
RTE_MEMOPS_F_NO_RMB

For those that use this "ignore ordering" flag (or for anyone using the 
API this patch proposes), there will be a need to insert a barrier at 
some point, unless the application is completely serial. It should be 
possible to do this in a portable manner. No #ifdef x86.

One way to attack this is to have two new functions rte_nt_wmb() and 
rte_nt_rmb() (or maybe rte_memcpy_nt_w|rmb()), which calls sfence/lfence 
(or whatever is needed on that architecture), to order the NT loads 
and/or NT stores with load/stores in the default memory consistency model.
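
A minimal sketch of the store-side function, assuming the hypothetical
name proposed above:

    /* Hypothetical; orders preceding NT stores before subsequent stores.
     * On x86 this requires sfence; on an architecture where NT stores
     * follow the normal memory model it could compile to nothing.
     */
    static inline void
    rte_memcpy_nt_wmb(void)
    {
    #ifdef RTE_ARCH_X86
            rte_wmb(); /* sfence */
    #endif
    }

The rmb counterpart would be the analogous lfence-based wrapper for NT
loads.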

> +#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
> +/** Destination address alignment hint mask. */
> +#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
> +/** Destination address alignment hint shift. */
> +#define RTE_MEMOPS_F_DSTA_SHIFT 16
> +/** Hint: Destination address is 2 byte aligned. */
> +#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
> +/** Hint: Destination address is 4 byte aligned. */
> +#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
> +/** Hint: Destination address is 8 byte aligned. */
> +#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
> +/** Hint: Destination address is 16 byte aligned. */
> +#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
> +/** Hint: Destination address is 32 byte aligned. */
> +#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
> +/** Hint: Destination address is 64 byte aligned. */
> +#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
> +/** Hint: Destination address is 128 byte aligned. */
> +#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Advanced/non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the destination memory area.
> + * @param src
> + *   Pointer to the source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)nnnA flags.
> + *   Must be constant at build time.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags);
> +
> +#ifndef RTE_MEMCPY_EX_ARCH_DEFINED
> +
> +/* Fallback implementation, if no arch-specific implementation is provided. */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)

I like the rte_memcpy_ex() name, in particular that it doesn't say 
anything about NT.

Is there a point in having flags declared const?

> +{
> +	RTE_SET_USED(flags);
> +	memcpy(dst, src, len);

Fall back to rte_memcpy().
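
I.e. (sketch):

    RTE_SET_USED(flags);
    rte_memcpy(dst, src, len);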

> +}
> +
> +#endif /* RTE_MEMCPY_EX_ARCH_DEFINED */
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
>   #endif /* _RTE_MEMCPY_H_ */
> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index d4d7a5cfc8..31d0faf7a8 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #ifndef _RTE_MEMCPY_X86_64_H_
> @@ -17,6 +18,10 @@
>   #include <rte_vect.h>
>   #include <rte_common.h>
>   #include <rte_config.h>
> +#include <rte_debug.h>
> +
> +#define RTE_MEMCPY_EX_ARCH_DEFINED
> +#include "generic/rte_memcpy.h"
>   
>   #ifdef __cplusplus
>   extern "C" {
> @@ -868,6 +873,1239 @@ rte_memcpy(void *dst, const void *src, size_t n)
>   		return rte_memcpy_generic(dst, src, n);
>   }
>   
> +/*
> + * Advanced/Non-Temporal Memory Operations.
> + */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Workaround for _mm_stream_load_si128() missing const in the parameter.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__m128i _mm_stream_load_si128_const(const __m128i *const mem_addr)

I'm not sure it's wise to use the _mm namespace for this wrapper. There
could be an upstream fix to this issue, and that fix could land on
exactly the name you chose here.

__rte_mm_stream_load_si128()?

> +{
> +	/* GCC 4.8.5 (in RHEL7) doesn't support the #pragma to ignore "-Wdiscarded-qualifiers".
> +	 * So we explicitly type cast mem_addr and use the #pragma to ignore "-Wcast-qual".
> +	 */
> +#if defined(RTE_TOOLCHAIN_GCC)
> +#pragma GCC diagnostic push
> +#pragma GCC diagnostic ignored "-Wcast-qual"
> +#elif defined(RTE_TOOLCHAIN_CLANG)
> +#pragma clang diagnostic push
> +#pragma clang diagnostic ignored "-Wcast-qual"
> +#endif
> +	return _mm_stream_load_si128((__m128i *)mem_addr);
> +#if defined(RTE_TOOLCHAIN_GCC)
> +#pragma GCC diagnostic pop
> +#elif defined(RTE_TOOLCHAIN_CLANG)
> +#pragma clang diagnostic pop
> +#endif
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Memory copy from non-temporal source area.
> + *
> + * @note
> + * Performance is optimal when source pointer is 16 byte aligned.
> + *
> + * @param dst
> + *   Pointer to the destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(LEN|SRC)nnnA flags.
> + *   The RTE_MEMOPS_F_SRC_NT flag must be set.
> + *   The RTE_MEMOPS_F_DST_NT flag must be clear.
> + *   The RTE_MEMOPS_F_DSTnnnA flags are ignored.
> + *   Must be constant at build time.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nts(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)

Why not have rte_memcpy_ex() as the single addition to the public API? 
Then you may have __rte-prefixed helpers as well, but not to be directly 
called by the application. Would simplify things from a 
documentation/user comprehension point of view, I think.

> +{
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;

Declare the xmm<N> variables in the scope where they are used (those
that are used).

Aren't you supposed to have a single whitespace between the type and the 
name in DPDK? I may be mistaken.

> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_SRC_NT);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* If source is not 16 byte aligned, then copy first part of data via bounce buffer,
> +	 * to achieve 16 byte alignment of source pointer.
> +	 * This invalidates the source, destination and length alignment flags, and
> +	 * potentially makes the destination pointer unaligned.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) {

I think it's worth giving this expression a name, especially since it's
repeatedly used.

const bool src_atleast_16a =
	(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A;

An alternative would be to have a macro RTE_MEMOPS_ATLEAST_SRC16A(flags).
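
E.g. (hypothetical):

    #define RTE_MEMOPS_ATLEAST_SRC16A(flags) \
            (((flags) & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)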

> +		/* Source is not known to be 16 byte aligned, but might be. */
> +		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
> +		const size_t    offset = (uintptr_t)src & 15;
> +
> +		if (offset) {
offset > 0
> +			/* Source is not 16 byte aligned. */
> +			char            buffer[16] __rte_aligned(16);
> +			/** How many bytes is source away from 16 byte alignment
> +			 * (ceiling rounding).
> +			 */
> +			const size_t    first = 16 - offset;
> +
> +			xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
> +			_mm_store_si128((void *)buffer, xmm0);
> +
> +			/* Test for short length.
> +			 *
> +			 * Omitted if length is known to be >= 16.
> +			 */
> +			if (!(__builtin_constant_p(len) && len >= 16) &&

Why is __builtin_constant_p() used here?

> +					unlikely(len <= first)) {
> +				/* Short length. */
> +				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
> +				return;
> +			}
> +
> +			/* Copy until source pointer is 16 byte aligned. */
> +			rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), first);
> +			src = RTE_PTR_ADD(src, first);
> +			dst = RTE_PTR_ADD(dst, first);
> +			len -= first;
> +		}
> +	}
> +
> +	/* Source pointer is now 16 byte aligned. */
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);

Is this some attempt at manual register allocation, or why is "xmm2" 
used, and not "xmm0"?

> +		_mm_storeu_si128(dst, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, if any, via bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid) and length is known to be 16 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(len & 15)) {
> +		char    buffer[16] __rte_aligned(16);
> +
> +		xmm3 = _mm_stream_load_si128_const(src);

If this is indeed a register allocation trick, it should be mentioned in
a comment. Otherwise it's just confusing. If it's a trick, does it
actually have a positive effect? I wouldn't expect the compiler to take
"xmm3" so literally, and secondly, I would expect register renaming in
the CPU to fix the false dependency.

> +		_mm_store_si128((void *)buffer, xmm3);
> +		rte_mov15_or_less(dst, buffer, len & 15);
> +	}
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Memory copy to non-temporal destination area.
> + *
> + * @note
> + * If the destination and/or length is unaligned, the first and/or last copied
> + * bytes will be stored in the destination memory area using temporal access.
> + * @note
> + * Performance is optimal when destination pointer is 16 byte aligned.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(LEN|DST)nnnA flags.
> + *   The RTE_MEMOPS_F_SRC_NT flag must be clear.
> + *   The RTE_MEMOPS_F_DST_NT flag must be set.
> + *   The RTE_MEMOPS_F_SRCnnnA flags are ignored.
> + *   Must be constant at build time.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ntd(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)

This should also go into the __rte_memcpy namespace, rather than 
rte_memcpy*.

> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_DST_NT);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ||
> +			len >= 16) {

See my comments on the SRCA mask handling.

> +		/* Length >= 16 and/or destination is known to be 16 byte aligned. */
> +		register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +		/* If destination is not 16 byte aligned, then copy first part of data,
> +		 * to achieve 16 byte alignment of destination pointer.
> +		 * This invalidates the source, destination and length alignment flags, and
> +		 * potentially makes the source pointer unaligned.
> +		 *
> +		 * Omitted if destination is known to be 16 byte aligned.
> +		 */
> +		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)) {
> +			/* Destination is not known to be 16 byte aligned, but might be. */
> +			/** How many bytes is destination offset from 16 byte alignment
> +			 * (floor rounding).
> +			 */
> +			const size_t    offset = (uintptr_t)dst & 15;
> +
> +			if (offset) {
> +				/* Destination is not 16 byte aligned. */
> +				/** How many bytes is destination away from 16 byte alignment
> +				 * (ceiling rounding).
> +				 */
> +				const size_t    first = 16 - offset;
> +
> +				if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
> +						(offset & 3) == 0) {
> +					/* Destination is (known to be) 4 byte aligned. */
> +					int32_t r0, r1, r2;
> +
> +					/* Copy until destination pointer is 16 byte aligned. */
> +					if (first & 8) {
> +						memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
> +						memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
> +						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
> +						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
> +						src = RTE_PTR_ADD(src, 8);
> +						dst = RTE_PTR_ADD(dst, 8);
> +						len -= 8;
> +					}
> +					if (first & 4) {
> +						memcpy(&r2, src, 4);
> +						_mm_stream_si32(dst, r2);
> +						src = RTE_PTR_ADD(src, 4);
> +						dst = RTE_PTR_ADD(dst, 4);
> +						len -= 4;
> +					}
> +				} else {
> +					/* Destination is not 4 byte aligned. */
> +					/* Copy until destination pointer is 16 byte aligned. */
> +					rte_mov15_or_less(dst, src, first);
> +					src = RTE_PTR_ADD(src, first);
> +					dst = RTE_PTR_ADD(dst, first);
> +					len -= first;
> +				}
> +			}
> +		}
> +
> +		/* Destination pointer is now 16 byte aligned. */
> +		RTE_ASSERT(rte_is_aligned(dst, 16));
> +
> +		/* Copy large portion of data in chunks of 64 byte. */
> +		while (len >= 64) {
> +			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
> +			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
> +			xmm2 = _mm_loadu_si128(RTE_PTR_ADD(src, 2 * 16));
> +			xmm3 = _mm_loadu_si128(RTE_PTR_ADD(src, 3 * 16));
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +			src = RTE_PTR_ADD(src, 64);
> +			dst = RTE_PTR_ADD(dst, 64);
> +			len -= 64;
> +		}
> +
> +		/* Copy following 32 and 16 byte portions of data.
> +		 *
> +		 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +		 * flags are still valid)
> +		 * and length is known to be respectively 64 or 32 byte aligned.
> +		 */
> +		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +				(len & 32)) {
> +			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
> +			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +			src = RTE_PTR_ADD(src, 32);
> +			dst = RTE_PTR_ADD(dst, 32);
> +		}
> +		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +				(len & 16)) {
> +			xmm2 = _mm_loadu_si128(src);
> +			_mm_stream_si128(dst, xmm2);
> +			src = RTE_PTR_ADD(src, 16);
> +			dst = RTE_PTR_ADD(dst, 16);
> +		}
> +	} else {
> +		/* Length <= 15, and
> +		 * destination is not known to be 16 byte aligned (but might be).
> +		 */
> +		/* If destination is not 4 byte aligned, then
> +		 * use normal copy and return.
> +		 *
> +		 * Omitted if destination is known to be 4 byte aligned.
> +		 */
> +		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) &&
> +				!rte_is_aligned(dst, 4)) {
> +			/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
> +			rte_mov15_or_less(dst, src, len);
> +			return;
> +		}
> +		/* Destination is (known to be) 4 byte aligned. Proceed. */
> +	}
> +
> +	/* Destination pointer is now 4 byte (or 16 byte) aligned. */
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +
> +	/* Copy following 8 and 4 byte portions of data.
> +	 *
> +	 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 16 or 8 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(len & 8)) {
> +		int32_t r0, r1;
> +
> +		memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
> +		memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
> +		src = RTE_PTR_ADD(src, 8);
> +		dst = RTE_PTR_ADD(dst, 8);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A)) &&
> +			(len & 4)) {
> +		int32_t r2;
> +
> +		memcpy(&r2, src, 4);
> +		_mm_stream_si32(dst, r2);
> +		src = RTE_PTR_ADD(src, 4);
> +		dst = RTE_PTR_ADD(dst, 4);
> +	}
> +
> +	/* Copy remaining 2 and 1 byte portions of data.
> +	 *
> +	 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 4 and 2 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A)) &&
> +			(len & 2)) {
> +		int16_t r3;
> +
> +		memcpy(&r3, src, 2);
> +		*(int16_t *)dst = r3;

Writing to 'dst' both through an int16_t pointer and a void pointer 
could cause type-based aliasing issues.

There's no reason not to use memcpy() here.
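
E.g.:

    memcpy(dst, &r3, 2); /* no alignment or aliasing assumptions */

or simply memcpy(dst, src, 2), dropping r3 altogether.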

> +		src = RTE_PTR_ADD(src, 2);
> +		dst = RTE_PTR_ADD(dst, 2);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A)) &&
> +			(len & 1))
> +		*(char *)dst = *(const char *)src;
> +}
> +
> +/**
> + * Non-temporal memory copy of 15 or less byte
> + * from 16 byte aligned source via bounce buffer.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Only the 4 least significant bits of this parameter are used.
> + *   The 4 least significant bits of this holds the number of remaining bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_15_or_less_s16a(void *__rte_restrict dst,
> +		const void *__rte_restrict src, size_t len, const uint64_t flags)
> +{
> +	int32_t             buffer[4] __rte_aligned(16);
> +	register __m128i    xmm0;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if ((len & 15) == 0)
> +		return;
> +
> +	/* Non-temporal load into bounce buffer. */
> +	xmm0 = _mm_stream_load_si128_const(src);
> +	_mm_store_si128((void *)buffer, xmm0);
> +
> +	/* Store from bounce buffer. */
> +	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
> +			rte_is_aligned(dst, 4)) {
> +		/* Destination is (known to be) 4 byte aligned. */
> +		src = (const void *)buffer;

Redundant cast.
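
I.e., plain

	src = buffer;

suffices; the array decays to a pointer, and the const qualification is
added implicitly.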

> +		if (len & 8) {
> +#ifdef RTE_ARCH_X86_64
> +			if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) {
> +				/* Destination is known to be 8 byte aligned. */
> +				_mm_stream_si64(dst, *(const int64_t *)src);
> +			} else {
> +#endif /* RTE_ARCH_X86_64 */
> +				_mm_stream_si32(RTE_PTR_ADD(dst, 0), buffer[0]);
> +				_mm_stream_si32(RTE_PTR_ADD(dst, 4), buffer[1]);
> +#ifdef RTE_ARCH_X86_64
> +			}
> +#endif /* RTE_ARCH_X86_64 */
> +			src = RTE_PTR_ADD(src, 8);
> +			dst = RTE_PTR_ADD(dst, 8);
> +		}
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A) &&
> +				(len & 4)) {
> +			_mm_stream_si32(dst, *(const int32_t *)src);
> +			src = RTE_PTR_ADD(src, 4);
> +			dst = RTE_PTR_ADD(dst, 4);
> +		}
> +
> +		/* Non-temporal store is unavailable for the remaining 3 byte or less. */
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) &&
> +				(len & 2)) {
> +			*(int16_t *)dst = *(const int16_t *)src;

Looks like another type-based aliasing issue.
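
Same remedy as above; since no register staging is needed here, a
direct

	memcpy(dst, src, 2);

does the job.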

> +			src = RTE_PTR_ADD(src, 2);
> +			dst = RTE_PTR_ADD(dst, 2);
> +		}
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A) &&
> +				(len & 1)) {
> +			*(char *)dst = *(const char *)src;
> +		}
> +	} else {
> +		/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
> +		rte_mov15_or_less(dst, (const void *)buffer, len & 15);

This cast is not needed.

> +	}
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 16 byte aligned addresses non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + *   Must be 16 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d16s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)

This function should not be public. That goes for all the other public 
functions below as well.

That said, maybe there's precedent against this, with all the various
rte_memcpy() helpers being public already. I don't know.

> +{
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;

Reduce the scope of these variable declarations.
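
E.g. (sketch):

	while (len >= 64) {
		__m128i xmm0, xmm1, xmm2, xmm3;

		/* ... loads and stores as before ... */
	}

with separate declarations in the 32 and 16 byte tail blocks. The
'register' keyword could go at the same time; it has no effect with
modern compilers.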

> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 16));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_stream_si128(dst, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ?
> +				flags : RTE_MEMOPS_F_DST16A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +
> +#ifdef RTE_ARCH_X86_64
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 8/16 byte aligned destination/source addresses non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + *   Must be 8 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d8s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	int64_t             buffer[8] __rte_cache_aligned /* at least __rte_aligned(16) */;
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 8));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
> +		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
> +		_mm_store_si128((void *)&buffer[3 * 2], xmm3);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 4 * 8), buffer[4]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 5 * 8), buffer[5]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 6 * 8), buffer[6]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 7 * 8), buffer[7]);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[4]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[5]);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) ?
> +				flags : RTE_MEMOPS_F_DST8A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +#endif /* RTE_ARCH_X86_64 */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 4/16 byte aligned destination/source addresses non-temporal memory copy.

/../ non-temporal source and destination /../

> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.

Delete "non-temporal" here and below. NT is not a property of a memory area.

> + *   Must be 4 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d4s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	int32_t             buffer[16] __rte_cache_aligned /* at least __rte_aligned(16) */;
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
> +		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
> +		_mm_store_si128((void *)&buffer[3 * 4], xmm3);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  0 * 4), buffer[0]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  1 * 4), buffer[1]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  2 * 4), buffer[2]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  3 * 4), buffer[3]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  4 * 4), buffer[4]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  5 * 4), buffer[5]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  6 * 4), buffer[6]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  7 * 4), buffer[7]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  8 * 4), buffer[8]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  9 * 4), buffer[9]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 10 * 4), buffer[10]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 11 * 4), buffer[11]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 12 * 4), buffer[12]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 13 * 4), buffer[13]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 14 * 4), buffer[14]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 15 * 4), buffer[15]);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 4 * 4), buffer[4]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 5 * 4), buffer[5]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 6 * 4), buffer[6]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 7 * 4), buffer[7]);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[8]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[9]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[10]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[11]);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ?
> +				flags : RTE_MEMOPS_F_DST4A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 4 byte aligned addresses (non-temporal) memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the (non-temporal) destination memory area.
> + *   Must be 4 byte aligned if using non-temporal store.
> + * @param src
> + *   Pointer to the (non-temporal) source memory area.
> + *   Must be 4 byte aligned if using non-temporal load.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d4s4a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)

If this isn't an NT memcpy, why is it named _nt_?

Why is it needed at all? Why not use rte_memcpy() in this case?

> +{
> +	/** How many bytes is source offset from 16 byte alignment (floor rounding). */
> +	const size_t    offset = (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A ?
> +			0 : (uintptr_t)src & 15;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +	RTE_ASSERT(rte_is_aligned(src, 4));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	if (offset == 0) {
> +		/* Source is 16 byte aligned. */
> +		/* Copy everything, using upgraded source alignment flags. */
> +		rte_memcpy_nt_d4s16a(dst, src, len,
> +				(flags & ~RTE_MEMOPS_F_SRCA_MASK) | RTE_MEMOPS_F_SRC16A);
> +	} else {
> +		/* Source is not 16 byte aligned, so make it 16 byte aligned. */
> +		int32_t             buffer[4] __rte_aligned(16);
> +		const size_t        first = 16 - offset;
> +		register __m128i    xmm0;
> +
> +		/* First, copy first part of data in chunks of 4 byte,
> +		 * to achieve 16 byte alignment of source.
> +		 * This invalidates the source, destination and length alignment flags, and
> +		 * may change whether the destination pointer is 16 byte aligned.
> +		 */
> +
> +		/** Copy from 16 byte aligned source pointer (floor rounding). */
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
> +		_mm_store_si128((void *)buffer, xmm0);
> +
> +		if (unlikely(len + offset <= 16)) {
> +			/* Short length. */
> +			if (((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ||
> +					(len & 3) == 0) {
> +				/* Length is 4 byte aligned. */
> +				switch (len) {
> +				case 1 * 4:
> +					/* Offset can be 1 * 4, 2 * 4 or 3 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
> +							buffer[offset / 4]);
> +					break;
> +				case 2 * 4:
> +					/* Offset can be 1 * 4 or 2 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
> +							buffer[offset / 4]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4),
> +							buffer[offset / 4 + 1]);
> +					break;
> +				case 3 * 4:
> +					/* Offset can only be 1 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
> +					break;
> +				}
> +			} else {
> +				/* Length is not 4 byte aligned. */
> +				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
> +			}
> +			return;
> +		}
> +
> +		switch (first) {
> +		case 1 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
> +			break;
> +		case 2 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
> +			break;
> +		case 3 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
> +			break;
> +		}
> +
> +		src = RTE_PTR_ADD(src, first);
> +		dst = RTE_PTR_ADD(dst, first);
> +		len -= first;
> +
> +		/* Source pointer is now 16 byte aligned. */
> +		RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +		/* Then, copy the rest, using corrected alignment flags. */
> +		if (rte_is_aligned(dst, 16))
> +			rte_memcpy_nt_d16s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +#ifdef RTE_ARCH_X86_64
> +		else if (rte_is_aligned(dst, 8))
> +			rte_memcpy_nt_d8s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST8A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +#endif /* RTE_ARCH_X86_64 */
> +		else
> +			rte_memcpy_nt_d4s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST4A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +	}
> +}
> +
> +#ifndef RTE_MEMCPY_NT_BUFSIZE
> +
> +#include <lib/mbuf/rte_mbuf_core.h>
> +
> +/** Bounce buffer size for non-temporal memcpy.
> + *
> + * Must be 2^N and >= 128.
> + * The actual buffer will be slightly larger, due to added padding.
> + * The default is chosen to be able to handle a non-segmented packet.
> + */
> +#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM
> +
> +#endif  /* RTE_MEMCPY_NT_BUFSIZE */
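
As the #ifndef suggests, an application can override the bounce buffer
size at compile time, e.g. (sketch):

	#define RTE_MEMCPY_NT_BUFSIZE 4096
	#include <rte_memcpy.h>

provided the value remains a power of two and at least 128.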
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Non-temporal memory copy via bounce buffer.
> + *
> + * @note
> + * If the destination and/or length is unaligned, the first and/or last copied
> + * bytes will be stored in the destination memory area using temporal access.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_buf(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	/** Cache line aligned bounce buffer with preceding and trailing padding.
> +	 *
> +	 * The preceding padding is one cache line, so the data area itself
> +	 * is cache line aligned.
> +	 * The trailing padding is 16 bytes, leaving room for the trailing bytes
> +	 * of a 16 byte store operation.
> +	 */
> +	char			buffer[RTE_CACHE_LINE_SIZE + RTE_MEMCPY_NT_BUFSIZE +  16]
> +				__rte_cache_aligned;
> +	/** Pointer to bounce buffer's aligned data area. */
> +	char		* const buf0 = &buffer[RTE_CACHE_LINE_SIZE];
> +	void		       *buf;
> +	/** Number of bytes to copy from source, incl. any extra preceding bytes. */
> +	size_t			srclen;
> +	register __m128i	xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Step 1:
> +	 * Copy data from the source to the bounce buffer's aligned data area,
> +	 * using aligned non-temporal load from the source,
> +	 * and unaligned store in the bounce buffer.
> +	 *
> +	 * If the source is unaligned, the additional bytes preceding the data will be copied
> +	 * to the padding area preceding the bounce buffer's aligned data area.
> +	 * Similarly, if the source data ends at an unaligned address, the additional bytes
> +	 * trailing the data will be copied to the padding area trailing the bounce buffer's
> +	 * aligned data area.
> +	 */
> +
> +	/* Adjust for extra preceding bytes, unless source is known to be 16 byte aligned. */
> +	if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) {
> +		buf = buf0;
> +		srclen = len;
> +	} else {
> +		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
> +		const size_t offset = (uintptr_t)src & 15;
> +
> +		buf = RTE_PTR_SUB(buf0, offset);
> +		src = RTE_PTR_SUB(src, offset);
> +		srclen = len + offset;
> +	}
> +
> +	/* Copy large portion of data from source to bounce buffer in chunks of 64 byte. */
> +	while (srclen >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		buf = RTE_PTR_ADD(buf, 64);
> +		srclen -= 64;
> +	}
> +
> +	/* Copy remaining 32 and 16 byte portions of data from source to bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the length alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +			(srclen & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		buf = RTE_PTR_ADD(buf, 32);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +			(srclen & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_storeu_si128(buf, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		buf = RTE_PTR_ADD(buf, 16);
> +	}
> +	/* Copy any trailing bytes of data from source to bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the length alignment
> +	 * flags are still valid)
> +	 * and length is known to be 16 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(srclen & 15)) {
> +		xmm3 = _mm_stream_load_si128_const(src);
> +		_mm_storeu_si128(buf, xmm3);
> +	}
> +
> +	/* Step 2:
> +	 * Copy from the aligned bounce buffer to the non-temporal destination.
> +	 */
> +	rte_memcpy_ntd(dst, buf0, len,
> +			(flags & ~(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRCA_MASK)) |
> +			(RTE_CACHE_LINE_SIZE << RTE_MEMOPS_F_SRCA_SHIFT));
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @note
> + * If the destination and/or length is unaligned, some copied bytes will be
> + * stored in the destination memory area using temporal access.

Is temporal access the proper term?

I would describe it as "stored in the destination memory area without
the use of non-temporal hints", or something like that.

> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_generic(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +
> +	while (len > RTE_MEMCPY_NT_BUFSIZE) {
> +		rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE,
> +				(flags & ~RTE_MEMOPS_F_LENA_MASK) | RTE_MEMOPS_F_LEN128A);
> +		dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
> +		src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
> +		len -= RTE_MEMCPY_NT_BUFSIZE;
> +	}
> +	rte_memcpy_nt_buf(dst, src, len, flags);
> +}
> +
> +/* Implementation. Refer to function declaration for documentation. */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	if ((flags & (RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) ==
> +			(RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) {
> +		/* Copy between non-temporal source and destination. */
> +		if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d16s16a(dst, src, len, flags);
> +#ifdef RTE_ARCH_X86_64
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d8s16a(dst, src, len, flags);
> +#endif /* RTE_ARCH_X86_64 */
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d4s16a(dst, src, len, flags);
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
> +			rte_memcpy_nt_d4s4a(dst, src, len, flags);
> +		else if (len <= RTE_MEMCPY_NT_BUFSIZE)
> +			rte_memcpy_nt_buf(dst, src, len, flags);
> +		else
> +			rte_memcpy_nt_generic(dst, src, len, flags);
> +	} else if (flags & RTE_MEMOPS_F_SRC_NT) {
> +		/* Copy from non-temporal source. */
> +		rte_memcpy_nts(dst, src, len, flags);
> +	} else if (flags & RTE_MEMOPS_F_DST_NT) {
> +		/* Copy to non-temporal destination. */
> +		rte_memcpy_ntd(dst, src, len, flags);
> +	} else
> +		rte_memcpy(dst, src, len);
> +}
> +
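
As the RTE_BUILD_BUG_ON above enforces, flags must be a compile-time
constant, so a typical call site looks like this (sketch, assuming 16
byte aligned buffers):

	rte_memcpy_ex(dst, src, len,
			RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT |
			RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A);

which lets the if/else chain above collapse to a single call at compile
time.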
>   #undef ALIGNMENT_MASK
>   
>   #if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
> index a2307cebe6..aa96fb4cc8 100644
> --- a/lib/mbuf/rte_mbuf.c
> +++ b/lib/mbuf/rte_mbuf.c
> @@ -660,6 +660,83 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
>   	return mc;
>   }
>   
> +/* Create a deep copy of mbuf, using non-temporal memory access */
> +struct rte_mbuf *
> +rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
> +		 uint32_t off, uint32_t len, const uint64_t flags)
> +{
> +	const struct rte_mbuf *seg = m;
> +	struct rte_mbuf *mc, *m_last, **prev;
> +
> +	/* garbage in check */
> +	__rte_mbuf_sanity_check(m, 1);
> +
> +	/* check for request to copy at offset past end of mbuf */
> +	if (unlikely(off >= m->pkt_len))
> +		return NULL;
> +
> +	mc = rte_pktmbuf_alloc(mp);
> +	if (unlikely(mc == NULL))
> +		return NULL;
> +
> +	/* truncate requested length to available data */
> +	if (len > m->pkt_len - off)
> +		len = m->pkt_len - off;
> +
> +	__rte_pktmbuf_copy_hdr(mc, m);
> +
> +	/* copied mbuf is not indirect or external */
> +	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
> +
> +	prev = &mc->next;
> +	m_last = mc;
> +	while (len > 0) {
> +		uint32_t copy_len;
> +
> +		/* skip leading mbuf segments */
> +		while (off >= seg->data_len) {
> +			off -= seg->data_len;
> +			seg = seg->next;
> +		}
> +
> +		/* current buffer is full, chain a new one */
> +		if (rte_pktmbuf_tailroom(m_last) == 0) {
> +			m_last = rte_pktmbuf_alloc(mp);
> +			if (unlikely(m_last == NULL)) {
> +				rte_pktmbuf_free(mc);
> +				return NULL;
> +			}
> +			++mc->nb_segs;
> +			*prev = m_last;
> +			prev = &m_last->next;
> +		}
> +
> +		/*
> +		 * copy the min of data in input segment (seg)
> +		 * vs space available in output (m_last)
> +		 */
> +		copy_len = RTE_MIN(seg->data_len - off, len);
> +		if (copy_len > rte_pktmbuf_tailroom(m_last))
> +			copy_len = rte_pktmbuf_tailroom(m_last);
> +
> +		/* append from seg to m_last */
> +		rte_memcpy_ex(rte_pktmbuf_mtod_offset(m_last, char *,
> +						   m_last->data_len),
> +			   rte_pktmbuf_mtod_offset(seg, char *, off),
> +			   copy_len, flags);
> +
> +		/* update offsets and lengths */
> +		m_last->data_len += copy_len;
> +		mc->pkt_len += copy_len;
> +		off += copy_len;
> +		len -= copy_len;
> +	}
> +
> +	/* garbage out check */
> +	__rte_mbuf_sanity_check(mc, 1);
> +	return mc;
> +}
> +

This looks like a cut-and-paste from rte_pktmbuf_copy(). Make a
__rte_pktmbuf_copy_generic() which takes either a memcpy()-style
function pointer plus flags, or just flags, as input, and have both the
new copy_ex() and the old copy function delegate to it.
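
A rough sketch of the flags-only variant:

	static struct rte_mbuf *
	__rte_pktmbuf_copy_generic(const struct rte_mbuf *m, struct rte_mempool *mp,
			uint32_t off, uint32_t len, const uint64_t flags)
	{
		/* current rte_pktmbuf_copy() body, with the inner copy
		 * performed by rte_memcpy_ex(..., flags)
		 */
	}

	struct rte_mbuf *
	rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
			uint32_t off, uint32_t len)
	{
		return __rte_pktmbuf_copy_generic(m, mp, off, len, 0);
	}

	struct rte_mbuf *
	rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
			uint32_t off, uint32_t len, const uint64_t flags)
	{
		return __rte_pktmbuf_copy_generic(m, mp, off, len, flags);
	}

With flags == 0, rte_memcpy_ex() falls through to plain rte_memcpy(),
so the existing copy function keeps its current behaviour.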

>   /* dump a mbuf on console */
>   void
>   rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index b6e23d98ce..030df396a3 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -1443,6 +1443,38 @@ struct rte_mbuf *
>   rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
>   		 uint32_t offset, uint32_t length);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Create a full copy of a given packet mbuf,
> + * using non-temporal memory access as specified by flags.
> + *
> + * Copies all the data from a given packet mbuf to a newly allocated
> + * set of mbufs. The private data is not copied.
> + *
> + * @param m
> + *   The packet mbuf to be copied.
> + * @param mp
> + *   The mempool from which the "clone" mbufs are allocated.
> + * @param offset
> + *   The number of bytes to skip before copying.
> + *   If the mbuf does not have that many bytes, it is an error
> + *   and NULL is returned.
> + * @param length
> + *   The upper limit on bytes to copy.  Passing UINT32_MAX
> + *   means all data (after offset).
> + * @param flags
> + *   Non-temporal memory access hints for rte_memcpy_ex.
> + * @return
> + *   - The pointer to the new "clone" mbuf on success.
> + *   - NULL if allocation fails.
> + */
> +__rte_experimental
> +struct rte_mbuf *
> +rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
> +		    uint32_t offset, uint32_t length, const uint64_t flags);

The same question about why flags is const.

> +
>   /**
>    * Adds given value to the refcnt of all packet mbuf segments.
>    *
> diff --git a/lib/mbuf/version.map b/lib/mbuf/version.map
> index ed486ed14e..b583364ad4 100644
> --- a/lib/mbuf/version.map
> +++ b/lib/mbuf/version.map
> @@ -47,5 +47,6 @@ EXPERIMENTAL {
>   	global:
>   
>   	rte_pktmbuf_pool_create_extbuf;
> +	rte_pktmbuf_copy_ex;
>   
>   };
> diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
> index af2b814251..ae871c4865 100644
> --- a/lib/pcapng/rte_pcapng.c
> +++ b/lib/pcapng/rte_pcapng.c
> @@ -466,7 +466,8 @@ rte_pcapng_copy(uint16_t port_id, uint32_t queue,
>   	orig_len = rte_pktmbuf_pkt_len(md);
>   
>   	/* Take snapshot of the data */
> -	mc = rte_pktmbuf_copy(md, mp, 0, length);
> +	mc = rte_pktmbuf_copy_ex(md, mp, 0, length,
> +				 RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
>   	if (unlikely(mc == NULL))
>   		return NULL;
>   
> diff --git a/lib/pdump/rte_pdump.c b/lib/pdump/rte_pdump.c
> index 98dcbc037b..6e61c75407 100644
> --- a/lib/pdump/rte_pdump.c
> +++ b/lib/pdump/rte_pdump.c
> @@ -124,7 +124,8 @@ pdump_copy(uint16_t port_id, uint16_t queue,
>   					    pkts[i], mp, cbs->snaplen,
>   					    ts, direction);
>   		else
> -			p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
> +			p = rte_pktmbuf_copy_ex(pkts[i], mp, 0, cbs->snaplen,
> +						RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
>   
>   		if (unlikely(p == NULL))
>   			__atomic_fetch_add(&stats->nombuf, 1, __ATOMIC_RELAXED);
> @@ -134,6 +135,9 @@ pdump_copy(uint16_t port_id, uint16_t queue,
>   
>   	__atomic_fetch_add(&stats->accepted, d_pkts, __ATOMIC_RELAXED);
>   
> +	/* Flush non-temporal stores regarding the packet copies. */
> +	rte_wmb();
> +

This is an unnecessary barrier for many architectures.

>   	ring_enq = rte_ring_enqueue_burst(ring, (void *)dup_bufs, d_pkts, NULL);
>   	if (unlikely(ring_enq < d_pkts)) {
>   		unsigned int drops = d_pkts - ring_enq;
  
Mattias Rönnblom Oct. 16, 2022, 7:55 p.m. UTC | #2
On 2022-10-10 08:46, Morten Brørup wrote:
> This patch provides a function for memory copy using non-temporal store,
> load or both, controlled by flags passed to the function.
> 
> Applications sometimes copy data to another memory location, which is only
> used much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
> 
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
> 
> The purpose of the function is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput can be improved by further optimization, I do not
> have time to do it now.
> 
> The functional tests and performance tests for memory copy have been
> expanded to include non-temporal copying.
> 
> A non-temporal version of the mbuf library's function to create a full
> copy of a given packet mbuf is provided.
> 
> The packet capture and packet dump libraries have been updated to use
> non-temporal memory copy of the packets.
> 
> Implementation notes:
> 
> Implementations for non-x86 architectures can be provided by anyone at a
> later time. I am not going to do it.
> 
> x86 non-temporal load instructions must be 16 byte aligned [1], and
> non-temporal store instructions must be 4, 8 or 16 byte aligned [2].
> 
> ARM non-temporal load and store instructions seem to require 4 byte
> alignment [3].
> 
> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_load
> [2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_si
> [3] https://developer.arm.com/documentation/100076/0100/
> A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
> LDNP--SIMD-and-FP-
> 
> This patch is a major rewrite from the RFC v3, so no version log comparing
> to the RFC is provided.
> 
> v4
> * Also ignore the warning for clang int the workaround for
>    _mm_stream_load_si128() missing const in the parameter.
> * Add missing C linkage specifier in rte_memcpy.h.
> 
> v3
> * _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
>    use it on 64-bit x86 architecture.
> * CLANG warns that _mm_stream_load_si128_const() and
>    rte_memcpy_nt_15_or_less_s16a() are not public,
>    so remove __rte_internal from them. It also affects the documentation
>    for the functions, so the fix can't be limited to CLANG.
> * Use __rte_experimental instead of __rte_internal.
> * Replace <n> with nnn in function documentation; it doesn't look like
>    HTML.
> * Slightly modify the workaround for _mm_stream_load_si128() missing const
>    in the parameter; the ancient GCC 4.5.8 in RHEL7 doesn't understand
>    #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
>    #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
> * Fixed one coding style issue missed in v2.
> 
> v2
> * The last 16 byte block of data, incl. any trailing bytes, were not
>    copied from the source memory area in rte_memcpy_nt_buf().
> * Fix many coding style issues.
> * Add some missing header files.
> * Fix build time warning for non-x86 architectures by using a different
>    method to mark the flags parameter unused.
> * CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
>    so omit it when using CLANG.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>   app/test/test_memcpy.c               |   65 +-
>   app/test/test_memcpy_perf.c          |  187 ++--
>   lib/eal/include/generic/rte_memcpy.h |  127 +++
>   lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
>   lib/mbuf/rte_mbuf.c                  |   77 ++
>   lib/mbuf/rte_mbuf.h                  |   32 +
>   lib/mbuf/version.map                 |    1 +
>   lib/pcapng/rte_pcapng.c              |    3 +-
>   lib/pdump/rte_pdump.c                |    6 +-
>   9 files changed, 1645 insertions(+), 91 deletions(-)
> 
> diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
> index 1ab86f4967..12410ce413 100644
> --- a/app/test/test_memcpy.c
> +++ b/app/test/test_memcpy.c
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #include <stdint.h>
> @@ -36,6 +37,19 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
>   /* Data is aligned on this many bytes (power of 2) */
>   #define ALIGNMENT_UNIT          32
>   
> +const uint64_t nt_mode_flags[4] = {
> +	0,
> +	RTE_MEMOPS_F_SRC_NT,
> +	RTE_MEMOPS_F_DST_NT,
> +	RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT
> +};
> +const char * const nt_mode_str[4] = {
> +	"none",
> +	"src",
> +	"dst",
> +	"src+dst"
> +};
> +
>   
>   /*
>    * Create two buffers, and initialise one with random values. These are copied
> @@ -44,12 +58,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
>    * changed.
>    */
>   static int
> -test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
> +test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size, unsigned int nt_mode)
>   {
>   	unsigned int i;
>   	uint8_t dest[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
>   	uint8_t src[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
>   	void * ret;
> +	const uint64_t flags = nt_mode_flags[nt_mode];
>   
>   	/* Setup buffers */
>   	for (i = 0; i < SMALL_BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
> @@ -58,18 +73,23 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	}
>   
>   	/* Do the copy */
> -	ret = rte_memcpy(dest + off_dst, src + off_src, size);
> -	if (ret != (dest + off_dst)) {
> -		printf("rte_memcpy() returned %p, not %p\n",
> -		       ret, dest + off_dst);
> +	if (nt_mode) {
> +		rte_memcpy_ex(dest + off_dst, src + off_src, size, flags);
> +	} else {
> +		ret = rte_memcpy(dest + off_dst, src + off_src, size);
> +		if (ret != (dest + off_dst)) {
> +			printf("rte_memcpy() returned %p, not %p\n",
> +			       ret, dest + off_dst);
> +		}
>   	}
>   
>   	/* Check nothing before offset is affected */
>   	for (i = 0; i < off_dst; i++) {
>   		if (dest[i] != 0) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
>   			       "[modified before start of dst].\n",
> -			       (unsigned)size, off_src, off_dst);
> +			       nt_mode ? "_ex" : "",
> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
>   			return -1;
>   		}
>   	}
> @@ -77,9 +97,11 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	/* Check everything was copied */
>   	for (i = 0; i < size; i++) {
>   		if (dest[i + off_dst] != src[i + off_src]) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> -			       "[didn't copy byte %u].\n",
> -			       (unsigned)size, off_src, off_dst, i);
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
> +			       "[didn't copy byte %u: 0x%02x!=0x%02x].\n",
> +			       nt_mode ? "_ex" : "",
> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode], i,
> +			       dest[i + off_dst], src[i + off_src]);
>   			return -1;
>   		}
>   	}
> @@ -87,9 +109,10 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	/* Check nothing after copy was affected */
>   	for (i = size; i < SMALL_BUFFER_SIZE; i++) {
>   		if (dest[i + off_dst] != 0) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
>   			       "[copied too many].\n",
> -			       (unsigned)size, off_src, off_dst);
> +			       nt_mode ? "_ex" : "",
> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
>   			return -1;
>   		}
>   	}
> @@ -102,16 +125,18 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   static int
>   func_test(void)
>   {
> -	unsigned int off_src, off_dst, i;
> +	unsigned int off_src, off_dst, i, nt_mode;
>   	int ret;
>   
> -	for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
> -		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
> -			for (i = 0; i < RTE_DIM(buf_sizes); i++) {
> -				ret = test_single_memcpy(off_src, off_dst,
> -				                         buf_sizes[i]);
> -				if (ret != 0)
> -					return -1;
> +	for (nt_mode = 0; nt_mode < 4; nt_mode++) {
> +		for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
> +			for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
> +				for (i = 0; i < RTE_DIM(buf_sizes); i++) {
> +					ret = test_single_memcpy(off_src, off_dst,
> +								 buf_sizes[i], nt_mode);
> +					if (ret != 0)
> +						return -1;
> +				}
>   			}
>   		}
>   	}
> diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
> index 3727c160e6..6bb52cba88 100644
> --- a/app/test/test_memcpy_perf.c
> +++ b/app/test/test_memcpy_perf.c
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #include <stdint.h>
> @@ -15,6 +16,7 @@
>   #include <rte_malloc.h>
>   
>   #include <rte_memcpy.h>
> +#include <rte_atomic.h>
>   
>   #include "test.h"
>   
> @@ -27,9 +29,9 @@
>   /* List of buffer sizes to test */
>   #if TEST_VALUE_RANGE == 0
>   static size_t buf_sizes[] = {
> -	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
> -	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
> -	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
> +	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 40, 48, 60, 63, 64, 65, 80, 92, 124,
> +	127, 128, 129, 140, 152, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447,
> +	448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
>   	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
>   };
>   /* MUST be as large as largest packet size above */
> @@ -72,7 +74,7 @@ static uint8_t *small_buf_read, *small_buf_write;
>   static int
>   init_buffers(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   
>   	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
>   	if (large_buf_read == NULL)
> @@ -151,7 +153,7 @@ static void
>   do_uncached_write(uint8_t *dst, int is_dst_cached,
>   				  const uint8_t *src, int is_src_cached, size_t size)
>   {
> -	unsigned i, j;
> +	unsigned int i, j;
>   	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
>   
>   	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
> @@ -167,66 +169,112 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
>    * Run a single memcpy performance test. This is a macro to ensure that if
>    * the "size" parameter is a constant it won't be converted to a variable.
>    */
> -#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,                   \
> -                         src, is_src_cached, src_uoffset, size)             \
> -do {                                                                        \
> -    unsigned int iter, t;                                                   \
> -    size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
> -    uint64_t start_time, total_time = 0;                                    \
> -    uint64_t total_time2 = 0;                                               \
> -    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
> -        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
> -                         src_addrs, is_src_cached, src_uoffset);            \
> -        start_time = rte_rdtsc();                                           \
> -        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
> -            rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
> -        total_time += rte_rdtsc() - start_time;                             \
> -    }                                                                       \
> -    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
> -        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
> -                         src_addrs, is_src_cached, src_uoffset);            \
> -        start_time = rte_rdtsc();                                           \
> -        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
> -            memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
> -        total_time2 += rte_rdtsc() - start_time;                            \
> -    }                                                                       \
> -    printf("%3.0f -", (double)total_time  / TEST_ITERATIONS);                 \
> -    printf("%3.0f",   (double)total_time2 / TEST_ITERATIONS);                 \
> -    printf("(%6.2f%%) ", ((double)total_time - total_time2)*100/total_time2); \
> +#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,					  \
> +			 src, is_src_cached, src_uoffset, size)					  \
> +do {												  \
> +	unsigned int iter, t;									  \
> +	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];				  \
> +	uint64_t start_time;									  \
> +	uint64_t total_time_rte = 0, total_time_std = 0;					  \
> +	uint64_t total_time_ntd = 0, total_time_nts = 0, total_time_nt = 0;			  \
> +	const uint64_t flags = ((dst_uoffset == 0) ?						  \
> +				(ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT) : 0) |		  \
> +			       ((src_uoffset == 0) ?						  \
> +				(ALIGNMENT_UNIT << RTE_MEMOPS_F_SRCA_SHIFT) : 0);		  \
> +	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
> +		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
> +				 src_addrs, is_src_cached, src_uoffset);			  \
> +		start_time = rte_rdtsc();							  \
> +		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
> +			rte_memcpy(dst + dst_addrs[t], src + src_addrs[t], size);		  \
> +		total_time_rte += rte_rdtsc() - start_time;					  \
> +	}											  \
> +	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
> +		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
> +				 src_addrs, is_src_cached, src_uoffset);			  \
> +		start_time = rte_rdtsc();							  \
> +		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
> +			memcpy(dst + dst_addrs[t], src + src_addrs[t], size);			  \
> +		total_time_std += rte_rdtsc() - start_time;					  \
> +	}											  \
> +	if (!(is_dst_cached && is_src_cached)) {						  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_DST_NT);			  \
> +			total_time_ntd += rte_rdtsc() - start_time;				  \
> +		}										  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_SRC_NT);			  \
> +			total_time_nts += rte_rdtsc() - start_time;				  \
> +		}										  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT); \
> +			total_time_nt += rte_rdtsc() - start_time;				  \
> +		}										  \
> +	}											  \
> +	printf(" %4.0f-", (double)total_time_rte / TEST_ITERATIONS);				  \
> +	printf("%4.0f",   (double)total_time_std / TEST_ITERATIONS);				  \
> +	printf("(%+4.0f%%)", ((double)total_time_rte - total_time_std) * 100 / total_time_std);   \
> +	if (!(is_dst_cached && is_src_cached)) {						  \
> +		printf(" %4.0f", (double)total_time_ntd / TEST_ITERATIONS);			  \
> +		printf(" %4.0f", (double)total_time_nts / TEST_ITERATIONS);			  \
> +		printf(" %4.0f", (double)total_time_nt / TEST_ITERATIONS);			  \
> +		if (total_time_nt / total_time_std > 9)						  \
> +			printf("(*%4.1f)", (double)total_time_nt / total_time_std);		  \
> +		else										  \
> +			printf("(%+4.0f%%)",							  \
> +			       ((double)total_time_nt - total_time_std) * 100 / total_time_std);  \
> +	}											  \
>   } while (0)
>   
>   /* Run aligned memcpy tests for each cached/uncached permutation */
> -#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
> -do {                                                                     \
> -    if (__builtin_constant_p(n))                                         \
> -        printf("\nC%6u", (unsigned)n);                                   \
> -    else                                                                 \
> -        printf("\n%7u", (unsigned)n);                                    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);    \
> +#define ALL_PERF_TESTS_FOR_SIZE(n)						\
> +do {										\
> +	if (__builtin_constant_p(n))						\
> +		printf("\nC%6u", (unsigned int)n);				\
> +	else									\
> +		printf("\n%7u", (unsigned int)n);				\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);	\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);	\
>   } while (0)
>   
>   /* Run unaligned memcpy tests for each cached/uncached permutation */
> -#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
> -do {                                                                     \
> -    if (__builtin_constant_p(n))                                         \
> -        printf("\nC%6u", (unsigned)n);                                   \
> -    else                                                                 \
> -        printf("\n%7u", (unsigned)n);                                    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);    \
> +#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)					\
> +do {										\
> +	if (__builtin_constant_p(n))						\
> +		printf("\nC%6u", (unsigned int)n);				\
> +	else									\
> +		printf("\n%7u", (unsigned int)n);				\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);	\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);	\
>   } while (0)
>   
>   /* Run memcpy tests for constant length */
> -#define ALL_PERF_TEST_FOR_CONSTANT                                      \
> -do {                                                                    \
> -    TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);         \
> -    TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);      \
> -    TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);    \
> +#define ALL_PERF_TEST_FOR_CONSTANT						\
> +do {										\
> +	TEST_CONSTANT(4U); TEST_CONSTANT(6U); TEST_CONSTANT(8U);		\
> +	TEST_CONSTANT(16U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);		\
> +	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);		\
> +	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);	\
> +	TEST_CONSTANT(2048U);							\
>   } while (0)
>   
>   /* Run all memcpy tests for aligned constant cases */
> @@ -251,7 +299,7 @@ perf_test_constant_unaligned(void)
>   static inline void
>   perf_test_variable_aligned(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
>   		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
>   	}
> @@ -261,7 +309,7 @@ perf_test_variable_aligned(void)
>   static inline void
>   perf_test_variable_unaligned(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
>   		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
>   	}
> @@ -282,7 +330,7 @@ perf_test(void)
>   
>   #if TEST_VALUE_RANGE != 0
>   	/* Set up buf_sizes array, if required */
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < TEST_VALUE_RANGE; i++)
>   		buf_sizes[i] = i;
>   #endif
> @@ -290,13 +338,14 @@ perf_test(void)
>   	/* See function comment */
>   	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
>   
> -	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
> -		   "======= ================= ================= ================= =================\n"
> -		   "   Size   Cache to cache     Cache to mem      Mem to cache        Mem to mem\n"
> -		   "(bytes)          (ticks)          (ticks)           (ticks)           (ticks)\n"
> -		   "------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n** rte_memcpy(RTE)/memcpy(STD)/rte_memcpy_ex(NTD/NTS/NT) - memcpy perf. tests (C = compile-time constant) **\n"
> +		   "======= ================ ====================================== ====================================== ======================================\n"
> +		   "   Size  Cache to cache               Cache to mem                           Mem to cache                            Mem to mem\n"
> +		   "(bytes)         (ticks)                    (ticks)                                (ticks)                               (ticks)\n"
> +		   "         RTE- STD(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)\n"
> +		   "------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   
> -	printf("\n================================= %2dB aligned =================================",
> +	printf("\n================================================================ %2dB aligned ===============================================================",
>   		ALIGNMENT_UNIT);
>   	/* Do aligned tests where size is a variable */
>   	timespec_get(&tv_begin, TIME_UTC);
> @@ -304,28 +353,28 @@ perf_test(void)
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_aligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   	/* Do aligned tests where size is a compile-time constant */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_constant_aligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_aligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n================================== Unaligned ==================================");
> +	printf("\n================================================================= Unaligned =================================================================");
>   	/* Do unaligned tests where size is a variable */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_variable_unaligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_unaligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   	/* Do unaligned tests where size is a compile-time constant */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_constant_unaligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_unaligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n======= ================= ================= ================= =================\n\n");
> +	printf("\n======= ================ ====================================== ====================================== ======================================\n\n");
>   
>   	printf("Test Execution Time (seconds):\n");
>   	printf("Aligned variable copy size   = %8.3f\n", time_aligned);
> diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
> index e7f0f8eaa9..b087f09c35 100644
> --- a/lib/eal/include/generic/rte_memcpy.h
> +++ b/lib/eal/include/generic/rte_memcpy.h
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #ifndef _RTE_MEMCPY_H_
> @@ -11,6 +12,13 @@
>    * Functions for vectorised implementation of memcpy().
>    */
>   
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
>   /**
>    * Copy 16 bytes from one location to another using optimised
>    * instructions. The locations should not overlap.
> @@ -113,4 +121,123 @@ rte_memcpy(void *dst, const void *src, size_t n);
>   
>   #endif /* __DOXYGEN__ */
>   
> +/*
> + * Advanced/Non-Temporal Memory Operations Flags.
> + */
> +
> +/** Length alignment hint mask. */
> +#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
> +/** Length alignment hint shift. */
> +#define RTE_MEMOPS_F_LENA_SHIFT 0
> +/** Hint: Length is 2 byte aligned. */
> +#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
> +/** Hint: Length is 4 byte aligned. */
> +#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
> +/** Hint: Length is 8 byte aligned. */
> +#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
> +/** Hint: Length is 16 byte aligned. */
> +#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
> +/** Hint: Length is 32 byte aligned. */
> +#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
> +/** Hint: Length is 64 byte aligned. */
> +#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
> +/** Hint: Length is 128 byte aligned. */
> +#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)
> +
> +/** Prefer non-temporal access to source memory area.
> + */
> +#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
> +/** Source address alignment hint mask. */
> +#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
> +/** Source address alignment hint shift. */
> +#define RTE_MEMOPS_F_SRCA_SHIFT 8
> +/** Hint: Source address is 2 byte aligned. */
> +#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
> +/** Hint: Source address is 4 byte aligned. */
> +#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
> +/** Hint: Source address is 8 byte aligned. */
> +#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
> +/** Hint: Source address is 16 byte aligned. */
> +#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
> +/** Hint: Source address is 32 byte aligned. */
> +#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
> +/** Hint: Source address is 64 byte aligned. */
> +#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
> +/** Hint: Source address is 128 byte aligned. */
> +#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)
> +
> +/** Prefer non-temporal access to destination memory area.
> + *
> + * On x86 architecture:
> + * Remember to call rte_wmb() after a sequence of copy operations.
> + */
> +#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
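> +/* Example (illustrative, names are placeholders): after a burst of
> + * non-temporal copies, the weakly ordered streaming stores should be
> + * fenced before the data is handed over to another agent:
> + *
> + *   for (i = 0; i < nb_pkts; i++)
> + *           rte_memcpy_ex(cap[i], pkt[i], len[i], RTE_MEMOPS_F_DST_NT);
> + *   rte_wmb();
> + */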
> +/** Destination address alignment hint mask. */
> +#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
> +/** Destination address alignment hint shift. */
> +#define RTE_MEMOPS_F_DSTA_SHIFT 16
> +/** Hint: Destination address is 2 byte aligned. */
> +#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
> +/** Hint: Destination address is 4 byte aligned. */
> +#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
> +/** Hint: Destination address is 8 byte aligned. */
> +#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
> +/** Hint: Destination address is 16 byte aligned. */
> +#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
> +/** Hint: Destination address is 32 byte aligned. */
> +#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
> +/** Hint: Destination address is 64 byte aligned. */
> +#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
> +/** Hint: Destination address is 128 byte aligned. */
> +#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Advanced/non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the destination memory area.
> + * @param src
> + *   Pointer to the source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)nnnA flags.
> + *   Must be constant at build time.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags);
> +
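> +/* Example (illustrative): copying into a capture buffer where the
> + * destination address and the length are known to be 16 byte aligned:
> + *
> + *   rte_memcpy_ex(dst, src, len,
> + *           RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_LEN16A);
> + */
> +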
> +#ifndef RTE_MEMCPY_EX_ARCH_DEFINED
> +
> +/* Fallback implementation, if no arch-specific implementation is provided. */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	RTE_SET_USED(flags);
> +	memcpy(dst, src, len);
> +}
> +
> +#endif /* RTE_MEMCPY_EX_ARCH_DEFINED */
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
>   #endif /* _RTE_MEMCPY_H_ */
> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index d4d7a5cfc8..31d0faf7a8 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #ifndef _RTE_MEMCPY_X86_64_H_
> @@ -17,6 +18,10 @@
>   #include <rte_vect.h>
>   #include <rte_common.h>
>   #include <rte_config.h>
> +#include <rte_debug.h>
> +
> +#define RTE_MEMCPY_EX_ARCH_DEFINED
> +#include "generic/rte_memcpy.h"
>   
>   #ifdef __cplusplus
>   extern "C" {
> @@ -868,6 +873,1239 @@ rte_memcpy(void *dst, const void *src, size_t n)
>   		return rte_memcpy_generic(dst, src, n);
>   }
>   
> +/*
> + * Advanced/Non-Temporal Memory Operations.
> + */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Workaround for _mm_stream_load_si128() missing const in the parameter.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__m128i _mm_stream_load_si128_const(const __m128i *const mem_addr)
> +{
> +	/* GCC 4.8.5 (in RHEL7) doesn't support the #pragma to ignore "-Wdiscarded-qualifiers".
> +	 * So we explicitly type cast mem_addr and use the #pragma to ignore "-Wcast-qual".
> +	 */
> +#if defined(RTE_TOOLCHAIN_GCC)
> +#pragma GCC diagnostic push
> +#pragma GCC diagnostic ignored "-Wcast-qual"
> +#elif defined(RTE_TOOLCHAIN_CLANG)
> +#pragma clang diagnostic push
> +#pragma clang diagnostic ignored "-Wcast-qual"
> +#endif
> +	return _mm_stream_load_si128((__m128i *)mem_addr);
> +#if defined(RTE_TOOLCHAIN_GCC)
> +#pragma GCC diagnostic pop
> +#elif defined(RTE_TOOLCHAIN_CLANG)
> +#pragma clang diagnostic pop
> +#endif
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Memory copy from non-temporal source area.
> + *
> + * @note
> + * Performance is optimal when the source pointer is 16 byte aligned.
> + *
> + * @param dst
> + *   Pointer to the destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(LEN|SRC)nnnA flags.
> + *   The RTE_MEMOPS_F_SRC_NT flag must be set.
> + *   The RTE_MEMOPS_F_DST_NT flag must be clear.
> + *   The RTE_MEMOPS_F_DSTnnnA flags are ignored.
> + *   Must be constant at build time.

Why do the flags need to be build-time constants?

> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nts(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_SRC_NT);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* If source is not 16 byte aligned, then copy first part of data via bounce buffer,
> +	 * to achieve 16 byte alignment of source pointer.
> +	 * This invalidates the source, destination and length alignment flags, and
> +	 * potentially makes the destination pointer unaligned.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned.
> +	 */
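> +	/* Worked example (illustrative): src = 0x1005 gives offset = 5 and
> +	 * first = 11; one aligned stream load from 0x1000 fills the bounce
> +	 * buffer, the 11 bytes at buffer[5..15] are copied to dst, and src
> +	 * is then 16 byte aligned for the main loop below.
> +	 */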
> +	if (!((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) {

An alternative to relying on compiler constant propagation to eliminate
conditionals when various things are known to be aligned would be to use
GCC's __builtin_assume_aligned().

The basic pattern then would look something like:

const void *aligned_source;
void *aligned_dst;
size_t aligned_len;

if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
	aligned_source = __builtin_assume_aligned(source, 16);
else
	aligned_source = source;

You would then go on to do the same for dst and len (with some uintptr_t
casting required).

After this, the code may be written as if the pointer alignments were not
known, and the compiler would properly eliminate any sections that deal
with the unaligned cases whenever the proper flags are set.

Another, more radical, change would be to just drop all the src, dst, and
len flags altogether, and provide a __builtin_assume_aligned() wrapper
instead, for the application to use, i.e.:

#define rte_assume_aligned(ptr, n) __builtin_assume_aligned(ptr, n)

With this API, the user code would look something like:

rte_memcpy_ex(rte_assume_aligned(my_dst, 16), my_src, len, 
RTE_MEMOPS_F_DST_NT);

...if it knew my_dst to have a particular alignment. The rte_memcpy_ex()
implementation wouldn't assume any particular alignment of any input
parameters.

> +		/* Source is not known to be 16 byte aligned, but might be. */
> +		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
> +		const size_t    offset = (uintptr_t)src & 15;

I would argue "(uintptr_t)src % 16" is more readable, and it generates 
the same code.

Sorry for breaking up the review into two parts.

> +
> +		if (offset) {
> +			/* Source is not 16 byte aligned. */
> +			char            buffer[16] __rte_aligned(16);
> +			/** How many bytes is source away from 16 byte alignment
> +			 * (ceiling rounding).
> +			 */
> +			const size_t    first = 16 - offset;
> +
> +			xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
> +			_mm_store_si128((void *)buffer, xmm0);
> +
> +			/* Test for short length.
> +			 *
> +			 * Omitted if length is known to be >= 16.
> +			 */
> +			if (!(__builtin_constant_p(len) && len >= 16) &&
> +					unlikely(len <= first)) {
> +				/* Short length. */
> +				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
> +				return;
> +			}
> +
> +			/* Copy until source pointer is 16 byte aligned. */
> +			rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), first);
> +			src = RTE_PTR_ADD(src, first);
> +			dst = RTE_PTR_ADD(dst, first);
> +			len -= first;
> +		}
> +	}
> +
> +	/* Source pointer is now 16 byte aligned. */
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 64 or 32 byte aligned.
> +	 */
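> +	/* After the 64 byte loop, len < 64, so each power-of-two bit of len
> +	 * identifies exactly one remaining chunk to copy.
> +	 */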
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_storeu_si128(dst, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, if any, via bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid) and length is known to be 16 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(len & 15)) {
> +		char    buffer[16] __rte_aligned(16);
> +
> +		xmm3 = _mm_stream_load_si128_const(src);
> +		_mm_store_si128((void *)buffer, xmm3);
> +		rte_mov15_or_less(dst, buffer, len & 15);
> +	}
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Memory copy to non-temporal destination area.
> + *
> + * @note
> + * If the destination and/or length is unaligned, the first and/or last copied
> + * bytes will be stored in the destination memory area using temporal access.
> + * @note
> + * Performance is optimal when the destination pointer is 16 byte aligned.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(LEN|DST)nnnA flags.
> + *   The RTE_MEMOPS_F_SRC_NT flag must be clear.
> + *   The RTE_MEMOPS_F_DST_NT flag must be set.
> + *   The RTE_MEMOPS_F_SRCnnnA flags are ignored.
> + *   Must be constant at build time.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ntd(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_DST_NT);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ||
> +			len >= 16) {
> +		/* Length >= 16 and/or destination is known to be 16 byte aligned. */
> +		register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +		/* If destination is not 16 byte aligned, then copy first part of data,
> +		 * to achieve 16 byte alignment of destination pointer.
> +		 * This invalidates the source, destination and length alignment flags, and
> +		 * potentially makes the source pointer unaligned.
> +		 *
> +		 * Omitted if destination is known to be 16 byte aligned.
> +		 */
> +		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)) {
> +			/* Destination is not known to be 16 byte aligned, but might be. */
> +			/** How many bytes is destination offset from 16 byte alignment
> +			 * (floor rounding).
> +			 */
> +			const size_t    offset = (uintptr_t)dst & 15;
> +
> +			if (offset) {
> +				/* Destination is not 16 byte aligned. */
> +				/** How many bytes is destination away from 16 byte alignment
> +				 * (ceiling rounding).
> +				 */
> +				const size_t    first = 16 - offset;
> +
> +				if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
> +						(offset & 3) == 0) {
> +					/* Destination is (known to be) 4 byte aligned. */
> +					int32_t r0, r1, r2;
> +
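> +					/* Destination is 4, but not 16, byte
> +					 * aligned here, so first is 4, 8 or 12;
> +					 * testing bits 8 and 4 of first covers
> +					 * all cases with NT dword stores.
> +					 */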
> +					/* Copy until destination pointer is 16 byte aligned. */
> +					if (first & 8) {
> +						memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
> +						memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
> +						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
> +						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
> +						src = RTE_PTR_ADD(src, 8);
> +						dst = RTE_PTR_ADD(dst, 8);
> +						len -= 8;
> +					}
> +					if (first & 4) {
> +						memcpy(&r2, src, 4);
> +						_mm_stream_si32(dst, r2);
> +						src = RTE_PTR_ADD(src, 4);
> +						dst = RTE_PTR_ADD(dst, 4);
> +						len -= 4;
> +					}
> +				} else {
> +					/* Destination is not 4 byte aligned. */
> +					/* Copy until destination pointer is 16 byte aligned. */
> +					rte_mov15_or_less(dst, src, first);
> +					src = RTE_PTR_ADD(src, first);
> +					dst = RTE_PTR_ADD(dst, first);
> +					len -= first;
> +				}
> +			}
> +		}
> +
> +		/* Destination pointer is now 16 byte aligned. */
> +		RTE_ASSERT(rte_is_aligned(dst, 16));
> +
> +		/* Copy large portion of data in chunks of 64 byte. */
> +		while (len >= 64) {
> +			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
> +			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
> +			xmm2 = _mm_loadu_si128(RTE_PTR_ADD(src, 2 * 16));
> +			xmm3 = _mm_loadu_si128(RTE_PTR_ADD(src, 3 * 16));
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +			src = RTE_PTR_ADD(src, 64);
> +			dst = RTE_PTR_ADD(dst, 64);
> +			len -= 64;
> +		}
> +
> +		/* Copy following 32 and 16 byte portions of data.
> +		 *
> +		 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +		 * flags are still valid)
> +		 * and length is known to be respectively 64 or 32 byte aligned.
> +		 */
> +		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +				(len & 32)) {
> +			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
> +			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +			src = RTE_PTR_ADD(src, 32);
> +			dst = RTE_PTR_ADD(dst, 32);
> +		}
> +		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +				(len & 16)) {
> +			xmm2 = _mm_loadu_si128(src);
> +			_mm_stream_si128(dst, xmm2);
> +			src = RTE_PTR_ADD(src, 16);
> +			dst = RTE_PTR_ADD(dst, 16);
> +		}
> +	} else {
> +		/* Length <= 15, and
> +		 * destination is not known to be 16 byte aligned (but might be).
> +		 */
> +		/* If destination is not 4 byte aligned, then
> +		 * use normal copy and return.
> +		 *
> +		 * Omitted if destination is known to be 4 byte aligned.
> +		 */
> +		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) &&
> +				!rte_is_aligned(dst, 4)) {
> +			/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
> +			rte_mov15_or_less(dst, src, len);
> +			return;
> +		}
> +		/* Destination is (known to be) 4 byte aligned. Proceed. */
> +	}
> +
> +	/* Destination pointer is now 4 byte (or 16 byte) aligned. */
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +
> +	/* Copy following 8 and 4 byte portions of data.
> +	 *
> +	 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 16 or 8 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(len & 8)) {
> +		int32_t r0, r1;
> +
> +		memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
> +		memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
> +		src = RTE_PTR_ADD(src, 8);
> +		dst = RTE_PTR_ADD(dst, 8);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A)) &&
> +			(len & 4)) {
> +		int32_t r2;
> +
> +		memcpy(&r2, src, 4);
> +		_mm_stream_si32(dst, r2);
> +		src = RTE_PTR_ADD(src, 4);
> +		dst = RTE_PTR_ADD(dst, 4);
> +	}
> +
> +	/* Copy remaining 2 and 1 byte portions of data.
> +	 *
> +	 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 4 and 2 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A)) &&
> +			(len & 2)) {
> +		int16_t r3;
> +
> +		memcpy(&r3, src, 2);
> +		*(int16_t *)dst = r3;
> +		src = RTE_PTR_ADD(src, 2);
> +		dst = RTE_PTR_ADD(dst, 2);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A)) &&
> +			(len & 1))
> +		*(char *)dst = *(const char *)src;
> +}
> +
> +/**
> + * Non-temporal memory copy of 15 byte or less
> + * from a 16 byte aligned source via bounce buffer.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Only the 4 least significant bits of this parameter are used;
> + *   they hold the number of remaining bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_15_or_less_s16a(void *__rte_restrict dst,
> +		const void *__rte_restrict src, size_t len, const uint64_t flags)
> +{
> +	int32_t             buffer[4] __rte_aligned(16);
> +	register __m128i    xmm0;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if ((len & 15) == 0)
> +		return;
> +
> +	/* Non-temporal load into bounce buffer. */
> +	xmm0 = _mm_stream_load_si128_const(src);
> +	_mm_store_si128((void *)buffer, xmm0);
> +
> +	/* Store from bounce buffer. */
> +	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
> +			rte_is_aligned(dst, 4)) {
> +		/* Destination is (known to be) 4 byte aligned. */
> +		src = (const void *)buffer;
> +		if (len & 8) {
> +#ifdef RTE_ARCH_X86_64
> +			if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) {
> +				/* Destination is known to be 8 byte aligned. */
> +				_mm_stream_si64(dst, *(const int64_t *)src);
> +			} else {
> +#endif /* RTE_ARCH_X86_64 */
> +				_mm_stream_si32(RTE_PTR_ADD(dst, 0), buffer[0]);
> +				_mm_stream_si32(RTE_PTR_ADD(dst, 4), buffer[1]);
> +#ifdef RTE_ARCH_X86_64
> +			}
> +#endif /* RTE_ARCH_X86_64 */
> +			src = RTE_PTR_ADD(src, 8);
> +			dst = RTE_PTR_ADD(dst, 8);
> +		}
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A) &&
> +				(len & 4)) {
> +			_mm_stream_si32(dst, *(const int32_t *)src);
> +			src = RTE_PTR_ADD(src, 4);
> +			dst = RTE_PTR_ADD(dst, 4);
> +		}
> +
> +		/* Non-temporal store is unavailable for the remaining 3 byte or less. */
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) &&
> +				(len & 2)) {
> +			*(int16_t *)dst = *(const int16_t *)src;
> +			src = RTE_PTR_ADD(src, 2);
> +			dst = RTE_PTR_ADD(dst, 2);
> +		}
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A) &&
> +				(len & 1)) {
> +			*(char *)dst = *(const char *)src;
> +		}
> +	} else {
> +		/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
> +		rte_mov15_or_less(dst, (const void *)buffer, len & 15);
> +	}
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 16 byte aligned addresses non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + *   Must be 16 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d16s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 16));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_stream_si128(dst, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ?
> +				flags : RTE_MEMOPS_F_DST16A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +
> +#ifdef RTE_ARCH_X86_64
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 8/16 byte aligned destination/source addresses non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + *   Must be 8 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d8s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	int64_t             buffer[8] __rte_cache_aligned /* at least __rte_aligned(16) */;
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 8));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
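> +	/* There is no 8 byte aligned 16 byte NT store, so the stream loaded
> +	 * data is staged in a cached bounce buffer and streamed to the
> +	 * destination with 8 byte _mm_stream_si64() stores.
> +	 */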
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
> +		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
> +		_mm_store_si128((void *)&buffer[3 * 2], xmm3);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 4 * 8), buffer[4]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 5 * 8), buffer[5]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 6 * 8), buffer[6]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 7 * 8), buffer[7]);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[4]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[5]);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) ?
> +				flags : RTE_MEMOPS_F_DST8A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +#endif /* RTE_ARCH_X86_64 */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 4/16 byte aligned destination/source addresses non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + *   Must be 4 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d4s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	int32_t             buffer[16] __rte_cache_aligned /* at least __rte_aligned(16) */;
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
> +		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
> +		_mm_store_si128((void *)&buffer[3 * 4], xmm3);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  0 * 4), buffer[0]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  1 * 4), buffer[1]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  2 * 4), buffer[2]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  3 * 4), buffer[3]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  4 * 4), buffer[4]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  5 * 4), buffer[5]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  6 * 4), buffer[6]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  7 * 4), buffer[7]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  8 * 4), buffer[8]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  9 * 4), buffer[9]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 10 * 4), buffer[10]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 11 * 4), buffer[11]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 12 * 4), buffer[12]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 13 * 4), buffer[13]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 14 * 4), buffer[14]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 15 * 4), buffer[15]);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 4 * 4), buffer[4]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 5 * 4), buffer[5]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 6 * 4), buffer[6]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 7 * 4), buffer[7]);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[8]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[9]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[10]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[11]);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ?
> +				flags : RTE_MEMOPS_F_DST4A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 4 byte aligned addresses (non-temporal) memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the (non-temporal) destination memory area.
> + *   Must be 4 byte aligned if using non-temporal store.
> + * @param src
> + *   Pointer to the (non-temporal) source memory area.
> + *   Must be 4 byte aligned if using non-temporal load.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d4s4a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	/** How many bytes is source offset from 16 byte alignment (floor rounding). */
> +	const size_t    offset = (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A ?
> +			0 : (uintptr_t)src & 15;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +	RTE_ASSERT(rte_is_aligned(src, 4));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	if (offset == 0) {
> +		/* Source is 16 byte aligned. */
> +		/* Copy everything, using upgraded source alignment flags. */
> +		rte_memcpy_nt_d4s16a(dst, src, len,
> +				(flags & ~RTE_MEMOPS_F_SRCA_MASK) | RTE_MEMOPS_F_SRC16A);
> +	} else {
> +		/* Source is not 16 byte aligned, so make it 16 byte aligned. */
> +		int32_t             buffer[4] __rte_aligned(16);
> +		const size_t        first = 16 - offset;
> +		register __m128i    xmm0;
> +
> +		/* First, copy first part of data in chunks of 4 byte,
> +		 * to achieve 16 byte alignment of source.
> +		 * This invalidates the source, destination and length alignment flags, and
> +		 * may change whether the destination pointer is 16 byte aligned.
> +		 */
> +
> +		/** Copy from 16 byte aligned source pointer (floor rounding). */
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
> +		_mm_store_si128((void *)buffer, xmm0);
> +
> +		if (unlikely(len + offset <= 16)) {
> +			/* Short length. */
> +			if (((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ||
> +					(len & 3) == 0) {
> +				/* Length is 4 byte aligned. */
> +				switch (len) {
> +				case 1 * 4:
> +					/* Offset can be 1 * 4, 2 * 4 or 3 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
> +							buffer[offset / 4]);
> +					break;
> +				case 2 * 4:
> +					/* Offset can be 1 * 4 or 2 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
> +							buffer[offset / 4]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4),
> +							buffer[offset / 4 + 1]);
> +					break;
> +				case 3 * 4:
> +					/* Offset can only be 1 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
> +					break;
> +				}
> +			} else {
> +				/* Length is not 4 byte aligned. */
> +				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
> +			}
> +			return;
> +		}
> +
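> +		/* The bounce buffer holds the 16 bytes at src - offset, so the
> +		 * head of the data occupies its last first / 4 dwords, i.e.
> +		 * buffer[offset / 4] through buffer[3].
> +		 */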
> +		switch (first) {
> +		case 1 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
> +			break;
> +		case 2 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
> +			break;
> +		case 3 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
> +			break;
> +		}
> +
> +		src = RTE_PTR_ADD(src, first);
> +		dst = RTE_PTR_ADD(dst, first);
> +		len -= first;
> +
> +		/* Source pointer is now 16 byte aligned. */
> +		RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +		/* Then, copy the rest, using corrected alignment flags. */
> +		if (rte_is_aligned(dst, 16))
> +			rte_memcpy_nt_d16s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +#ifdef RTE_ARCH_X86_64
> +		else if (rte_is_aligned(dst, 8))
> +			rte_memcpy_nt_d8s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST8A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +#endif /* RTE_ARCH_X86_64 */
> +		else
> +			rte_memcpy_nt_d4s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST4A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +	}
> +}
> +
> +#ifndef RTE_MEMCPY_NT_BUFSIZE
> +
> +#include <lib/mbuf/rte_mbuf_core.h>
> +
> +/** Bounce buffer size for non-temporal memcpy.
> + *
> + * Must be 2^N and >= 128.
> + * The actual buffer will be slightly larger, due to added padding.
> + * The default is chosen to be able to handle a non-segmented packet.
> + */
> +#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM
> +
> +#endif  /* RTE_MEMCPY_NT_BUFSIZE */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Non-temporal memory copy via bounce buffer.
> + *
> + * @note
> + * If the destination and/or length is unaligned, the first and/or last copied
> + * bytes will be stored in the destination memory area using temporal access.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_buf(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	/** Cache line aligned bounce buffer with preceding and trailing padding.
> +	 *
> +	 * The preceding padding is one cache line, so the data area itself
> +	 * is cache line aligned.
> +	 * The trailing padding is 16 bytes, leaving room for the trailing bytes
> +	 * of a 16 byte store operation.
> +	 */
> +	char			buffer[RTE_CACHE_LINE_SIZE + RTE_MEMCPY_NT_BUFSIZE +  16]
> +				__rte_cache_aligned;
> +	/** Pointer to bounce buffer's aligned data area. */
> +	char		* const buf0 = &buffer[RTE_CACHE_LINE_SIZE];
> +	void		       *buf;
> +	/** Number of bytes to copy from source, incl. any extra preceding bytes. */
> +	size_t			srclen;
> +	register __m128i	xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Step 1:
> +	 * Copy data from the source to the bounce buffer's aligned data area,
> +	 * using aligned non-temporal load from the source,
> +	 * and unaligned store in the bounce buffer.
> +	 *
> +	 * If the source is unaligned, the additional bytes preceding the data will be copied
> +	 * to the padding area preceding the bounce buffer's aligned data area.
> +	 * Similarly, if the source data ends at an unaligned address, the additional bytes
> +	 * trailing the data will be copied to the padding area trailing the bounce buffer's
> +	 * aligned data area.
> +	 */
> +
> +	/* Adjust for extra preceding bytes, unless source is known to be 16 byte aligned. */
> +	if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) {
> +		buf = buf0;
> +		srclen = len;
> +	} else {
> +		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
> +		const size_t offset = (uintptr_t)src & 15;
> +
> +		buf = RTE_PTR_SUB(buf0, offset);
> +		src = RTE_PTR_SUB(src, offset);
> +		srclen = len + offset;
> +	}
> +
> +	/* Copy large portion of data from source to bounce buffer in chunks of 64 byte. */
> +	while (srclen >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		buf = RTE_PTR_ADD(buf, 64);
> +		srclen -= 64;
> +	}
> +
> +	/* Copy remaining 32 and 16 byte portions of data from source to bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the length alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +			(srclen & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		buf = RTE_PTR_ADD(buf, 32);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +			(srclen & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_storeu_si128(buf, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		buf = RTE_PTR_ADD(buf, 16);
> +	}
> +	/* Copy any trailing bytes of data from source to bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the length alignment
> +	 * flags are still valid)
> +	 * and length is known to be 16 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(srclen & 15)) {
> +		xmm3 = _mm_stream_load_si128_const(src);
> +		_mm_storeu_si128(buf, xmm3);
> +	}
> +
> +	/* Step 2:
> +	 * Copy from the aligned bounce buffer to the non-temporal destination.
> +	 */
> +	rte_memcpy_ntd(dst, buf0, len,
> +			(flags & ~(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRCA_MASK)) |
> +			(RTE_CACHE_LINE_SIZE << RTE_MEMOPS_F_SRCA_SHIFT));
> +}
> +
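To make the two steps concrete with a worked example: for len = 100 with the source 7 bytes past a 16 byte boundary, step 1 rounds the source pointer down and issues 16 byte aligned non-temporal loads covering 64 + 32 + 16 = 112 bytes into the bounce buffer; the 7 leading and 5 trailing extra bytes land in the preceding and trailing padding areas, which is exactly what the padding is sized for. Step 2 then copies the 100 payload bytes from the cache line aligned data area to the destination.
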
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @note
> + * If the destination and/or length is unaligned, some copied bytes will be
> + * stored in the destination memory area using temporal access.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_generic(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +
> +	while (len > RTE_MEMCPY_NT_BUFSIZE) {
> +		rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE,
> +				(flags & ~RTE_MEMOPS_F_LENA_MASK) | RTE_MEMOPS_F_LEN128A);
> +		dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
> +		src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
> +		len -= RTE_MEMCPY_NT_BUFSIZE;
> +	}
> +	rte_memcpy_nt_buf(dst, src, len, flags);
> +}
> +
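Worked example: with the default RTE_MEMCPY_NT_BUFSIZE of RTE_MBUF_DEFAULT_DATAROOM (2048 bytes), a 5000 byte copy is split into two full 2048 byte bounce-buffer passes, each with the length hint forced to RTE_MEMOPS_F_LEN128A (valid because the buffer size must be a power of two >= 128), followed by a final 904 byte pass that keeps the caller's original flags.
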
> +/* Implementation. Refer to function declaration for documentation. */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	if ((flags & (RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) ==
> +			(RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) {
> +		/* Copy between non-temporal source and destination. */
> +		if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d16s16a(dst, src, len, flags);
> +#ifdef RTE_ARCH_X86_64
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d8s16a(dst, src, len, flags);
> +#endif /* RTE_ARCH_X86_64 */
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d4s16a(dst, src, len, flags);
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
> +			rte_memcpy_nt_d4s4a(dst, src, len, flags);
> +		else if (len <= RTE_MEMCPY_NT_BUFSIZE)
> +			rte_memcpy_nt_buf(dst, src, len, flags);
> +		else
> +			rte_memcpy_nt_generic(dst, src, len, flags);
> +	} else if (flags & RTE_MEMOPS_F_SRC_NT) {
> +		/* Copy from non-temporal source. */
> +		rte_memcpy_nts(dst, src, len, flags);
> +	} else if (flags & RTE_MEMOPS_F_DST_NT) {
> +		/* Copy to non-temporal destination. */
> +		rte_memcpy_ntd(dst, src, len, flags);
> +	} else
> +		rte_memcpy(dst, src, len);
> +}
> +
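Because the flags must be compile-time constants (enforced by the RTE_BUILD_BUG_ON above), callers will typically wrap rte_memcpy_ex() once per use case. A minimal sketch, with a hypothetical wrapper name:

	static inline void
	copy_to_capture(void *dst, const void *src, size_t len)
	{
		/* Bypass the cache on both the load and the store side. */
		rte_memcpy_ex(dst, src, len,
				RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
	}

Remember rte_wmb() after a burst of such copies, per the RTE_MEMOPS_F_DST_NT documentation.
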
>   #undef ALIGNMENT_MASK
>   
>   #if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
> index a2307cebe6..aa96fb4cc8 100644
> --- a/lib/mbuf/rte_mbuf.c
> +++ b/lib/mbuf/rte_mbuf.c
> @@ -660,6 +660,83 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
>   	return mc;
>   }
>   
> +/* Create a deep copy of mbuf, using non-temporal memory access */
> +struct rte_mbuf *
> +rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
> +		 uint32_t off, uint32_t len, const uint64_t flags)
> +{
> +	const struct rte_mbuf *seg = m;
> +	struct rte_mbuf *mc, *m_last, **prev;
> +
> +	/* garbage in check */
> +	__rte_mbuf_sanity_check(m, 1);
> +
> +	/* check for request to copy at offset past end of mbuf */
> +	if (unlikely(off >= m->pkt_len))
> +		return NULL;
> +
> +	mc = rte_pktmbuf_alloc(mp);
> +	if (unlikely(mc == NULL))
> +		return NULL;
> +
> +	/* truncate requested length to available data */
> +	if (len > m->pkt_len - off)
> +		len = m->pkt_len - off;
> +
> +	__rte_pktmbuf_copy_hdr(mc, m);
> +
> +	/* copied mbuf is not indirect or external */
> +	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
> +
> +	prev = &mc->next;
> +	m_last = mc;
> +	while (len > 0) {
> +		uint32_t copy_len;
> +
> +		/* skip leading mbuf segments */
> +		while (off >= seg->data_len) {
> +			off -= seg->data_len;
> +			seg = seg->next;
> +		}
> +
> +		/* current buffer is full, chain a new one */
> +		if (rte_pktmbuf_tailroom(m_last) == 0) {
> +			m_last = rte_pktmbuf_alloc(mp);
> +			if (unlikely(m_last == NULL)) {
> +				rte_pktmbuf_free(mc);
> +				return NULL;
> +			}
> +			++mc->nb_segs;
> +			*prev = m_last;
> +			prev = &m_last->next;
> +		}
> +
> +		/*
> +		 * copy the min of data in input segment (seg)
> +		 * vs space available in output (m_last)
> +		 */
> +		copy_len = RTE_MIN(seg->data_len - off, len);
> +		if (copy_len > rte_pktmbuf_tailroom(m_last))
> +			copy_len = rte_pktmbuf_tailroom(m_last);
> +
> +		/* append from seg to m_last */
> +		rte_memcpy_ex(rte_pktmbuf_mtod_offset(m_last, char *,
> +						   m_last->data_len),
> +			   rte_pktmbuf_mtod_offset(seg, char *, off),
> +			   copy_len, flags);
> +
> +		/* update offsets and lengths */
> +		m_last->data_len += copy_len;
> +		mc->pkt_len += copy_len;
> +		off += copy_len;
> +		len -= copy_len;
> +	}
> +
> +	/* garbage out check */
> +	__rte_mbuf_sanity_check(mc, 1);
> +	return mc;
> +}
> +
>   /* dump a mbuf on console */
>   void
>   rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index b6e23d98ce..030df396a3 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -1443,6 +1443,38 @@ struct rte_mbuf *
>   rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
>   		 uint32_t offset, uint32_t length);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Create a full copy of a given packet mbuf,
> + * using non-temporal memory access as specified by flags.
> + *
> + * Copies all the data from a given packet mbuf to a newly allocated
> + * set of mbufs. The private data is not copied.
> + *
> + * @param m
> + *   The packet mbuf to be copied.
> + * @param mp
> + *   The mempool from which the "clone" mbufs are allocated.
> + * @param offset
> + *   The number of bytes to skip before copying.
> + *   If the mbuf does not have that many bytes, it is an error
> + *   and NULL is returned.
> + * @param length
> + *   The upper limit on bytes to copy.  Passing UINT32_MAX
> + *   means all data (after offset).
> + * @param flags
> + *   Non-temporal memory access hints for rte_memcpy_ex.
> + * @return
> + *   - The pointer to the new "clone" mbuf on success.
> + *   - NULL if allocation fails.
> + */
> +__rte_experimental
> +struct rte_mbuf *
> +rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
> +		    uint32_t offset, uint32_t length, const uint64_t flags);
> +
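For illustration, a hedged usage sketch of the new function (capture_pool and snaplen are hypothetical names, mirroring the pdump usage below):

	struct rte_mbuf *mc;

	mc = rte_pktmbuf_copy_ex(m, capture_pool, 0, snaplen,
				 RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
	if (unlikely(mc == NULL))
		return;	/* mempool exhausted, or offset past end of data */
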
>   /**
>    * Adds given value to the refcnt of all packet mbuf segments.
>    *
> diff --git a/lib/mbuf/version.map b/lib/mbuf/version.map
> index ed486ed14e..b583364ad4 100644
> --- a/lib/mbuf/version.map
> +++ b/lib/mbuf/version.map
> @@ -47,5 +47,6 @@ EXPERIMENTAL {
>   	global:
>   
>   	rte_pktmbuf_pool_create_extbuf;
> +	rte_pktmbuf_copy_ex;
>   
>   };
> diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
> index af2b814251..ae871c4865 100644
> --- a/lib/pcapng/rte_pcapng.c
> +++ b/lib/pcapng/rte_pcapng.c
> @@ -466,7 +466,8 @@ rte_pcapng_copy(uint16_t port_id, uint32_t queue,
>   	orig_len = rte_pktmbuf_pkt_len(md);
>   
>   	/* Take snapshot of the data */
> -	mc = rte_pktmbuf_copy(md, mp, 0, length);
> +	mc = rte_pktmbuf_copy_ex(md, mp, 0, length,
> +				 RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
>   	if (unlikely(mc == NULL))
>   		return NULL;
>   
> diff --git a/lib/pdump/rte_pdump.c b/lib/pdump/rte_pdump.c
> index 98dcbc037b..6e61c75407 100644
> --- a/lib/pdump/rte_pdump.c
> +++ b/lib/pdump/rte_pdump.c
> @@ -124,7 +124,8 @@ pdump_copy(uint16_t port_id, uint16_t queue,
>   					    pkts[i], mp, cbs->snaplen,
>   					    ts, direction);
>   		else
> -			p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
> +			p = rte_pktmbuf_copy_ex(pkts[i], mp, 0, cbs->snaplen,
> +						RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
>   
>   		if (unlikely(p == NULL))
>   			__atomic_fetch_add(&stats->nombuf, 1, __ATOMIC_RELAXED);
> @@ -134,6 +135,9 @@ pdump_copy(uint16_t port_id, uint16_t queue,
>   
>   	__atomic_fetch_add(&stats->accepted, d_pkts, __ATOMIC_RELAXED);
>   
> +	/* Flush the non-temporal stores of the packet copies. */
> +	rte_wmb();
> +
>   	ring_enq = rte_ring_enqueue_burst(ring, (void *)dup_bufs, d_pkts, NULL);
>   	if (unlikely(ring_enq < d_pkts)) {
>   		unsigned int drops = d_pkts - ring_enq;
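
The rte_wmb() added here is needed for correctness, not just performance: the non-temporal stores used for the packet copies are weakly ordered on x86, so the write barrier makes the copied data globally visible before the ring enqueue publishes the mbuf pointers to the consumer.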
  
Thomas Monjalon July 31, 2023, 12:14 p.m. UTC | #3
Hello,

What's the status of this feature?


10/10/2022 08:46, Morten Brørup:
> [...]
  
Morten Brørup July 31, 2023, 12:25 p.m. UTC | #4
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Monday, 31 July 2023 14.14
> 
> Hello,
> 
> What's the status of this feature?

I haven't given up on upstreaming this feature, but there doesn't seem to be much demand for it, so working on it has low priority.

> [...]
  
Mattias Rönnblom Aug. 4, 2023, 5:49 a.m. UTC | #5
On 2023-07-31 14:25, Morten Brørup wrote:
>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>> Sent: Monday, 31 July 2023 14.14
>>
>> Hello,
>>
>> What's the status of this feature?
> 
> I haven't given up on upstreaming this feature, but there doesn't seem to be much demand for it, so working on it has low priority.
> 

This would definitely be a useful addition to the EAL, IMO.

It's also a case where it's difficult to provide a generic and portable
solution with both good performance and reasonable semantics. The upside
is that you seem to have come pretty far already.

>> [...]
  

Patch

diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
index 1ab86f4967..12410ce413 100644
--- a/app/test/test_memcpy.c
+++ b/app/test/test_memcpy.c
@@ -1,5 +1,6 @@ 
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #include <stdint.h>
@@ -36,6 +37,19 @@  static size_t buf_sizes[TEST_VALUE_RANGE];
 /* Data is aligned on this many bytes (power of 2) */
 #define ALIGNMENT_UNIT          32
 
+const uint64_t nt_mode_flags[4] = {
+	0,
+	RTE_MEMOPS_F_SRC_NT,
+	RTE_MEMOPS_F_DST_NT,
+	RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT
+};
+const char * const nt_mode_str[4] = {
+	"none",
+	"src",
+	"dst",
+	"src+dst"
+};
+
 
 /*
  * Create two buffers, and initialise one with random values. These are copied
@@ -44,12 +58,13 @@  static size_t buf_sizes[TEST_VALUE_RANGE];
  * changed.
  */
 static int
-test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
+test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size, unsigned int nt_mode)
 {
 	unsigned int i;
 	uint8_t dest[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
 	uint8_t src[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
 	void * ret;
+	const uint64_t flags = nt_mode_flags[nt_mode];
 
 	/* Setup buffers */
 	for (i = 0; i < SMALL_BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
@@ -58,18 +73,23 @@  test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	}
 
 	/* Do the copy */
-	ret = rte_memcpy(dest + off_dst, src + off_src, size);
-	if (ret != (dest + off_dst)) {
-		printf("rte_memcpy() returned %p, not %p\n",
-		       ret, dest + off_dst);
+	if (nt_mode) {
+		rte_memcpy_ex(dest + off_dst, src + off_src, size, flags);
+	} else {
+		ret = rte_memcpy(dest + off_dst, src + off_src, size);
+		if (ret != (dest + off_dst)) {
+			printf("rte_memcpy() returned %p, not %p\n",
+			       ret, dest + off_dst);
+		}
 	}
 
 	/* Check nothing before offset is affected */
 	for (i = 0; i < off_dst; i++) {
 		if (dest[i] != 0) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
 			       "[modified before start of dst].\n",
-			       (unsigned)size, off_src, off_dst);
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
 			return -1;
 		}
 	}
@@ -77,9 +97,11 @@  test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	/* Check everything was copied */
 	for (i = 0; i < size; i++) {
 		if (dest[i + off_dst] != src[i + off_src]) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
-			       "[didn't copy byte %u].\n",
-			       (unsigned)size, off_src, off_dst, i);
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
+			       "[didn't copy byte %u: 0x%02x!=0x%02x].\n",
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode], i,
+			       dest[i + off_dst], src[i + off_src]);
 			return -1;
 		}
 	}
@@ -87,9 +109,10 @@  test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	/* Check nothing after copy was affected */
 	for (i = size; i < SMALL_BUFFER_SIZE; i++) {
 		if (dest[i + off_dst] != 0) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
 			       "[copied too many].\n",
-			       (unsigned)size, off_src, off_dst);
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
 			return -1;
 		}
 	}
@@ -102,16 +125,18 @@  test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 static int
 func_test(void)
 {
-	unsigned int off_src, off_dst, i;
+	unsigned int off_src, off_dst, i, nt_mode;
 	int ret;
 
-	for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
-		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
-			for (i = 0; i < RTE_DIM(buf_sizes); i++) {
-				ret = test_single_memcpy(off_src, off_dst,
-				                         buf_sizes[i]);
-				if (ret != 0)
-					return -1;
+	for (nt_mode = 0; nt_mode < 4; nt_mode++) {
+		for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
+			for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
+				for (i = 0; i < RTE_DIM(buf_sizes); i++) {
+					ret = test_single_memcpy(off_src, off_dst,
+								 buf_sizes[i], nt_mode);
+					if (ret != 0)
+						return -1;
+				}
 			}
 		}
 	}
diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 3727c160e6..6bb52cba88 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -1,5 +1,6 @@ 
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #include <stdint.h>
@@ -15,6 +16,7 @@ 
 #include <rte_malloc.h>
 
 #include <rte_memcpy.h>
+#include <rte_atomic.h>
 
 #include "test.h"
 
@@ -27,9 +29,9 @@ 
 /* List of buffer sizes to test */
 #if TEST_VALUE_RANGE == 0
 static size_t buf_sizes[] = {
-	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
-	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
-	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
+	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 40, 48, 60, 63, 64, 65, 80, 92, 124,
+	127, 128, 129, 140, 152, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447,
+	448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
 	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
 };
 /* MUST be as large as largest packet size above */
@@ -72,7 +74,7 @@  static uint8_t *small_buf_read, *small_buf_write;
 static int
 init_buffers(void)
 {
-	unsigned i;
+	unsigned int i;
 
 	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (large_buf_read == NULL)
@@ -151,7 +153,7 @@  static void
 do_uncached_write(uint8_t *dst, int is_dst_cached,
 				  const uint8_t *src, int is_src_cached, size_t size)
 {
-	unsigned i, j;
+	unsigned int i, j;
 	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
 
 	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
@@ -167,66 +169,112 @@  do_uncached_write(uint8_t *dst, int is_dst_cached,
  * Run a single memcpy performance test. This is a macro to ensure that if
  * the "size" parameter is a constant it won't be converted to a variable.
  */
-#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,                   \
-                         src, is_src_cached, src_uoffset, size)             \
-do {                                                                        \
-    unsigned int iter, t;                                                   \
-    size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
-    uint64_t start_time, total_time = 0;                                    \
-    uint64_t total_time2 = 0;                                               \
-    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
-        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
-                         src_addrs, is_src_cached, src_uoffset);            \
-        start_time = rte_rdtsc();                                           \
-        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
-            rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
-        total_time += rte_rdtsc() - start_time;                             \
-    }                                                                       \
-    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
-        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
-                         src_addrs, is_src_cached, src_uoffset);            \
-        start_time = rte_rdtsc();                                           \
-        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
-            memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
-        total_time2 += rte_rdtsc() - start_time;                            \
-    }                                                                       \
-    printf("%3.0f -", (double)total_time  / TEST_ITERATIONS);                 \
-    printf("%3.0f",   (double)total_time2 / TEST_ITERATIONS);                 \
-    printf("(%6.2f%%) ", ((double)total_time - total_time2)*100/total_time2); \
+#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,					  \
+			 src, is_src_cached, src_uoffset, size)					  \
+do {												  \
+	unsigned int iter, t;									  \
+	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];				  \
+	uint64_t start_time;									  \
+	uint64_t total_time_rte = 0, total_time_std = 0;					  \
+	uint64_t total_time_ntd = 0, total_time_nts = 0, total_time_nt = 0;			  \
+	const uint64_t flags = ((dst_uoffset == 0) ?						  \
+				(ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT) : 0) |		  \
+			       ((src_uoffset == 0) ?						  \
+				(ALIGNMENT_UNIT << RTE_MEMOPS_F_SRCA_SHIFT) : 0);		  \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
+				 src_addrs, is_src_cached, src_uoffset);			  \
+		start_time = rte_rdtsc();							  \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
+			rte_memcpy(dst + dst_addrs[t], src + src_addrs[t], size);		  \
+		total_time_rte += rte_rdtsc() - start_time;					  \
+	}											  \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
+				 src_addrs, is_src_cached, src_uoffset);			  \
+		start_time = rte_rdtsc();							  \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
+			memcpy(dst + dst_addrs[t], src + src_addrs[t], size);			  \
+		total_time_std += rte_rdtsc() - start_time;					  \
+	}											  \
+	if (!(is_dst_cached && is_src_cached)) {						  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_DST_NT);			  \
+			total_time_ntd += rte_rdtsc() - start_time;				  \
+		}										  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_SRC_NT);			  \
+			total_time_nts += rte_rdtsc() - start_time;				  \
+		}										  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT); \
+			total_time_nt += rte_rdtsc() - start_time;				  \
+		}										  \
+	}											  \
+	printf(" %4.0f-", (double)total_time_rte / TEST_ITERATIONS);				  \
+	printf("%4.0f",   (double)total_time_std / TEST_ITERATIONS);				  \
+	printf("(%+4.0f%%)", ((double)total_time_rte - total_time_std) * 100 / total_time_std);   \
+	if (!(is_dst_cached && is_src_cached)) {						  \
+		printf(" %4.0f", (double)total_time_ntd / TEST_ITERATIONS);			  \
+		printf(" %4.0f", (double)total_time_nts / TEST_ITERATIONS);			  \
+		printf(" %4.0f", (double)total_time_nt / TEST_ITERATIONS);			  \
+		if (total_time_nt / total_time_std > 9)						  \
+			printf("(*%4.1f)", (double)total_time_nt / total_time_std);		  \
+		else										  \
+			printf("(%+4.0f%%)",							  \
+			       ((double)total_time_nt - total_time_std) * 100 / total_time_std);  \
+	}											  \
 } while (0)
 
 /* Run aligned memcpy tests for each cached/uncached permutation */
-#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
-do {                                                                     \
-    if (__builtin_constant_p(n))                                         \
-        printf("\nC%6u", (unsigned)n);                                   \
-    else                                                                 \
-        printf("\n%7u", (unsigned)n);                                    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);    \
+#define ALL_PERF_TESTS_FOR_SIZE(n)						\
+do {										\
+	if (__builtin_constant_p(n))						\
+		printf("\nC%6u", (unsigned int)n);				\
+	else									\
+		printf("\n%7u", (unsigned int)n);				\
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);	\
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);	\
 } while (0)
 
 /* Run unaligned memcpy tests for each cached/uncached permutation */
-#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
-do {                                                                     \
-    if (__builtin_constant_p(n))                                         \
-        printf("\nC%6u", (unsigned)n);                                   \
-    else                                                                 \
-        printf("\n%7u", (unsigned)n);                                    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);    \
+#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)					\
+do {										\
+	if (__builtin_constant_p(n))						\
+		printf("\nC%6u", (unsigned int)n);				\
+	else									\
+		printf("\n%7u", (unsigned int)n);				\
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);	\
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);	\
 } while (0)
 
 /* Run memcpy tests for constant length */
-#define ALL_PERF_TEST_FOR_CONSTANT                                      \
-do {                                                                    \
-    TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);         \
-    TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);      \
-    TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);    \
+#define ALL_PERF_TEST_FOR_CONSTANT						\
+do {										\
+	TEST_CONSTANT(4U); TEST_CONSTANT(6U); TEST_CONSTANT(8U);		\
+	TEST_CONSTANT(16U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);		\
+	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);		\
+	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);	\
+	TEST_CONSTANT(2048U);							\
 } while (0)
 
 /* Run all memcpy tests for aligned constant cases */
@@ -251,7 +299,7 @@  perf_test_constant_unaligned(void)
 static inline void
 perf_test_variable_aligned(void)
 {
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
 		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
 	}
@@ -261,7 +309,7 @@  perf_test_variable_aligned(void)
 static inline void
 perf_test_variable_unaligned(void)
 {
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
 		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
 	}
@@ -282,7 +330,7 @@  perf_test(void)
 
 #if TEST_VALUE_RANGE != 0
 	/* Set up buf_sizes array, if required */
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < TEST_VALUE_RANGE; i++)
 		buf_sizes[i] = i;
 #endif
@@ -290,13 +338,14 @@  perf_test(void)
 	/* See function comment */
 	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
 
-	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
-		   "======= ================= ================= ================= =================\n"
-		   "   Size   Cache to cache     Cache to mem      Mem to cache        Mem to mem\n"
-		   "(bytes)          (ticks)          (ticks)           (ticks)           (ticks)\n"
-		   "------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n** rte_memcpy(RTE)/memcpy(STD)/rte_memcpy_ex(NTD/NTS/NT) - memcpy perf. tests (C = compile-time constant) **\n"
+		   "======= ================ ====================================== ====================================== ======================================\n"
+		   "   Size  Cache to cache               Cache to mem                           Mem to cache                            Mem to mem\n"
+		   "(bytes)         (ticks)                    (ticks)                                (ticks)                               (ticks)\n"
+		   "         RTE- STD(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)\n"
+		   "------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 
-	printf("\n================================= %2dB aligned =================================",
+	printf("\n================================================================ %2dB aligned ===============================================================",
 		ALIGNMENT_UNIT);
 	/* Do aligned tests where size is a variable */
 	timespec_get(&tv_begin, TIME_UTC);
@@ -304,28 +353,28 @@  perf_test(void)
 	timespec_get(&tv_end, TIME_UTC);
 	time_aligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 	/* Do aligned tests where size is a compile-time constant */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_constant_aligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_aligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n================================== Unaligned ==================================");
+	printf("\n================================================================= Unaligned =================================================================");
 	/* Do unaligned tests where size is a variable */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_variable_unaligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_unaligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 	/* Do unaligned tests where size is a compile-time constant */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_constant_unaligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_unaligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n======= ================= ================= ================= =================\n\n");
+	printf("\n======= ================ ====================================== ====================================== ======================================\n\n");
 
 	printf("Test Execution Time (seconds):\n");
 	printf("Aligned variable copy size   = %8.3f\n", time_aligned);
diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
index e7f0f8eaa9..b087f09c35 100644
--- a/lib/eal/include/generic/rte_memcpy.h
+++ b/lib/eal/include/generic/rte_memcpy.h
@@ -1,5 +1,6 @@ 
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_H_
@@ -11,6 +12,13 @@ 
  * Functions for vectorised implementation of memcpy().
  */
 
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
 /**
  * Copy 16 bytes from one location to another using optimised
  * instructions. The locations should not overlap.
@@ -113,4 +121,123 @@  rte_memcpy(void *dst, const void *src, size_t n);
 
 #endif /* __DOXYGEN__ */
 
+/*
+ * Advanced/Non-Temporal Memory Operations Flags.
+ */
+
+/** Length alignment hint mask. */
+#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
+/** Length alignment hint shift. */
+#define RTE_MEMOPS_F_LENA_SHIFT 0
+/** Hint: Length is 2 byte aligned. */
+#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
+/** Hint: Length is 4 byte aligned. */
+#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
+/** Hint: Length is 8 byte aligned. */
+#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
+/** Hint: Length is 16 byte aligned. */
+#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
+/** Hint: Length is 32 byte aligned. */
+#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
+/** Hint: Length is 64 byte aligned. */
+#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
+/** Hint: Length is 128 byte aligned. */
+#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)
+
+/** Prefer non-temporal access to source memory area.
+ */
+#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
+/** Source address alignment hint mask. */
+#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
+/** Source address alignment hint shift. */
+#define RTE_MEMOPS_F_SRCA_SHIFT 8
+/** Hint: Source address is 2 byte aligned. */
+#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
+/** Hint: Source address is 4 byte aligned. */
+#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
+/** Hint: Source address is 8 byte aligned. */
+#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
+/** Hint: Source address is 16 byte aligned. */
+#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
+/** Hint: Source address is 32 byte aligned. */
+#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
+/** Hint: Source address is 64 byte aligned. */
+#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
+/** Hint: Source address is 128 byte aligned. */
+#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)
+
+/** Prefer non-temporal access to destination memory area.
+ *
+ * On x86 architecture:
+ * Remember to call rte_wmb() after a sequence of copy operations.
+ */
+#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
+/** Destination address alignment hint mask. */
+#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
+/** Destination address alignment hint shift. */
+#define RTE_MEMOPS_F_DSTA_SHIFT 16
+/** Hint: Destination address is 2 byte aligned. */
+#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
+/** Hint: Destination address is 4 byte aligned. */
+#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
+/** Hint: Destination address is 8 byte aligned. */
+#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
+/** Hint: Destination address is 16 byte aligned. */
+#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
+/** Hint: Destination address is 32 byte aligned. */
+#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
+/** Hint: Destination address is 64 byte aligned. */
+#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
+/** Hint: Destination address is 128 byte aligned. */
+#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)
+
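The alignment hints above encode the alignment in bytes directly in the flag word, so they can be composed and extracted without lookup tables. A small sketch (not part of the patch):

	/* Compose: non-temporal, 64 byte aligned destination; length is a
	 * multiple of 16.
	 */
	const uint64_t flags = RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_DST64A |
			RTE_MEMOPS_F_LEN16A;

	/* Extract the destination alignment in bytes (here: 64). */
	const size_t dst_align = (flags & RTE_MEMOPS_F_DSTA_MASK) >>
			RTE_MEMOPS_F_DSTA_SHIFT;
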
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Advanced/non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the destination memory area.
+ * @param src
+ *   Pointer to the source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)nnnA flags.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags);
+
+#ifndef RTE_MEMCPY_EX_ARCH_DEFINED
+
+/* Fallback implementation, if no arch-specific implementation is provided. */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	RTE_SET_USED(flags);
+	memcpy(dst, src, len);
+}
+
+#endif /* RTE_MEMCPY_EX_ARCH_DEFINED */
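
Note that this fallback ignores the hints entirely and degrades to a plain memcpy(), so code using rte_memcpy_ex() stays functionally correct on architectures without a non-temporal implementation; only the cache-bypassing effect is lost.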
+
+#ifdef __cplusplus
+}
+#endif
+
 #endif /* _RTE_MEMCPY_H_ */
diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index d4d7a5cfc8..31d0faf7a8 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -1,5 +1,6 @@ 
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_X86_64_H_
@@ -17,6 +18,10 @@ 
 #include <rte_vect.h>
 #include <rte_common.h>
 #include <rte_config.h>
+#include <rte_debug.h>
+
+#define RTE_MEMCPY_EX_ARCH_DEFINED
+#include "generic/rte_memcpy.h"
 
 #ifdef __cplusplus
 extern "C" {
@@ -868,6 +873,1239 @@  rte_memcpy(void *dst, const void *src, size_t n)
 		return rte_memcpy_generic(dst, src, n);
 }
 
+/*
+ * Advanced/Non-Temporal Memory Operations.
+ */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Workaround for _mm_stream_load_si128() missing const in the parameter.
+ */
+__rte_experimental
+static __rte_always_inline
+__m128i _mm_stream_load_si128_const(const __m128i *const mem_addr)
+{
+	/* GCC 4.8.5 (in RHEL7) doesn't support the #pragma to ignore "-Wdiscarded-qualifiers".
+	 * So we explicitly type cast mem_addr and use the #pragma to ignore "-Wcast-qual".
+	 */
+#if defined(RTE_TOOLCHAIN_GCC)
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#elif defined(RTE_TOOLCHAIN_CLANG)
+#pragma clang diagnostic push
+#pragma clang diagnostic ignored "-Wcast-qual"
+#endif
+	return _mm_stream_load_si128((__m128i *)mem_addr);
+#if defined(RTE_TOOLCHAIN_GCC)
+#pragma GCC diagnostic pop
+#elif defined(RTE_TOOLCHAIN_CLANG)
+#pragma clang diagnostic pop
+#endif
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Memory copy from non-temporal source area.
+ *
+ * @note
+ * Performance is optimal when the source pointer is 16 byte aligned.
+ *
+ * @param dst
+ *   Pointer to the destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(LEN|SRC)nnnA flags.
+ *   The RTE_MEMOPS_F_SRC_NT flag must be set.
+ *   The RTE_MEMOPS_F_DST_NT flag must be clear.
+ *   The RTE_MEMOPS_F_DSTnnnA flags are ignored.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nts(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_SRC_NT);
+
+	if (unlikely(len == 0))
+		return;
+
+	/* If source is not 16 byte aligned, then copy first part of data via bounce buffer,
+	 * to achieve 16 byte alignment of source pointer.
+	 * This invalidates the source, destination and length alignment flags, and
+	 * potentially makes the destination pointer unaligned.
+	 *
+	 * Omitted if source is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) {
+		/* Source is not known to be 16 byte aligned, but might be. */
+		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+		const size_t    offset = (uintptr_t)src & 15;
+
+		if (offset) {
+			/* Source is not 16 byte aligned. */
+			char            buffer[16] __rte_aligned(16);
+			/** How many bytes is source away from 16 byte alignment
+			 * (ceiling rounding).
+			 */
+			const size_t    first = 16 - offset;
+
+			xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
+			_mm_store_si128((void *)buffer, xmm0);
+
+			/* Test for short length.
+			 *
+			 * Omitted if length is known to be >= 16.
+			 */
+			if (!(__builtin_constant_p(len) && len >= 16) &&
+					unlikely(len <= first)) {
+				/* Short length. */
+				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
+				return;
+			}
+
+			/* Copy until source pointer is 16 byte aligned. */
+			rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), first);
+			src = RTE_PTR_ADD(src, first);
+			dst = RTE_PTR_ADD(dst, first);
+			len -= first;
+		}
+	}
+
+	/* Source pointer is now 16 byte aligned. */
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(dst, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, if any, via bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the alignment
+	 * flags are still valid) and length is known to be 16 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(len & 15)) {
+		char    buffer[16] __rte_aligned(16);
+
+		xmm3 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)buffer, xmm3);
+		rte_mov15_or_less(dst, buffer, len & 15);
+	}
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Memory copy to non-temporal destination area.
+ *
+ * @note
+ * If the destination and/or length is unaligned, the first and/or last copied
+ * bytes will be stored in the destination memory area using temporal access.
+ * @note
+ * Performance is optimal when destination pointer is 16 byte aligned.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(LEN|DST)nnnA flags.
+ *   The RTE_MEMOPS_F_SRC_NT flag must be clear.
+ *   The RTE_MEMOPS_F_DST_NT flag must be set.
+ *   The RTE_MEMOPS_F_SRCnnnA flags are ignored.
+ *   Must be constant at build time.
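+ *
+ * A minimal usage sketch; dst, src and len are application-provided, and
+ * dst is here assumed to be 4 byte aligned:
+ *
+ * @code{.c}
+ * rte_memcpy_ntd(dst, src, len, RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_DST4A);
+ * @endcode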
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ntd(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_DST_NT);
+
+	if (unlikely(len == 0))
+		return;
+
+	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ||
+			len >= 16) {
+		/* Length >= 16 and/or destination is known to be 16 byte aligned. */
+		register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+		/* If destination is not 16 byte aligned, then copy first part of data,
+		 * to achieve 16 byte alignment of destination pointer.
+		 * This invalidates the source, destination and length alignment flags, and
+		 * potentially makes the source pointer unaligned.
+		 *
+		 * Omitted if destination is known to be 16 byte aligned.
+		 */
+		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)) {
+			/* Destination is not known to be 16 byte aligned, but might be. */
+			/** How many bytes is destination offset from 16 byte alignment
+			 * (floor rounding).
+			 */
+			const size_t    offset = (uintptr_t)dst & 15;
+
+			if (offset) {
+				/* Destination is not 16 byte aligned. */
+				/** How many bytes is destination away from 16 byte alignment
+				 * (ceiling rounding).
+				 */
+				const size_t    first = 16 - offset;
+
+				if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
+						(offset & 3) == 0) {
+					/* Destination is (known to be) 4 byte aligned. */
+					int32_t r0, r1, r2;
+
+					/* Copy until destination pointer is 16 byte aligned. */
+					if (first & 8) {
+						memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
+						memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
+						src = RTE_PTR_ADD(src, 8);
+						dst = RTE_PTR_ADD(dst, 8);
+						len -= 8;
+					}
+					if (first & 4) {
+						memcpy(&r2, src, 4);
+						_mm_stream_si32(dst, r2);
+						src = RTE_PTR_ADD(src, 4);
+						dst = RTE_PTR_ADD(dst, 4);
+						len -= 4;
+					}
+				} else {
+					/* Destination is not 4 byte aligned. */
+					/* Copy until destination pointer is 16 byte aligned. */
+					rte_mov15_or_less(dst, src, first);
+					src = RTE_PTR_ADD(src, first);
+					dst = RTE_PTR_ADD(dst, first);
+					len -= first;
+				}
+			}
+		}
+
+		/* Destination pointer is now 16 byte aligned. */
+		RTE_ASSERT(rte_is_aligned(dst, 16));
+
+		/* Copy large portion of data in chunks of 64 byte. */
+		while (len >= 64) {
+			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
+			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
+			xmm2 = _mm_loadu_si128(RTE_PTR_ADD(src, 2 * 16));
+			xmm3 = _mm_loadu_si128(RTE_PTR_ADD(src, 3 * 16));
+			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+			src = RTE_PTR_ADD(src, 64);
+			dst = RTE_PTR_ADD(dst, 64);
+			len -= 64;
+		}
+
+		/* Copy following 32 and 16 byte portions of data.
+		 *
+		 * Omitted if destination is known to be 16 byte aligned (so the alignment
+		 * flags are still valid)
+		 * and length is known to be respectively 64 or 32 byte aligned.
+		 */
+		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+				(len & 32)) {
+			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
+			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
+			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+			src = RTE_PTR_ADD(src, 32);
+			dst = RTE_PTR_ADD(dst, 32);
+		}
+		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+				(len & 16)) {
+			xmm2 = _mm_loadu_si128(src);
+			_mm_stream_si128(dst, xmm2);
+			src = RTE_PTR_ADD(src, 16);
+			dst = RTE_PTR_ADD(dst, 16);
+		}
+	} else {
+		/* Length <= 15, and
+		 * destination is not known to be 16 byte aligned (but might be).
+		 */
+		/* If destination is not 4 byte aligned, then
+		 * use normal copy and return.
+		 *
+		 * Omitted if destination is known to be 4 byte aligned.
+		 */
+		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) &&
+				!rte_is_aligned(dst, 4)) {
+			/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
+			rte_mov15_or_less(dst, src, len);
+			return;
+		}
+		/* Destination is (known to be) 4 byte aligned. Proceed. */
+	}
+
+	/* Destination pointer is now 4 byte (or 16 byte) aligned. */
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+
+	/* Copy following 8 and 4 byte portions of data.
+	 *
+	 * Omitted if destination is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 16 or 8 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(len & 8)) {
+		int32_t r0, r1;
+
+		memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
+		memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
+		src = RTE_PTR_ADD(src, 8);
+		dst = RTE_PTR_ADD(dst, 8);
+	}
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A)) &&
+			(len & 4)) {
+		int32_t r2;
+
+		memcpy(&r2, src, 4);
+		_mm_stream_si32(dst, r2);
+		src = RTE_PTR_ADD(src, 4);
+		dst = RTE_PTR_ADD(dst, 4);
+	}
+
+	/* Copy remaining 2 and 1 byte portions of data.
+	 *
+	 * Omitted if destination is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 4 and 2 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A)) &&
+			(len & 2)) {
+		int16_t r3;
+
+		memcpy(&r3, src, 2);
+		*(int16_t *)dst = r3;
+		src = RTE_PTR_ADD(src, 2);
+		dst = RTE_PTR_ADD(dst, 2);
+	}
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A)) &&
+			(len & 1))
+		*(char *)dst = *(const char *)src;
+}
+
+/**
+ * Non-temporal memory copy of 15 byte or less
+ * from 16 byte aligned source via bounce buffer.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Only the 4 least significant bits of this parameter are used;
+ *   they hold the number of remaining bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Must be constant at build time.
+ */
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_15_or_less_s16a(void *__rte_restrict dst,
+		const void *__rte_restrict src, size_t len, const uint64_t flags)
+{
+	int32_t             buffer[4] __rte_aligned(16);
+	register __m128i    xmm0;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if ((len & 15) == 0)
+		return;
+
+	/* Non-temporal load into bounce buffer. */
+	xmm0 = _mm_stream_load_si128_const(src);
+	_mm_store_si128((void *)buffer, xmm0);
+
+	/* Store from bounce buffer. */
+	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
+			rte_is_aligned(dst, 4)) {
+		/* Destination is (known to be) 4 byte aligned. */
+		src = (const void *)buffer;
+		if (len & 8) {
+#ifdef RTE_ARCH_X86_64
+			if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) {
+				/* Destination is known to be 8 byte aligned. */
+				_mm_stream_si64(dst, *(const int64_t *)src);
+			} else {
+#endif /* RTE_ARCH_X86_64 */
+				_mm_stream_si32(RTE_PTR_ADD(dst, 0), buffer[0]);
+				_mm_stream_si32(RTE_PTR_ADD(dst, 4), buffer[1]);
+#ifdef RTE_ARCH_X86_64
+			}
+#endif /* RTE_ARCH_X86_64 */
+			src = RTE_PTR_ADD(src, 8);
+			dst = RTE_PTR_ADD(dst, 8);
+		}
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A) &&
+				(len & 4)) {
+			_mm_stream_si32(dst, *(const int32_t *)src);
+			src = RTE_PTR_ADD(src, 4);
+			dst = RTE_PTR_ADD(dst, 4);
+		}
+
+		/* Non-temporal store is unavailable for the remaining 3 byte or less. */
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) &&
+				(len & 2)) {
+			*(int16_t *)dst = *(const int16_t *)src;
+			src = RTE_PTR_ADD(src, 2);
+			dst = RTE_PTR_ADD(dst, 2);
+		}
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A) &&
+				(len & 1)) {
+			*(char *)dst = *(const char *)src;
+		}
+	} else {
+		/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
+		rte_mov15_or_less(dst, (const void *)buffer, len & 15);
+	}
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 16 byte aligned addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 16 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d16s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 16));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_stream_si128(dst, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
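+	 * The source and destination alignment flags passed on are upgraded to
+	 * (at least) 16 byte, which is guaranteed to hold at this point.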
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ?
+				flags : RTE_MEMOPS_F_DST16A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+
+#ifdef RTE_ARCH_X86_64
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 8/16 byte aligned destination/source addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 8 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d8s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	int64_t             buffer[8] __rte_cache_aligned /* at least __rte_aligned(16) */;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 8));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
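+	/* The 16 byte non-temporal loads are bounced through a 16 byte aligned
+	 * buffer, because the 8 byte aligned destination only permits 8 byte
+	 * non-temporal stores (_mm_stream_si64).
+	 */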
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
+		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
+		_mm_store_si128((void *)&buffer[3 * 2], xmm3);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 4 * 8), buffer[4]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 5 * 8), buffer[5]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 6 * 8), buffer[6]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 7 * 8), buffer[7]);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[4]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[5]);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) ?
+				flags : RTE_MEMOPS_F_DST8A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+#endif /* RTE_ARCH_X86_64 */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 4/16 byte aligned destination/source addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 4 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d4s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	int32_t             buffer[16] __rte_cache_aligned /* at least __rte_aligned(16) */;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
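+	/* Like in rte_memcpy_nt_d8s16a(), the 16 byte non-temporal loads are
+	 * bounced through an aligned buffer; the 4 byte aligned destination only
+	 * permits 4 byte non-temporal stores (_mm_stream_si32).
+	 */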
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
+		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
+		_mm_store_si128((void *)&buffer[3 * 4], xmm3);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  0 * 4), buffer[0]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  1 * 4), buffer[1]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  2 * 4), buffer[2]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  3 * 4), buffer[3]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  4 * 4), buffer[4]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  5 * 4), buffer[5]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  6 * 4), buffer[6]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  7 * 4), buffer[7]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  8 * 4), buffer[8]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  9 * 4), buffer[9]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 10 * 4), buffer[10]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 11 * 4), buffer[11]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 12 * 4), buffer[12]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 13 * 4), buffer[13]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 14 * 4), buffer[14]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 15 * 4), buffer[15]);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 4 * 4), buffer[4]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 5 * 4), buffer[5]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 6 * 4), buffer[6]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 7 * 4), buffer[7]);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[8]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[9]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[10]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[11]);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ?
+				flags : RTE_MEMOPS_F_DST4A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 4 byte aligned addresses (non-temporal) memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the (non-temporal) destination memory area.
+ *   Must be 4 byte aligned if using non-temporal store.
+ * @param src
+ *   Pointer to the (non-temporal) source memory area.
+ *   Must be 4 byte aligned if using non-temporal load.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d4s4a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+	const size_t    offset = (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A ?
+			0 : (uintptr_t)src & 15;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+	RTE_ASSERT(rte_is_aligned(src, 4));
+
+	if (unlikely(len == 0))
+		return;
+
+	if (offset == 0) {
+		/* Source is 16 byte aligned. */
+		/* Copy everything, using upgraded source alignment flags. */
+		rte_memcpy_nt_d4s16a(dst, src, len,
+				(flags & ~RTE_MEMOPS_F_SRCA_MASK) | RTE_MEMOPS_F_SRC16A);
+	} else {
+		/* Source is not 16 byte aligned, so make it 16 byte aligned. */
+		int32_t             buffer[4] __rte_aligned(16);
+		const size_t        first = 16 - offset;
+		register __m128i    xmm0;
+
+		/* First, copy first part of data in chunks of 4 byte,
+		 * to achieve 16 byte alignment of source.
+		 * This invalidates the source, destination and length alignment flags, and
+		 * potentially makes the destination pointer 16 byte unaligned/aligned.
+		 */
+
+		/** Copy from 16 byte aligned source pointer (floor rounding). */
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
+		_mm_store_si128((void *)buffer, xmm0);
+
+		if (unlikely(len + offset <= 16)) {
+			/* Short length. */
+			if (((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ||
+					(len & 3) == 0) {
+				/* Length is 4 byte aligned. */
+				switch (len) {
+				case 1 * 4:
+					/* Offset can be 1 * 4, 2 * 4 or 3 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
+							buffer[offset / 4]);
+					break;
+				case 2 * 4:
+					/* Offset can be 1 * 4 or 2 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
+							buffer[offset / 4]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4),
+							buffer[offset / 4 + 1]);
+					break;
+				case 3 * 4:
+					/* Offset can only be 1 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
+					break;
+				}
+			} else {
+				/* Length is not 4 byte aligned. */
+				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
+			}
+			return;
+		}
+
+		switch (first) {
+		case 1 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
+			break;
+		case 2 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
+			break;
+		case 3 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
+			break;
+		}
+
+		src = RTE_PTR_ADD(src, first);
+		dst = RTE_PTR_ADD(dst, first);
+		len -= first;
+
+		/* Source pointer is now 16 byte aligned. */
+		RTE_ASSERT(rte_is_aligned(src, 16));
+
+		/* Then, copy the rest, using corrected alignment flags. */
+		if (rte_is_aligned(dst, 16))
+			rte_memcpy_nt_d16s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+#ifdef RTE_ARCH_X86_64
+		else if (rte_is_aligned(dst, 8))
+			rte_memcpy_nt_d8s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST8A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+#endif /* RTE_ARCH_X86_64 */
+		else
+			rte_memcpy_nt_d4s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST4A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+	}
+}
+
+#ifndef RTE_MEMCPY_NT_BUFSIZE
+
+#include <lib/mbuf/rte_mbuf_core.h>
+
+/** Bounce buffer size for non-temporal memcpy.
+ *
+ * Must be 2^N and >= 128.
+ * The actual buffer will be slightly larger, due to added padding.
+ * The default is chosen to be able to handle a non-segmented packet.
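+ * The definition is guarded by #ifndef, so the application can override it
+ * at build time.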
+ */
+#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM
+
+#endif  /* RTE_MEMCPY_NT_BUFSIZE */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Non-temporal memory copy via bounce buffer.
+ *
+ * @note
+ * If the destination and/or length is unaligned, the first and/or last copied
+ * bytes will be stored in the destination memory area using temporal access.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
+ * @param flags
+ *   Hints for memory access.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_buf(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	/** Cache line aligned bounce buffer with preceding and trailing padding.
+	 *
+	 * The preceding padding is one cache line, so the data area itself
+	 * is cache line aligned.
+	 * The trailing padding is 16 bytes, leaving room for the trailing bytes
+	 * of a 16 byte store operation.
+	 */
+	char			buffer[RTE_CACHE_LINE_SIZE + RTE_MEMCPY_NT_BUFSIZE +  16]
+				__rte_cache_aligned;
+	/** Pointer to bounce buffer's aligned data area. */
+	char		* const buf0 = &buffer[RTE_CACHE_LINE_SIZE];
+	void		       *buf;
+	/** Number of bytes to copy from source, incl. any extra preceding bytes. */
+	size_t			srclen;
+	register __m128i	xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Step 1:
+	 * Copy data from the source to the bounce buffer's aligned data area,
+	 * using aligned non-temporal load from the source,
+	 * and unaligned store in the bounce buffer.
+	 *
+	 * If the source is unaligned, the additional bytes preceding the data will be copied
+	 * to the padding area preceding the bounce buffer's aligned data area.
+	 * Similarly, if the source data ends at an unaligned address, the additional bytes
+	 * trailing the data will be copied to the padding area trailing the bounce buffer's
+	 * aligned data area.
+	 */
+
+	/* Adjust for extra preceding bytes, unless source is known to be 16 byte aligned. */
+	if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) {
+		buf = buf0;
+		srclen = len;
+	} else {
+		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+		const size_t offset = (uintptr_t)src & 15;
+
+		buf = RTE_PTR_SUB(buf0, offset);
+		src = RTE_PTR_SUB(src, offset);
+		srclen = len + offset;
+	}
+
+	/* Copy large portion of data from source to bounce buffer in chunks of 64 byte. */
+	while (srclen >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		buf = RTE_PTR_ADD(buf, 64);
+		srclen -= 64;
+	}
+
+	/* Copy remaining 32 and 16 byte portions of data from source to bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the length alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+			(srclen & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		buf = RTE_PTR_ADD(buf, 32);
+	}
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+			(srclen & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(buf, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		buf = RTE_PTR_ADD(buf, 16);
+	}
+	/* Copy any trailing bytes of data from source to bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the length alignment
+	 * flags are still valid)
+	 * and length is known to be 16 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(srclen & 15)) {
+		xmm3 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(buf, xmm3);
+	}
+
+	/* Step 2:
+	 * Copy from the aligned bounce buffer to the non-temporal destination.
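+	 * The RTE_MEMOPS_F_SRC_NT flag is cleared, because the bounce buffer is
+	 * cache hot, and the source alignment flag is raised to cache line
+	 * alignment, matching the bounce buffer's aligned data area.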
+	 */
+	rte_memcpy_ntd(dst, buf0, len,
+			(flags & ~(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRCA_MASK)) |
+			(RTE_CACHE_LINE_SIZE << RTE_MEMOPS_F_SRCA_SHIFT));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @note
+ * If the destination and/or length is unaligned, some copied bytes will be
+ * stored in the destination memory area using temporal access.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_generic(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+
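+	/* Copy in chunks of RTE_MEMCPY_NT_BUFSIZE, which is 2^N and >= 128, so
+	 * the length alignment flag can be upgraded to 128 byte for the full
+	 * chunks; the last, possibly shorter, chunk keeps the original flags.
+	 */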
+	while (len > RTE_MEMCPY_NT_BUFSIZE) {
+		rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE,
+				(flags & ~RTE_MEMOPS_F_LENA_MASK) | RTE_MEMOPS_F_LEN128A);
+		dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
+		src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
+		len -= RTE_MEMCPY_NT_BUFSIZE;
+	}
+	rte_memcpy_nt_buf(dst, src, len, flags);
+}
+
+/* Implementation. Refer to function declaration for documentation. */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	if ((flags & (RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) ==
+			(RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) {
+		/* Copy between non-temporal source and destination. */
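+		/* Dispatch to the most specialized variant permitted by the
+		 * alignment flags, falling back to the bounce buffer variants.
+		 */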
+		if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d16s16a(dst, src, len, flags);
+#ifdef RTE_ARCH_X86_64
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d8s16a(dst, src, len, flags);
+#endif /* RTE_ARCH_X86_64 */
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d4s16a(dst, src, len, flags);
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
+			rte_memcpy_nt_d4s4a(dst, src, len, flags);
+		else if (len <= RTE_MEMCPY_NT_BUFSIZE)
+			rte_memcpy_nt_buf(dst, src, len, flags);
+		else
+			rte_memcpy_nt_generic(dst, src, len, flags);
+	} else if (flags & RTE_MEMOPS_F_SRC_NT) {
+		/* Copy from non-temporal source. */
+		rte_memcpy_nts(dst, src, len, flags);
+	} else if (flags & RTE_MEMOPS_F_DST_NT) {
+		/* Copy to non-temporal destination. */
+		rte_memcpy_ntd(dst, src, len, flags);
+	} else
+		rte_memcpy(dst, src, len);
+}
+
 #undef ALIGNMENT_MASK
 
 #if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index a2307cebe6..aa96fb4cc8 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -660,6 +660,83 @@  rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	return mc;
 }
 
+/* Create a deep copy of mbuf, using non-temporal memory access */
+struct rte_mbuf *
+rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
+		 uint32_t off, uint32_t len, const uint64_t flags)
+{
+	const struct rte_mbuf *seg = m;
+	struct rte_mbuf *mc, *m_last, **prev;
+
+	/* garbage in check */
+	__rte_mbuf_sanity_check(m, 1);
+
+	/* check for request to copy at offset past end of mbuf */
+	if (unlikely(off >= m->pkt_len))
+		return NULL;
+
+	mc = rte_pktmbuf_alloc(mp);
+	if (unlikely(mc == NULL))
+		return NULL;
+
+	/* truncate requested length to available data */
+	if (len > m->pkt_len - off)
+		len = m->pkt_len - off;
+
+	__rte_pktmbuf_copy_hdr(mc, m);
+
+	/* copied mbuf is not indirect or external */
+	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
+
+	prev = &mc->next;
+	m_last = mc;
+	while (len > 0) {
+		uint32_t copy_len;
+
+		/* skip leading mbuf segments */
+		while (off >= seg->data_len) {
+			off -= seg->data_len;
+			seg = seg->next;
+		}
+
+		/* current buffer is full, chain a new one */
+		if (rte_pktmbuf_tailroom(m_last) == 0) {
+			m_last = rte_pktmbuf_alloc(mp);
+			if (unlikely(m_last == NULL)) {
+				rte_pktmbuf_free(mc);
+				return NULL;
+			}
+			++mc->nb_segs;
+			*prev = m_last;
+			prev = &m_last->next;
+		}
+
+		/*
+		 * copy the min of data in input segment (seg)
+		 * vs space available in output (m_last)
+		 */
+		copy_len = RTE_MIN(seg->data_len - off, len);
+		if (copy_len > rte_pktmbuf_tailroom(m_last))
+			copy_len = rte_pktmbuf_tailroom(m_last);
+
+		/* append from seg to m_last */
+		rte_memcpy_ex(rte_pktmbuf_mtod_offset(m_last, char *,
+						   m_last->data_len),
+			   rte_pktmbuf_mtod_offset(seg, char *, off),
+			   copy_len, flags);
+
+		/* update offsets and lengths */
+		m_last->data_len += copy_len;
+		mc->pkt_len += copy_len;
+		off += copy_len;
+		len -= copy_len;
+	}
+
+	/* garbage out check */
+	__rte_mbuf_sanity_check(mc, 1);
+	return mc;
+}
+
 /* dump a mbuf on console */
 void
 rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index b6e23d98ce..030df396a3 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1443,6 +1443,38 @@  struct rte_mbuf *
 rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t offset, uint32_t length);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Create a full copy of a given packet mbuf,
+ * using non-temporal memory access as specified by flags.
+ *
+ * Copies all the data from a given packet mbuf to a newly allocated
+ * set of mbufs. The private data is not copied.
+ *
+ * @param m
+ *   The packet mbuf to be copied.
+ * @param mp
+ *   The mempool from which the "clone" mbufs are allocated.
+ * @param offset
+ *   The number of bytes to skip before copying.
+ *   If the mbuf does not have that many bytes, it is an error
+ *   and NULL is returned.
+ * @param length
+ *   The upper limit on bytes to copy.  Passing UINT32_MAX
+ *   means all data (after offset).
+ * @param flags
+ *   Non-temporal memory access hints for rte_memcpy_ex.
+ * @return
+ *   - The pointer to the new "clone" mbuf on success.
+ *   - NULL if allocation fails.
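+ *
+ * A usage sketch, mirroring the packet capture use case in this patch;
+ * md and mp are application-provided:
+ *
+ * @code{.c}
+ * mc = rte_pktmbuf_copy_ex(md, mp, 0, UINT32_MAX,
+ *			    RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
+ * @endcode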
+ */
+__rte_experimental
+struct rte_mbuf *
+rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
+		    uint32_t offset, uint32_t length, const uint64_t flags);
+
 /**
  * Adds given value to the refcnt of all packet mbuf segments.
  *
diff --git a/lib/mbuf/version.map b/lib/mbuf/version.map
index ed486ed14e..b583364ad4 100644
--- a/lib/mbuf/version.map
+++ b/lib/mbuf/version.map
@@ -47,5 +47,6 @@  EXPERIMENTAL {
 	global:
 
+	rte_pktmbuf_copy_ex;
 	rte_pktmbuf_pool_create_extbuf;
 
 };
diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
index af2b814251..ae871c4865 100644
--- a/lib/pcapng/rte_pcapng.c
+++ b/lib/pcapng/rte_pcapng.c
@@ -466,7 +466,8 @@  rte_pcapng_copy(uint16_t port_id, uint32_t queue,
 	orig_len = rte_pktmbuf_pkt_len(md);
 
 	/* Take snapshot of the data */
-	mc = rte_pktmbuf_copy(md, mp, 0, length);
+	mc = rte_pktmbuf_copy_ex(md, mp, 0, length,
+				 RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
 	if (unlikely(mc == NULL))
 		return NULL;
 
diff --git a/lib/pdump/rte_pdump.c b/lib/pdump/rte_pdump.c
index 98dcbc037b..6e61c75407 100644
--- a/lib/pdump/rte_pdump.c
+++ b/lib/pdump/rte_pdump.c
@@ -124,7 +124,8 @@  pdump_copy(uint16_t port_id, uint16_t queue,
 					    pkts[i], mp, cbs->snaplen,
 					    ts, direction);
 		else
-			p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
+			p = rte_pktmbuf_copy_ex(pkts[i], mp, 0, cbs->snaplen,
+						RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
 
 		if (unlikely(p == NULL))
 			__atomic_fetch_add(&stats->nombuf, 1, __ATOMIC_RELAXED);
@@ -134,6 +135,9 @@  pdump_copy(uint16_t port_id, uint16_t queue,
 
 	__atomic_fetch_add(&stats->accepted, d_pkts, __ATOMIC_RELAXED);
 
+	/* Flush the non-temporal stores of the packet copies before enqueuing. */
+	rte_wmb();
+
 	ring_enq = rte_ring_enqueue_burst(ring, (void *)dup_bufs, d_pkts, NULL);
 	if (unlikely(ring_enq < d_pkts)) {
 		unsigned int drops = d_pkts - ring_enq;