net: add support for AVX512 when generating CRC

Message ID 1599739271-16605-1-git-send-email-mairtin.oloingsigh@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series net: add support for AVX512 when generating CRC

Checks

Context                  Check    Description
ci/checkpatch            warning  coding style issues
ci/Performance-Testing   fail     build patch failure
ci/travis-robot          success  Travis build: passed
ci/Intel-compilation     fail     Compilation issues

Commit Message

Mairtin o Loingsigh Sept. 10, 2020, 12:01 p.m. UTC
  This patch enables the generation of CRC using the AVX512 instruction
set when it is available on the host platform.

Signed-off-by: Mairtin o Loingsigh <mairtin.oloingsigh@intel.com>
---

v1:
* Initial version, with AVX512 support for CRC32 Ethernet only
 (requires further updates)
  * AVX512 support for CRC16-CCITT and final implementation of
    CRC32 Ethernet will be added in v2
---
 doc/guides/rel_notes/release_20_11.rst |    4 +
 lib/librte_net/net_crc_avx.h           |  331 ++++++++++++++++++++++++++++++++
 lib/librte_net/rte_net_crc.c           |   23 ++-
 lib/librte_net/rte_net_crc.h           |    1 +
 4 files changed, 358 insertions(+), 1 deletions(-)
 create mode 100644 lib/librte_net/net_crc_avx.h
  

Comments

Bruce Richardson Sept. 10, 2020, 12:27 p.m. UTC | #1
On Thu, Sep 10, 2020 at 01:01:11PM +0100, Mairtin o Loingsigh wrote:
> This patch enables the generation of CRC using AVX512 instruction
> set when available on the host platform.
> 
> Signed-off-by: Mairtin o Loingsigh <mairtin.oloingsigh@intel.com>
> ---
> 
> v1:
> * Initial version, with AVX512 support for CRC32 Ethernet only
>  (requires further updates)
>   * AVX512 support for CRC16-CCITT and final implementation of
>     CRC32 Ethernet will be added in v2
> ---
>  doc/guides/rel_notes/release_20_11.rst |    4 +
>  lib/librte_net/net_crc_avx.h           |  331 ++++++++++++++++++++++++++++++++
>  lib/librte_net/rte_net_crc.c           |   23 ++-
>  lib/librte_net/rte_net_crc.h           |    1 +
>  4 files changed, 358 insertions(+), 1 deletions(-)
>  create mode 100644 lib/librte_net/net_crc_avx.h
> 
<snip>
> --- a/lib/librte_net/rte_net_crc.c
> +++ b/lib/librte_net/rte_net_crc.c
> @@ -10,12 +10,18 @@
>  #include <rte_common.h>
>  #include <rte_net_crc.h>
>  
> -#if defined(RTE_ARCH_X86_64) && defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ)
> +#if defined(RTE_ARCH_X86_64) && defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ) \
> +	&& defined(RTE_MACHINE_CPUFLAG_AVX512F)
> +#define X86_64_AVX512F_PCLMULQDQ     1
> +#elif defined(RTE_ARCH_X86_64) && defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ)

This all seems to be build-time selection of path. Can you perhaps
investigate adding runtime selection instead, so that this can be used from
distro packages, or DPDK compiled on older systems but used on newer.
See also patchset: http://patches.dpdk.org/project/dpdk/list/?series=11831
which is relevant to this too.

/Bruce
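
A minimal sketch of the runtime-dispatch idea, assuming DPDK's
rte_cpu_get_flag_enabled() and the handler tables added by this patch
(names are illustrative, not the submitted code):

	/* Select the CRC handlers from what the running CPU reports,
	 * independently of the flags the library was built with.
	 */
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
	    rte_cpu_get_flag_enabled(RTE_CPUFLAG_PCLMULQDQ)) {
		handlers = handlers_avx512;
		rte_net_crc_avx512_init();
	} else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_PCLMULQDQ)) {
		handlers = handlers_sse42;
		rte_net_crc_sse42_init();
	} else {
		handlers = handlers_scalar;
	}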
  
Mairtin o Loingsigh Sept. 10, 2020, 12:52 p.m. UTC | #2
> -----Original Message-----
> From: Bruce Richardson <bruce.richardson@intel.com>
> Sent: Thursday, September 10, 2020 1:28 PM
> To: O'loingsigh, Mairtin <mairtin.oloingsigh@intel.com>
> Cc: Singh, Jasvinder <jasvinder.singh@intel.com>; dev@dpdk.org; Ryan,
> Brendan <brendan.ryan@intel.com>; Coyle, David <david.coyle@intel.com>;
> De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] net: add support for AVX512 when
> generating CRC
> 
> On Thu, Sep 10, 2020 at 01:01:11PM +0100, Mairtin o Loingsigh wrote:
> > This patch enables the generation of CRC using AVX512 instruction set
> > when available on the host platform.
> >
> > Signed-off-by: Mairtin o Loingsigh <mairtin.oloingsigh@intel.com>
> > ---
> >
> > v1:
> > * Initial version, with AVX512 support for CRC32 Ethernet only
> > (requires further updates)
> >   * AVX512 support for CRC16-CCITT and final implementation of
> >     CRC32 Ethernet will be added in v2
> > ---
> >  doc/guides/rel_notes/release_20_11.rst |    4 +
> >  lib/librte_net/net_crc_avx.h           |  331
> ++++++++++++++++++++++++++++++++
> >  lib/librte_net/rte_net_crc.c           |   23 ++-
> >  lib/librte_net/rte_net_crc.h           |    1 +
> >  4 files changed, 358 insertions(+), 1 deletions(-)  create mode
> > 100644 lib/librte_net/net_crc_avx.h
> >
> <snip>
> > --- a/lib/librte_net/rte_net_crc.c
> > +++ b/lib/librte_net/rte_net_crc.c
> > @@ -10,12 +10,18 @@
> >  #include <rte_common.h>
> >  #include <rte_net_crc.h>
> >
> > -#if defined(RTE_ARCH_X86_64) &&
> > defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ)
> > +#if defined(RTE_ARCH_X86_64) &&
> defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ) \
> > +	&& defined(RTE_MACHINE_CPUFLAG_AVX512F)
> > +#define X86_64_AVX512F_PCLMULQDQ     1
> > +#elif defined(RTE_ARCH_X86_64) &&
> > +defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ)
> 
> This all seems to be build-time selection of path. Can you perhaps investigate
> adding runtime selection instead, so that this can be used from distro
> packages, or DPDK compiled on older systems but used on newer.
> See also patchset: http://patches.dpdk.org/project/dpdk/list/?series=11831
> which is relevant to this too.
> 
> /Bruce

Sure, I will look at options for runtime selection of the intrinsics path.
  
De Lara Guarch, Pablo Sept. 11, 2020, 9:57 a.m. UTC | #3
Hi Mairtin,

> -----Original Message-----
> From: O'loingsigh, Mairtin <mairtin.oloingsigh@intel.com>
> Sent: Thursday, September 10, 2020 1:01 PM
> To: Singh, Jasvinder <jasvinder.singh@intel.com>
> Cc: dev@dpdk.org; Ryan, Brendan <brendan.ryan@intel.com>; Coyle, David
> <david.coyle@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; O'loingsigh, Mairtin
> <mairtin.oloingsigh@intel.com>
> Subject: [PATCH] net: add support for AVX512 when generating CRC
> 
> This patch enables the generation of CRC using AVX512 instruction set when
> available on the host platform.
> 
> Signed-off-by: Mairtin o Loingsigh <mairtin.oloingsigh@intel.com>
> ---
> 
> v1:
> * Initial version, with AVX512 support for CRC32 Ethernet only  (requires further
> updates)
>   * AVX512 support for CRC16-CCITT and final implementation of
>     CRC32 Ethernet will be added in v2
> ---
>  doc/guides/rel_notes/release_20_11.rst |    4 +
>  lib/librte_net/net_crc_avx.h           |  331 ++++++++++++++++++++++++++++++++
>  lib/librte_net/rte_net_crc.c           |   23 ++-
>  lib/librte_net/rte_net_crc.h           |    1 +
>  4 files changed, 358 insertions(+), 1 deletions(-)  create mode 100644
> lib/librte_net/net_crc_avx.h
> 
> diff --git a/doc/guides/rel_notes/release_20_11.rst
> b/doc/guides/rel_notes/release_20_11.rst
> index df227a1..d6a84ca 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -55,6 +55,10 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
> 
> +* **Added support for AVX512 in rte_net CRC calculations.**
> +
> +  Added new CRC32 calculation code using AVX512 instruction set  Added
> + new CRC16-CCITT calculation code using AVX512 instruction set
> 
>  Removed Items
>  -------------
> diff --git a/lib/librte_net/net_crc_avx.h b/lib/librte_net/net_crc_avx.h new file
> mode 100644 index 0000000..d9481d5
> --- /dev/null
> +++ b/lib/librte_net/net_crc_avx.h

...

> +static __rte_always_inline uint32_t
> +crc32_eth_calc_pclmulqdq(
> +	const uint8_t *data,
> +	uint32_t data_len,
> +	uint32_t crc,
> +	const struct crc_pclmulqdq512_ctx *params) {
> +	__m256i b;
> +	__m512i temp, k;
> +	__m512i qw0 = _mm512_set1_epi64(0);
> +	__m512i fold0;
> +	uint32_t n;

This is loading 64 bytes of data, but it seems like only 16 are available, right? Should we use _mm_loadu_si128?
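
A possible shape for that 16-byte path, using _mm_loadu_si128 and zeroed
upper lanes (illustrative only, not the submitted code):

	/* Load exactly 16 bytes; the upper 48 bytes of fold0 stay zero */
	__m128i d16 = _mm_loadu_si128((const __m128i *)data);
	fold0 = _mm512_inserti32x4(_mm512_setzero_si512(), d16, 0);
	fold0 = _mm512_xor_si512(fold0, temp);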

> +			fold0 = _mm512_xor_si512(fold0, temp);
> +			goto reduction_128_64;
> +		}
> +
> +		if (unlikely(data_len < 16)) {
> +			/* 0 to 15 bytes */
> +			uint8_t buffer[16] __rte_aligned(16);
> +
> +			memset(buffer, 0, sizeof(buffer));
> +			memcpy(buffer, data, data_len);

I would use _mm_maskz_loadu_epi8, passing a mask register with ((1 << data_len) - 1).
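
A sketch of that masked load (requires AVX512BW and AVX512VL; illustrative
only):

	/* Load only data_len bytes (0-15); the remaining lanes are zeroed */
	__mmask16 m = (__mmask16)((1U << data_len) - 1);
	__m128i d = _mm_maskz_loadu_epi8(m, data);
	fold0 = _mm512_inserti32x4(_mm512_setzero_si512(), d, 0);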

> +
> +			fold0 = _mm512_load_si512((const __m128i *)buffer);
> +			fold0 = _mm512_xor_si512(fold0, temp);
> +			if (unlikely(data_len < 4)) {
> +				fold0 = xmm_shift_left(fold0, 8 - data_len);
> +				goto barret_reduction;
> +			}
> +			fold0 = xmm_shift_left(fold0, 16 - data_len);
> +			goto reduction_128_64;
> +		}
> +		/* 17 to 31 bytes */
> +		fold0 = _mm512_loadu_si512((const __m512i *)data);

Same here. Looks like you are loading too much data?
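
The same masked-load idea scales to this wider case (AVX512BW; illustrative
only):

	/* Load only the data_len bytes that are present (data_len < 64) */
	__mmask64 m = ((__mmask64)1 << data_len) - 1;
	fold0 = _mm512_maskz_loadu_epi8(m, data);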

> +		fold0 = _mm512_xor_si512(fold0, temp);
> +		n = 16;
> +		k = params->rk1_rk2;
> +		goto partial_bytes;
> +	}

...

> +
> +		fold0 = _mm512_xor_si512(fold0, temp);
> +		fold0 = _mm512_xor_si512(fold0, b);

You could use _mm512_ternarylogic_epi64 with 0x96 to do the 2x XORs in one instruction.
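
That is, roughly (illustrative):

	/* imm8 0x96 encodes A ^ B ^ C, folding both XORs into one op */
	fold0 = _mm512_ternarylogic_epi64(fold0, temp, b, 0x96);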

> +	}
> +
> +	/** Reduction 128 -> 32 Assumes: fold holds 128bit folded data */
> +reduction_128_64:
> +	k = params->rk5_rk6;
> +
> +barret_reduction:
> +	k = params->rk7_rk8;
> +	n = crcr32_reduce_64_to_32(fold0, k);
> +
> +	return n;
> +}
> +
> +
  
Mairtin o Loingsigh Sept. 29, 2020, 3:45 p.m. UTC | #4
Hi,

> -----Original Message-----
> From: Bruce Richardson <bruce.richardson@intel.com>
> Sent: Thursday, September 10, 2020 1:28 PM
> To: O'loingsigh, Mairtin <mairtin.oloingsigh@intel.com>
> Cc: Singh, Jasvinder <jasvinder.singh@intel.com>; dev@dpdk.org; Ryan,
> Brendan <brendan.ryan@intel.com>; Coyle, David <david.coyle@intel.com>;
> De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] net: add support for AVX512 when
> generating CRC
> 
> On Thu, Sep 10, 2020 at 01:01:11PM +0100, Mairtin o Loingsigh wrote:
> > This patch enables the generation of CRC using AVX512 instruction set
> > when available on the host platform.
> >
> > Signed-off-by: Mairtin o Loingsigh <mairtin.oloingsigh@intel.com>
> > ---
> >
> > v1:
> > * Initial version, with AVX512 support for CRC32 Ethernet only
> > (requires further updates)
> >   * AVX512 support for CRC16-CCITT and final implementation of
> >     CRC32 Ethernet will be added in v2
> > ---
> >  doc/guides/rel_notes/release_20_11.rst |    4 +
> >  lib/librte_net/net_crc_avx.h           |  331
> ++++++++++++++++++++++++++++++++
> >  lib/librte_net/rte_net_crc.c           |   23 ++-
> >  lib/librte_net/rte_net_crc.h           |    1 +
> >  4 files changed, 358 insertions(+), 1 deletions(-)  create mode
> > 100644 lib/librte_net/net_crc_avx.h
> >
> <snip>
> > --- a/lib/librte_net/rte_net_crc.c
> > +++ b/lib/librte_net/rte_net_crc.c
> > @@ -10,12 +10,18 @@
> >  #include <rte_common.h>
> >  #include <rte_net_crc.h>
> >
> > -#if defined(RTE_ARCH_X86_64) &&
> > defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ)
> > +#if defined(RTE_ARCH_X86_64) &&
> defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ) \
> > +	&& defined(RTE_MACHINE_CPUFLAG_AVX512F)
> > +#define X86_64_AVX512F_PCLMULQDQ     1
> > +#elif defined(RTE_ARCH_X86_64) &&
> > +defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ)
> 
> This all seems to be build-time selection of path. Can you perhaps investigate
> adding runtime selection instead, so that this can be used from distro
> packages, or DPDK compiled on older systems but used on newer.
> See also patchset: http://patches.dpdk.org/project/dpdk/list/?series=11831
> which is relevant to this too.
> 
> /Bruce

We have added a runtime check in v3 of the patch, which we have submitted.

Mairtin
  
Mairtin o Loingsigh Sept. 29, 2020, 3:47 p.m. UTC | #5
Hi,

> -----Original Message-----
> From: De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>
> Sent: Friday, September 11, 2020 10:58 AM
> To: O'loingsigh, Mairtin <mairtin.oloingsigh@intel.com>; Singh, Jasvinder
> <jasvinder.singh@intel.com>
> Cc: dev@dpdk.org; Ryan, Brendan <brendan.ryan@intel.com>; Coyle, David
> <david.coyle@intel.com>
> Subject: RE: [PATCH] net: add support for AVX512 when generating CRC
> 
> Hi Mairtin,
> 
> > -----Original Message-----
> > From: O'loingsigh, Mairtin <mairtin.oloingsigh@intel.com>
> > Sent: Thursday, September 10, 2020 1:01 PM
> > To: Singh, Jasvinder <jasvinder.singh@intel.com>
> > Cc: dev@dpdk.org; Ryan, Brendan <brendan.ryan@intel.com>; Coyle,
> David
> > <david.coyle@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>; O'loingsigh, Mairtin
> > <mairtin.oloingsigh@intel.com>
> > Subject: [PATCH] net: add support for AVX512 when generating CRC
> >
> > This patch enables the generation of CRC using AVX512 instruction set
> > when available on the host platform.
> >
> > Signed-off-by: Mairtin o Loingsigh <mairtin.oloingsigh@intel.com>
> > ---
> >
> > v1:
> > * Initial version, with AVX512 support for CRC32 Ethernet only
> > (requires further
> > updates)
> >   * AVX512 support for CRC16-CCITT and final implementation of
> >     CRC32 Ethernet will be added in v2
> > ---
> >  doc/guides/rel_notes/release_20_11.rst |    4 +
> >  lib/librte_net/net_crc_avx.h           |  331
> ++++++++++++++++++++++++++++++++
> >  lib/librte_net/rte_net_crc.c           |   23 ++-
> >  lib/librte_net/rte_net_crc.h           |    1 +
> >  4 files changed, 358 insertions(+), 1 deletions(-)  create mode
> > 100644 lib/librte_net/net_crc_avx.h
> >
> > diff --git a/doc/guides/rel_notes/release_20_11.rst
> > b/doc/guides/rel_notes/release_20_11.rst
> > index df227a1..d6a84ca 100644
> > --- a/doc/guides/rel_notes/release_20_11.rst
> > +++ b/doc/guides/rel_notes/release_20_11.rst
> > @@ -55,6 +55,10 @@ New Features
> >       Also, make sure to start the actual text at the margin.
> >       =======================================================
> >
> > +* **Added support for AVX512 in rte_net CRC calculations.**
> > +
> > +  Added new CRC32 calculation code using AVX512 instruction set
> > + Added new CRC16-CCITT calculation code using AVX512 instruction set
> >
> >  Removed Items
> >  -------------
> > diff --git a/lib/librte_net/net_crc_avx.h
> > b/lib/librte_net/net_crc_avx.h new file mode 100644 index
> > 0000000..d9481d5
> > --- /dev/null
> > +++ b/lib/librte_net/net_crc_avx.h
> 
> ...
> 
> > +static __rte_always_inline uint32_t
> > +crc32_eth_calc_pclmulqdq(
> > +	const uint8_t *data,
> > +	uint32_t data_len,
> > +	uint32_t crc,
> > +	const struct crc_pclmulqdq512_ctx *params) {
> > +	__m256i b;
> > +	__m512i temp, k;
> > +	__m512i qw0 = _mm512_set1_epi64(0);
> > +	__m512i fold0;
> > +	uint32_t n;
> 
> This is loading 64 bytes of data, but if seems like only 16 are available, right?
> Should we use _mm_loadu_si128?
> 
> > +			fold0 = _mm512_xor_si512(fold0, temp);
> > +			goto reduction_128_64;
> > +		}
> > +
> > +		if (unlikely(data_len < 16)) {
> > +			/* 0 to 15 bytes */
> > +			uint8_t buffer[16] __rte_aligned(16);
> > +
> > +			memset(buffer, 0, sizeof(buffer));
> > +			memcpy(buffer, data, data_len);
> 
> I would use _mm_maskz_loadu_epi8, passing a mask register with ((1 <<
> data_len) - 1).
> 
> > +
> > +			fold0 = _mm512_load_si512((const __m128i
> *)buffer);
> > +			fold0 = _mm512_xor_si512(fold0, temp);
> > +			if (unlikely(data_len < 4)) {
> > +				fold0 = xmm_shift_left(fold0, 8 - data_len);
> > +				goto barret_reduction;
> > +			}
> > +			fold0 = xmm_shift_left(fold0, 16 - data_len);
> > +			goto reduction_128_64;
> > +		}
> > +		/* 17 to 31 bytes */
> > +		fold0 = _mm512_loadu_si512((const __m512i *)data);
> 
> Same here. Looks like you are loading too much data?
> 
> > +		fold0 = _mm512_xor_si512(fold0, temp);
> > +		n = 16;
> > +		k = params->rk1_rk2;
> > +		goto partial_bytes;
> > +	}
> 
> ...
> 
> > +
> > +		fold0 = _mm512_xor_si512(fold0, temp);
> > +		fold0 = _mm512_xor_si512(fold0, b);
> 
> You could use _mm512_ternarylogic_epi64 with 0x96 as to do 2x XORs in one
> instruction.
> 
> > +	}
> > +
> > +	/** Reduction 128 -> 32 Assumes: fold holds 128bit folded data */
> > +reduction_128_64:
> > +	k = params->rk5_rk6;
> > +
> > +barret_reduction:
> > +	k = params->rk7_rk8;
> > +	n = crcr32_reduce_64_to_32(fold0, k);
> > +
> > +	return n;
> > +}
> > +
> > +

The latest version of this patch (v3) reworks a lot of this code and addresses the issues noted above.

Mairtin
  

Patch

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index df227a1..d6a84ca 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -55,6 +55,10 @@  New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added support for AVX512 in rte_net CRC calculations.**
+
+  Added new CRC32 calculation code using AVX512 instruction set
+  Added new CRC16-CCITT calculation code using AVX512 instruction set
 
 Removed Items
 -------------
diff --git a/lib/librte_net/net_crc_avx.h b/lib/librte_net/net_crc_avx.h
new file mode 100644
index 0000000..d9481d5
--- /dev/null
+++ b/lib/librte_net/net_crc_avx.h
@@ -0,0 +1,331 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_NET_CRC_AVX_H_
+#define _RTE_NET_CRC_AVX_H_
+
+#include <rte_branch_prediction.h>
+
+#include <rte_vect.h>
+#include <immintrin.h>
+#include <x86intrin.h>
+#include <cpuid.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/** PCLMULQDQ CRC computation context structure */
+struct crc_pclmulqdq512_ctx {
+	__m512i rk1_rk2;
+	__m512i rk3_rk4;
+	__m512i rk5_rk6;
+	__m512i rk7_rk8;
+};
+
+static struct crc_pclmulqdq512_ctx crc32_eth_pclmulqdq __rte_aligned(16);
+
+/**
+ * @brief Performs one folding round
+ *
+ * Logically function operates as follows:
+ *     DATA = READ_NEXT_64BYTES();
+ *     F1 = LSB8(FOLD)
+ *     F2 = MSB8(FOLD)
+ *     T1 = CLMUL(F1, RK1)
+ *     T2 = CLMUL(F2, RK2)
+ *     FOLD = XOR(T1, T2, DATA)
+ *
+ * @param data_block
+ *   64 byte data block
+ * @param precomp
+ *   Precomputed rk1 constant
+ * @param fold
+ *   Current16 byte folded data
+ *
+ * @return
+ *   New 16 byte folded data
+ */
+static __rte_always_inline __m512i
+crcr32_folding_round(__m512i data_block,
+		__m512i precomp,
+		__m512i fold)
+{
+	__m512i tmp0 = _mm512_clmulepi64_epi128(fold, precomp, 0x01);
+	__m512i tmp1 = _mm512_clmulepi64_epi128(fold, precomp, 0x10);
+
+	return _mm512_xor_si512(tmp1, _mm512_xor_si512(data_block, tmp0));
+}
+
+/**
+ * Performs reduction from 128 bits to 64 bits
+ *
+ * @param data128
+ *   128 bits data to be reduced
+ * @param precomp
+ *   precomputed constants rk5, rk6
+ *
+ * @return
+ *  64 bits reduced data
+ */
+
+static __rte_always_inline __m128i
+crcr32_reduce_128_to_64(__m128i data128, __m128i precomp)
+{
+	__m128i tmp0, tmp1, tmp2;
+
+	/* 64b fold */
+	tmp0 = _mm_clmulepi64_si128(data128, precomp, 0x00);
+	tmp1 = _mm_srli_si128(data128, 8);
+	tmp0 = _mm_xor_si128(tmp0, tmp1);
+
+	/* 32b fold */
+	tmp2 = _mm_slli_si128(tmp0, 4);
+	tmp1 = _mm_clmulepi64_si128(tmp2, precomp, 0x10);
+
+	return _mm_xor_si128(tmp1, tmp0);
+}
+
+/**
+ * Performs Barret's reduction from 64 bits to 32 bits
+ *
+ * @param data64
+ *   64 bits data to be reduced
+ * @param precomp
+ *   rk7 precomputed constant
+ *
+ * @return
+ *   reduced 32 bits data
+ */
+
+static __rte_always_inline uint32_t
+crcr32_reduce_64_to_32(__m512i data64, __m512i precomp)
+{
+	static const uint32_t mask1[4] __rte_aligned(64) = {
+		0xffffffff, 0xffffffff, 0x00000000, 0x00000000
+	};
+
+	static const uint32_t mask2[4] __rte_aligned(64) = {
+		0x00000000, 0xffffffff, 0xffffffff, 0xffffffff
+	};
+	__m512i tmp0, tmp1, tmp2;
+
+	tmp0 = _mm512_and_si512(data64, _mm512_load_si512(
+		(const __m512i *)mask2));
+
+	tmp1 = _mm512_clmulepi64_epi128(tmp0, precomp, 0x00);
+	tmp1 = _mm512_xor_si512(tmp1, tmp0);
+	tmp1 = _mm512_and_si512(tmp1, _mm512_load_si512(
+		(const __m128i *)mask1));
+
+	tmp2 = _mm512_clmulepi64_epi128(tmp1, precomp, 0x10);
+	tmp2 = _mm512_xor_si512(tmp2, tmp1);
+	tmp2 = _mm512_xor_si512(tmp2, tmp0);
+
+	return 0;
+}
+
+static const uint8_t crc_xmm_shift_tab[48] __rte_aligned(64) = {
+	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+	0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
+	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+};
+
+/**
+ * Shifts left 128 bit register by specified number of bytes
+ *
+ * @param reg
+ *   128 bit value
+ * @param num
+ *   number of bytes to shift left reg by (0-16)
+ *
+ * @return
+ *   reg << (num * 8)
+ */
+
+static __rte_always_inline __m512i
+xmm_shift_left(__m512i reg, const unsigned int num)
+{
+	const __m512i *p = (const __m512i *)(crc_xmm_shift_tab + 16 - num);
+
+	/* TODO: Check unaligned load*/
+	return _mm512_shuffle_epi8(reg, _mm512_load_si512(p));
+}
+
+static __rte_always_inline uint32_t
+crc32_eth_calc_pclmulqdq(
+	const uint8_t *data,
+	uint32_t data_len,
+	uint32_t crc,
+	const struct crc_pclmulqdq512_ctx *params)
+{
+	__m256i b;
+	__m512i temp, k;
+	__m512i qw0 = _mm512_set1_epi64(0);
+	__m512i fold0;
+	uint32_t n;
+
+	/* Get CRC init value */
+	b = _mm256_insert_epi32(_mm256_setzero_si256(), crc, 0);
+	temp = _mm512_inserti32x8(_mm512_setzero_si512(), b, 0);
+
+	/* align data to 16B*/
+	if (unlikely(data_len < 64)) {
+		if (unlikely(data_len == 16)) {
+			/* 16 bytes */
+			/* TODO: Unaligned load not working */
+			fold0 = _mm512_load_epi64((const __m512i *)data);
+			fold0 = _mm512_xor_si512(fold0, temp);
+			goto reduction_128_64;
+		}
+
+		if (unlikely(data_len < 16)) {
+			/* 0 to 15 bytes */
+			uint8_t buffer[16] __rte_aligned(16);
+
+			memset(buffer, 0, sizeof(buffer));
+			memcpy(buffer, data, data_len);
+
+			fold0 = _mm512_load_si512((const __m128i *)buffer);
+			fold0 = _mm512_xor_si512(fold0, temp);
+			if (unlikely(data_len < 4)) {
+				fold0 = xmm_shift_left(fold0, 8 - data_len);
+				goto barret_reduction;
+			}
+			fold0 = xmm_shift_left(fold0, 16 - data_len);
+			goto reduction_128_64;
+		}
+		/* 17 to 31 bytes */
+		fold0 = _mm512_loadu_si512((const __m512i *)data);
+		fold0 = _mm512_xor_si512(fold0, temp);
+		n = 16;
+		k = params->rk1_rk2;
+		goto partial_bytes;
+	}
+
+	/*Loop of folds*/
+	/** At least 32 bytes in the buffer */
+	/** Apply CRC initial value */
+	fold0 = _mm512_loadu_si512((const __m512i *)data);
+	fold0 = _mm512_xor_si512(fold0, temp);
+
+	/** Main folding loop - the last 32 bytes is processed separately */
+	k = params->rk1_rk2;
+	for (n = 64; (n + 64) <= data_len; n += 64) {
+		qw0 = _mm512_loadu_si512((const __m512i *)&data[n]);
+		fold0 = crcr32_folding_round(qw0, k, fold0);
+	}
+
+	/* 256 to 128 fold */
+	/* Check this */
+	k = params->rk3_rk4;
+	fold0 = crcr32_folding_round(temp, k, fold0);
+	n += 64;
+
+	/* Remainder */
+partial_bytes:
+	if (likely(n < data_len)) {
+
+		const uint32_t mask3[4] __rte_aligned(16) = {
+			0x80808080, 0x80808080, 0x80808080, 0x80808080
+		};
+
+		const uint8_t shf_table[32] __rte_aligned(16) = {
+			0x00, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
+			0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f,
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
+		};
+
+		__m128i last16;
+		__m512i a, b;
+
+		last16 = _mm_loadu_si128((const __m128i *)&data[data_len - 16]);
+
+		RTE_SET_USED(last16);
+
+		temp = _mm512_loadu_si512((const __m512i *)
+			&shf_table[data_len & 15]);
+		a = _mm512_shuffle_epi8(fold0, temp);
+
+		temp = _mm512_xor_si512(temp,
+			_mm512_load_si512((const __m512i *)mask3));
+		b = _mm512_shuffle_epi8(fold0, temp);
+
+		/* k = rk1 & rk2 */
+		temp = _mm512_clmulepi64_epi128(a, k, 0x01);
+		fold0 = _mm512_clmulepi64_epi128(a, k, 0x10);
+
+		fold0 = _mm512_xor_si512(fold0, temp);
+		fold0 = _mm512_xor_si512(fold0, b);
+	}
+
+	/** Reduction 128 -> 32 Assumes: fold holds 128bit folded data */
+reduction_128_64:
+	k = params->rk5_rk6;
+
+barret_reduction:
+	k = params->rk7_rk8;
+	n = crcr32_reduce_64_to_32(fold0, k);
+
+	return n;
+}
+
+
+static inline void
+rte_net_crc_avx512_init(void)
+{
+	__m128i a;
+	uint64_t k1, k2, k3, k4, k5, k6;
+	uint64_t p = 0, q = 0;
+
+	/** Initialize CRC32 data */
+	/* 256 fold constants*/
+	k1 = 0xe95c1271LLU;
+	k2 = 0xce3371cbLLU;
+
+	/*256 - 128 fold constants */
+	k3 = 0x910eeec1LLU;
+	k4 = 0x33fff533LLU;
+
+	k5 = 0xccaa009eLLU;
+	k6 = 0x163cd6124LLU;
+	q =  0x1f7011640LLU;
+	p =  0x1db710641LLU;
+
+	/** Save the params in context structure */
+	a = _mm_set_epi64x(k2, k1);
+	crc32_eth_pclmulqdq.rk1_rk2 = _mm512_broadcast_i32x4(a);
+	crc32_eth_pclmulqdq.rk3_rk4 = _mm512_setr_epi64(
+		k3, k4, 0, 0, 0, 0, 0, 0);
+	crc32_eth_pclmulqdq.rk5_rk6 = _mm512_setr_epi64(
+		k5, k6, 0, 0, 0, 0, 0, 0);
+	crc32_eth_pclmulqdq.rk7_rk8 = _mm512_setr_epi64(
+		q, p, 0, 0, 0, 0, 0, 0);
+	/**
+	 * Reset the register as following calculation may
+	 * use other data types such as float, double, etc.
+	 */
+	_mm_empty();
+
+}
+
+static inline uint32_t
+rte_crc32_eth_avx512_handler(const uint8_t *data,
+	uint32_t data_len)
+{
+	return ~crc32_eth_calc_pclmulqdq(data,
+		data_len,
+		0xffffffffUL,
+		&crc32_eth_pclmulqdq);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_NET_CRC_AVX_H_ */
diff --git a/lib/librte_net/rte_net_crc.c b/lib/librte_net/rte_net_crc.c
index 9fd4794..b2b2bc1 100644
--- a/lib/librte_net/rte_net_crc.c
+++ b/lib/librte_net/rte_net_crc.c
@@ -10,12 +10,18 @@ 
 #include <rte_common.h>
 #include <rte_net_crc.h>
 
-#if defined(RTE_ARCH_X86_64) && defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ)
+#if defined(RTE_ARCH_X86_64) && defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ) \
+	&& defined(RTE_MACHINE_CPUFLAG_AVX512F)
+#define X86_64_AVX512F_PCLMULQDQ     1
+#elif defined(RTE_ARCH_X86_64) && defined(RTE_MACHINE_CPUFLAG_PCLMULQDQ)
 #define X86_64_SSE42_PCLMULQDQ     1
 #elif defined(RTE_ARCH_ARM64) && defined(RTE_MACHINE_CPUFLAG_PMULL)
 #define ARM64_NEON_PMULL           1
 #endif
 
+#ifdef X86_64_AVX512F_PCLMULQDQ
+#include <net_crc_avx.h>
+#endif
 #ifdef X86_64_SSE42_PCLMULQDQ
 #include <net_crc_sse.h>
 #elif defined ARM64_NEON_PMULL
@@ -48,6 +54,12 @@ 
 	[RTE_NET_CRC32_ETH] = rte_crc32_eth_handler,
 };
 
+#ifdef X86_64_AVX512F_PCLMULQDQ
+static rte_net_crc_handler handlers_avx512[] = {
+	[RTE_NET_CRC32_ETH] = rte_crc32_eth_avx512_handler,
+};
+#endif
+
 #ifdef X86_64_SSE42_PCLMULQDQ
 static rte_net_crc_handler handlers_sse42[] = {
 	[RTE_NET_CRC16_CCITT] = rte_crc16_ccitt_sse42_handler,
@@ -157,6 +169,11 @@ 
 			handlers = handlers_neon;
 			break;
 		}
+#elif defined X86_64_AVX512F_PCLMULQDQ
+		/* fall-through */
+	case RTE_NET_CRC_AVX512:
+			handlers = handlers_avx512;
+			break;
 #endif
 		/* fall-through */
 	case RTE_NET_CRC_SCALAR:
@@ -197,6 +214,10 @@ 
 		rte_net_crc_neon_init();
 	}
 #endif
+#ifdef X86_64_AVX512F_PCLMULQDQ
+	alg = RTE_NET_CRC_AVX512;
+	rte_net_crc_avx512_init();
+#endif
 
 	rte_net_crc_set_alg(alg);
 }
diff --git a/lib/librte_net/rte_net_crc.h b/lib/librte_net/rte_net_crc.h
index 16e85ca..a7d2ed0 100644
--- a/lib/librte_net/rte_net_crc.h
+++ b/lib/librte_net/rte_net_crc.h
@@ -23,6 +23,7 @@  enum rte_net_crc_alg {
 	RTE_NET_CRC_SCALAR = 0,
 	RTE_NET_CRC_SSE42,
 	RTE_NET_CRC_NEON,
+	RTE_NET_CRC_AVX512,
 };
 
 /**