From patchwork Tue Oct 6 15:03:03 2020
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 79784
X-Patchwork-Delegate: david.marchand@redhat.com
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com,
 Konstantin Ananyev, stable@dpdk.org
Date: Tue, 6 Oct 2020 16:03:03 +0100
Message-Id: <20201006150316.5776-2-konstantin.ananyev@intel.com>
In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com>
References: <20201005184526.7465-1-konstantin.ananyev@intel.com>
 <20201006150316.5776-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH v4 01/14] acl: fix x86 build when compiler doesn't
 support AVX2

Right now the dummy version of rte_acl_classify_avx2() is defined only when
both x86 and AVX2 support are not detected, though it should be defined for
the non-AVX2 case alone.
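For clarity, a minimal sketch of the preprocessor layout this patch moves to
in lib/librte_acl/rte_acl.c (abbreviated excerpt; the -ENOTSUP return value
shown for the stubs is an assumption based on the library's other dummy
classify functions, see the actual diff below):

/* in lib/librte_acl/rte_acl.c (abbreviated sketch, not the full file) */

#ifndef CC_AVX2_SUPPORT
/* built whenever the compiler lacks AVX2 support, even on x86 targets */
int
rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
	__rte_unused const uint8_t **data,
	__rte_unused uint32_t *results,
	__rte_unused uint32_t num,
	__rte_unused uint32_t categories)
{
	return -ENOTSUP;
}
#endif

#ifndef RTE_ARCH_X86
/* the SSE stub, by contrast, stays under the architecture guard only */
int
rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx,
	__rte_unused const uint8_t **data,
	__rte_unused uint32_t *results,
	__rte_unused uint32_t num,
	__rte_unused uint32_t categories)
{
	return -ENOTSUP;
}
#endif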
Fixes: e53ce4e41379 ("acl: remove use of weak functions") Cc: stable@dpdk.org Signed-off-by: Konstantin Ananyev Reviewed-by: David Marchand --- lib/librte_acl/rte_acl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c index 777ec4d340..715b023592 100644 --- a/lib/librte_acl/rte_acl.c +++ b/lib/librte_acl/rte_acl.c @@ -16,7 +16,6 @@ static struct rte_tailq_elem rte_acl_tailq = { }; EAL_REGISTER_TAILQ(rte_acl_tailq) -#ifndef RTE_ARCH_X86 #ifndef CC_AVX2_SUPPORT /* * If the compiler doesn't support AVX2 instructions, @@ -33,6 +32,7 @@ rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx, } #endif +#ifndef RTE_ARCH_X86 int rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx, __rte_unused const uint8_t **data, From patchwork Tue Oct 6 15:03:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ananyev, Konstantin" X-Patchwork-Id: 79785 X-Patchwork-Delegate: david.marchand@redhat.com Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id 9FBA3A04BB; Tue, 6 Oct 2020 17:08:49 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 4405A1B952; Tue, 6 Oct 2020 17:08:12 +0200 (CEST) Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id 9B4891B87A; Tue, 6 Oct 2020 17:08:09 +0200 (CEST) IronPort-SDR: pZEIqABxlpfHspF3AfZSMgUNEpQTiya9A7DAq8rqnUgERwgcwWuqK9ZNJogfcz2Dfd6JcrWcDU /Q+hFDAVB/LQ== X-IronPort-AV: E=McAfee;i="6000,8403,9765"; a="143919509" X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="143919509" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Oct 2020 08:03:36 -0700 IronPort-SDR: oiU8j1YqZcryCeYRZ/Se0Mc+DeIMPTabs1xJpqBRhiF8gCyyBW4PFms1FIHZsl3puz+FjU3Chw ZQ3gJkXWBWPg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="518315358" Received: from sivswdev08.ir.intel.com ([10.237.217.47]) by fmsmga005.fm.intel.com with ESMTP; 06 Oct 2020 08:03:34 -0700 From: Konstantin Ananyev To: dev@dpdk.org Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev , stable@dpdk.org Date: Tue, 6 Oct 2020 16:03:04 +0100 Message-Id: <20201006150316.5776-3-konstantin.ananyev@intel.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com> References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com> Subject: [dpdk-dev] [PATCH v4 02/14] doc: fix missing classify methods in ACL guide X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Add brief description for missing ACL classify algorithms: RTE_ACL_CLASSIFY_NEON and RTE_ACL_CLASSIFY_ALTIVEC. 
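As a usage illustration for the classify methods documented here, a minimal
sketch of requesting a specific algorithm and falling back to the library
default. The helper name and the single-category setup are illustrative only;
the sketch assumes a context prepared via the usual rte_acl_create()/
rte_acl_add_rules()/rte_acl_build() flow, and relies on the algorithm-support
check that patch 04 of this series adds to rte_acl_set_ctx_classify():

#include <rte_acl.h>

/* results[] must hold one entry per flow per category; one category here. */
static int
classify_prefer_neon(struct rte_acl_ctx *acx, const uint8_t **data,
	uint32_t *results, uint32_t num)
{
	int ret;

	/* ask for NEON explicitly; -ENOTSUP means this build/CPU lacks it */
	ret = rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_NEON);
	if (ret == -ENOTSUP)
		ret = rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_DEFAULT);
	if (ret != 0)
		return ret;

	return rte_acl_classify(acx, data, results, num, 1);
}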
Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") Fixes: 1d73135f9f1c ("acl: add AltiVec for ppc64") Cc: stable@dpdk.org Signed-off-by: Konstantin Ananyev Reviewed-by: David Marchand --- doc/guides/prog_guide/packet_classif_access_ctrl.rst | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst index 0345512b9e..daf03e6d7a 100644 --- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst +++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst @@ -373,6 +373,12 @@ There are several implementations of classify algorithm: * **RTE_ACL_CLASSIFY_AVX2**: vector implementation, can process up to 16 flows in parallel. Requires AVX2 support. +* **RTE_ACL_CLASSIFY_NEON**: vector implementation, can process up to 8 flows + in parallel. Requires NEON support. + +* **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8 + flows in parallel. Requires ALTIVEC support. + It is purely a runtime decision which method to choose, there is no build-time difference. All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel. At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation. From patchwork Tue Oct 6 15:03:05 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ananyev, Konstantin" X-Patchwork-Id: 79786 X-Patchwork-Delegate: david.marchand@redhat.com Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id 84C43A04BB; Tue, 6 Oct 2020 17:09:16 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id A26051B9EB; Tue, 6 Oct 2020 17:08:16 +0200 (CEST) Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id EFEB61B9EB for ; Tue, 6 Oct 2020 17:08:13 +0200 (CEST) IronPort-SDR: 77VVAwjSV1vrK4FL84Tit7ieiSGDB7xXFLvnfut2x1hAG/wJaFiW0t2MusWymJP37i8Nj6C91s Mf2vf2Qoyr2A== X-IronPort-AV: E=McAfee;i="6000,8403,9765"; a="143919523" X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="143919523" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Oct 2020 08:03:37 -0700 IronPort-SDR: xFA6+UspEzpwiuHqGmuGeQO+zCpxCWfGSQ4qFORi4E0tBg6kbzqjGIRp48GcRHNUoDosPeeRnH kRV9pIxgf/tw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="518315377" Received: from sivswdev08.ir.intel.com ([10.237.217.47]) by fmsmga005.fm.intel.com with ESMTP; 06 Oct 2020 08:03:36 -0700 From: Konstantin Ananyev To: dev@dpdk.org Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev Date: Tue, 6 Oct 2020 16:03:05 +0100 Message-Id: <20201006150316.5776-4-konstantin.ananyev@intel.com> X-Mailer: 
git-send-email 2.18.0
In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com>
References: <20201005184526.7465-1-konstantin.ananyev@intel.com>
 <20201006150316.5776-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH v4 03/14] acl: remove unused enum value

Remove the unused enum value RTE_ACL_CLASSIFY_NUM. This enum value is not
used inside DPDK, while it prevents adding new classify algorithms without
causing an ABI breakage. Note that this change introduces a formal ABI
incompatibility with previous versions of the ACL library.

Signed-off-by: Konstantin Ananyev
Reviewed-by: Ruifeng Wang
---
 doc/guides/rel_notes/deprecation.rst | 4 ---- doc/guides/rel_notes/release_20_11.rst | 4 ++++ lib/librte_acl/rte_acl.h | 1 - 3 files changed, 4 insertions(+), 5 deletions(-) diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst index 8080a28896..938e967c8f 100644 --- a/doc/guides/rel_notes/deprecation.rst +++ b/doc/guides/rel_notes/deprecation.rst @@ -209,10 +209,6 @@ Deprecation Notices - https://patches.dpdk.org/patch/71457/ - https://patches.dpdk.org/patch/71456/ -* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value will be removed. - This enum value is not used inside DPDK, while it prevents to add new - classify algorithms without causing an ABI breakage. - * sched: To allow more traffic classes, flexible mapping of pipe queues to traffic classes, and subport level configuration of pipes and queues changes will be made to macros, data structures and API functions defined diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst index 6d8c24413d..e0de60c0c2 100644 --- a/doc/guides/rel_notes/release_20_11.rst +++ b/doc/guides/rel_notes/release_20_11.rst @@ -210,6 +210,10 @@ API Changes * bpf: ``RTE_BPF_XTYPE_NUM`` has been dropped from ``rte_bpf_xtype``. +* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value has been removed. + This enum value was not used inside DPDK, while it prevented to add new + classify algorithms without causing an ABI breakage. + ABI Changes ----------- diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h index aa22e70c6e..b814423a63 100644 --- a/lib/librte_acl/rte_acl.h +++ b/lib/librte_acl/rte_acl.h @@ -241,7 +241,6 @@ enum rte_acl_classify_alg { RTE_ACL_CLASSIFY_AVX2 = 3, /**< requires AVX2 support. */ RTE_ACL_CLASSIFY_NEON = 4, /**< requires NEON support. */ RTE_ACL_CLASSIFY_ALTIVEC = 5, /**< requires ALTIVEC support. */ - RTE_ACL_CLASSIFY_NUM /* should always be the last one. */
};

/**

From patchwork Tue Oct 6 15:03:06 2020
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 79786
X-Patchwork-Delegate: david.marchand@redhat.com
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com,
 Konstantin Ananyev
Date: Tue, 6 Oct 2020 16:03:06 +0100
Message-Id: <20201006150316.5776-5-konstantin.ananyev@intel.com>
In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com>
References: <20201005184526.7465-1-konstantin.ananyev@intel.com>
 <20201006150316.5776-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH v4 04/14] acl: remove library constructor

Right now the ACL library determines the best possible (default) classify
method on a given platform with a special constructor function,
rte_acl_init(). This patch makes the following changes:

- Move selection of the default classify method into a separate private
  function and call it for each ACL context creation (rte_acl_create()).
- Remove the library constructor function.
- Make rte_acl_set_ctx_classify() check that the requested algorithm is
  supported on the given platform.

The purpose of these changes is to improve and simplify the algorithm
selection process and to prepare the ACL library for integration with the
"add max SIMD bitwidth to EAL"
(https://patches.dpdk.org/project/dpdk/list/?series=11831) patch set.

Signed-off-by: Konstantin Ananyev
---
 lib/librte_acl/rte_acl.c | 166 ++++++++++++++++++++++++++++++--------- lib/librte_acl/rte_acl.h | 1 + 2 files changed, 132 insertions(+), 35 deletions(-) diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c index 715b023592..863549a38b 100644 --- a/lib/librte_acl/rte_acl.c +++ b/lib/librte_acl/rte_acl.c @@ -79,57 +79,153 @@ static const rte_acl_classify_t classify_fns[] = { [RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec, }; -/* by default, use always available scalar code path.
*/ -static enum rte_acl_classify_alg rte_acl_default_classify = - RTE_ACL_CLASSIFY_SCALAR; +/* + * Helper function for acl_check_alg. + * Check support for ARM specific classify methods. + */ +static int +acl_check_alg_arm(enum rte_acl_classify_alg alg) +{ + if (alg == RTE_ACL_CLASSIFY_NEON) { +#if defined(RTE_ARCH_ARM64) + return 0; +#elif defined(RTE_ARCH_ARM) + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) + return 0; + return -ENOTSUP; +#else + return -ENOTSUP; +#endif + } + + return -EINVAL; +} -static void -rte_acl_set_default_classify(enum rte_acl_classify_alg alg) +/* + * Helper function for acl_check_alg. + * Check support for PPC specific classify methods. + */ +static int +acl_check_alg_ppc(enum rte_acl_classify_alg alg) { - rte_acl_default_classify = alg; + if (alg == RTE_ACL_CLASSIFY_ALTIVEC) { +#if defined(RTE_ARCH_PPC_64) + return 0; +#else + return -ENOTSUP; +#endif + } + + return -EINVAL; } -extern int -rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg) +/* + * Helper function for acl_check_alg. + * Check support for x86 specific classify methods. + */ +static int +acl_check_alg_x86(enum rte_acl_classify_alg alg) { - if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns)) - return -EINVAL; + if (alg == RTE_ACL_CLASSIFY_AVX2) { +#ifdef CC_AVX2_SUPPORT + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) + return 0; +#endif + return -ENOTSUP; + } - ctx->alg = alg; - return 0; + if (alg == RTE_ACL_CLASSIFY_SSE) { +#ifdef RTE_ARCH_X86 + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) + return 0; +#endif + return -ENOTSUP; + } + + return -EINVAL; } /* - * Select highest available classify method as default one. - * Note that CLASSIFY_AVX2 should be set as a default only - * if both conditions are met: - * at build time compiler supports AVX2 and target cpu supports AVX2. + * Check if input alg is supported by given platform/binary. + * Note that both conditions should be met: + * - at build time compiler supports ISA used by given methods + * - at run time target cpu supports necessary ISA. */ -RTE_INIT(rte_acl_init) +static int +acl_check_alg(enum rte_acl_classify_alg alg) { - enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT; + switch (alg) { + case RTE_ACL_CLASSIFY_NEON: + return acl_check_alg_arm(alg); + case RTE_ACL_CLASSIFY_ALTIVEC: + return acl_check_alg_ppc(alg); + case RTE_ACL_CLASSIFY_AVX2: + case RTE_ACL_CLASSIFY_SSE: + return acl_check_alg_x86(alg); + /* scalar method is supported on all platforms */ + case RTE_ACL_CLASSIFY_SCALAR: + return 0; + default: + return -EINVAL; + } +} -#if defined(RTE_ARCH_ARM64) - alg = RTE_ACL_CLASSIFY_NEON; -#elif defined(RTE_ARCH_ARM) - if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) - alg = RTE_ACL_CLASSIFY_NEON; +/* + * Get preferred alg for given platform. + */ +static enum rte_acl_classify_alg +acl_get_best_alg(void) +{ + /* + * array of supported methods for each platform. + * Note that order is important - from most to less preferable. 
+ */ + static const enum rte_acl_classify_alg alg[] = { +#if defined(RTE_ARCH_ARM) + RTE_ACL_CLASSIFY_NEON, #elif defined(RTE_ARCH_PPC_64) - alg = RTE_ACL_CLASSIFY_ALTIVEC; -#else -#ifdef CC_AVX2_SUPPORT - if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) - alg = RTE_ACL_CLASSIFY_AVX2; - else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) -#else - if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) + RTE_ACL_CLASSIFY_ALTIVEC, +#elif defined(RTE_ARCH_X86) + RTE_ACL_CLASSIFY_AVX2, + RTE_ACL_CLASSIFY_SSE, #endif - alg = RTE_ACL_CLASSIFY_SSE; + RTE_ACL_CLASSIFY_SCALAR, + }; -#endif - rte_acl_set_default_classify(alg); + uint32_t i; + + /* find best possible alg */ + for (i = 0; i != RTE_DIM(alg) && acl_check_alg(alg[i]) != 0; i++) + ; + + /* we always have to find something suitable */ + RTE_VERIFY(i != RTE_DIM(alg)); + return alg[i]; +} + +extern int +rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg) +{ + int32_t rc; + + /* formal parameters check */ + if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns)) + return -EINVAL; + + /* user asked us to select the *best* one */ + if (alg == RTE_ACL_CLASSIFY_DEFAULT) + alg = acl_get_best_alg(); + + /* check that given alg is supported */ + rc = acl_check_alg(alg); + if (rc != 0) + return rc; + + ctx->alg = alg; + return 0; } + int rte_acl_classify_alg(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t num, uint32_t categories, @@ -262,7 +358,7 @@ rte_acl_create(const struct rte_acl_param *param) ctx->max_rules = param->max_rule_num; ctx->rule_sz = param->rule_size; ctx->socket_id = param->socket_id; - ctx->alg = rte_acl_default_classify; + ctx->alg = acl_get_best_alg(); strlcpy(ctx->name, param->name, sizeof(ctx->name)); te->data = (void *) ctx; diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h index b814423a63..3999f15ded 100644 --- a/lib/librte_acl/rte_acl.h +++ b/lib/librte_acl/rte_acl.h @@ -329,6 +329,7 @@ rte_acl_classify_alg(const struct rte_acl_ctx *ctx, * existing algorithm, and that it could be run on the given CPU. * @return * - -EINVAL if the parameters are invalid. + * - -ENOTSUP requested algorithm is not supported by given platform. * - Zero if operation completed successfully. 
*/ extern int From patchwork Tue Oct 6 15:03:07 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ananyev, Konstantin" X-Patchwork-Id: 79788 X-Patchwork-Delegate: david.marchand@redhat.com Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id 607EEA04BB; Tue, 6 Oct 2020 17:10:10 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 993851B9E6; Tue, 6 Oct 2020 17:08:26 +0200 (CEST) Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id 850B11B87A for ; Tue, 6 Oct 2020 17:08:24 +0200 (CEST) IronPort-SDR: ea2/Z4ERa7MZpr1c49nfCFECzxe0bphpih9w2gZ8T6nFmwaMGCah+hnMSvKm9C+0nyzjTN+02r cctUEz4tU/1A== X-IronPort-AV: E=McAfee;i="6000,8403,9765"; a="143919569" X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="143919569" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Oct 2020 08:03:41 -0700 IronPort-SDR: JKFYYGhmImaamQHWPKrlnihpTuCBwPZLPuKJHzfDTIOEUK+YW6kJALaz/zt8wXuCmPcTwL1bFW rAViwyIWNmeA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="518315403" Received: from sivswdev08.ir.intel.com ([10.237.217.47]) by fmsmga005.fm.intel.com with ESMTP; 06 Oct 2020 08:03:39 -0700 From: Konstantin Ananyev To: dev@dpdk.org Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev Date: Tue, 6 Oct 2020 16:03:07 +0100 Message-Id: <20201006150316.5776-6-konstantin.ananyev@intel.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com> References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com> Subject: [dpdk-dev] [PATCH v4 05/14] app/acl: few small improvements X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" - enhance output to print extra stats - use rte_rdtsc_precise() for cycle measurements Signed-off-by: Konstantin Ananyev --- app/test-acl/main.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/app/test-acl/main.c b/app/test-acl/main.c index 0a5dfb621d..d9b65517cb 100644 --- a/app/test-acl/main.c +++ b/app/test-acl/main.c @@ -862,9 +862,10 @@ search_ip5tuples(__rte_unused void *arg) { uint64_t pkt, start, tm; uint32_t i, lcore; + long double st; lcore = rte_lcore_id(); - start = rte_rdtsc(); + start = rte_rdtsc_precise(); pkt = 0; for (i = 0; i != config.iter_num; i++) { @@ -872,12 +873,16 @@ search_ip5tuples(__rte_unused void *arg) config.trace_step, config.alg.name); } - tm = rte_rdtsc() - start; + tm = rte_rdtsc_precise() - start; + + st = (long double)tm / rte_get_timer_hz(); dump_verbose(DUMP_NONE, stdout, "%s @lcore %u: %" PRIu32 " iterations, %" PRIu64 " pkts, %" - PRIu32 " categories, %" PRIu64 " cycles, %#Lf cycles/pkt\n", - __func__, lcore, i, pkt, config.run_categories, - tm, (pkt == 0) ? 
0 : (long double)tm / pkt); + PRIu32 " categories, %" PRIu64 " cycles (%.2Lf sec), " + "%.2Lf cycles/pkt, %.2Lf pkt/sec\n", + __func__, lcore, i, pkt, + config.run_categories, tm, st, + (pkt == 0) ? 0 : (long double)tm / pkt, pkt / st); return 0; } From patchwork Tue Oct 6 15:03:08 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ananyev, Konstantin" X-Patchwork-Id: 79789 X-Patchwork-Delegate: david.marchand@redhat.com Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id 2B6ECA04BB; Tue, 6 Oct 2020 17:10:32 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id E1B971BAB9; Tue, 6 Oct 2020 17:08:28 +0200 (CEST) Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id 7B1BD1BA9F for ; Tue, 6 Oct 2020 17:08:26 +0200 (CEST) IronPort-SDR: mH1X+hj65DVV8eB/vA24VQbeV5Q8xKd1IbYiewbTi6XrgEtrFdsmWcxp4j/p/1rBNPzqVXqPG+ 2RonGE8JyHJA== X-IronPort-AV: E=McAfee;i="6000,8403,9765"; a="143919594" X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="143919594" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Oct 2020 08:03:42 -0700 IronPort-SDR: EWIoS6FIDucncTXkTfck/kQUXyZozDWGWjPeOsM70gkxxPs19sMoGulsL9VQ+kt+z8ri1/dUxl CGwArLtm3SAg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="518315410" Received: from sivswdev08.ir.intel.com ([10.237.217.47]) by fmsmga005.fm.intel.com with ESMTP; 06 Oct 2020 08:03:41 -0700 From: Konstantin Ananyev To: dev@dpdk.org Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev Date: Tue, 6 Oct 2020 16:03:08 +0100 Message-Id: <20201006150316.5776-7-konstantin.ananyev@intel.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com> References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com> Subject: [dpdk-dev] [PATCH v4 06/14] test/acl: expand classify test coverage X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Make classify test to run for all supported methods. Signed-off-by: Konstantin Ananyev --- app/test/test_acl.c | 103 ++++++++++++++++++++++---------------------- 1 file changed, 51 insertions(+), 52 deletions(-) diff --git a/app/test/test_acl.c b/app/test/test_acl.c index 316bf4d065..333b347579 100644 --- a/app/test/test_acl.c +++ b/app/test/test_acl.c @@ -266,22 +266,20 @@ rte_acl_ipv4vlan_build(struct rte_acl_ctx *ctx, } /* - * Test scalar and SSE ACL lookup. + * Test ACL lookup (selected alg). 
*/ static int -test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[], - size_t dim) +test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[], + const uint8_t *data[], size_t dim, enum rte_acl_classify_alg alg) { - int ret, i; - uint32_t result, count; + int32_t ret; + uint32_t i, result, count; uint32_t results[dim * RTE_ACL_MAX_CATEGORIES]; - const uint8_t *data[dim]; - /* swap all bytes in the data to network order */ - bswap_test_data(test_data, dim, 1); - /* store pointers to test data */ - for (i = 0; i < (int) dim; i++) - data[i] = (uint8_t *)&test_data[i]; + /* set given classify alg, skip test if alg is not supported */ + ret = rte_acl_set_ctx_classify(acx, alg); + if (ret == -ENOTSUP) + return 0; /** * these will run quite a few times, it's necessary to test code paths @@ -291,12 +289,13 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[], ret = rte_acl_classify(acx, data, results, count, RTE_ACL_MAX_CATEGORIES); if (ret != 0) { - printf("Line %i: SSE classify failed!\n", __LINE__); - goto err; + printf("Line %i: classify(alg=%d) failed!\n", + __LINE__, alg); + return ret; } /* check if we allow everything we should allow */ - for (i = 0; i < (int) count; i++) { + for (i = 0; i < count; i++) { result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_ALLOW]; if (result != test_data[i].allow) { @@ -304,63 +303,63 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[], "(expected %"PRIu32" got %"PRIu32")!\n", __LINE__, i, test_data[i].allow, result); - ret = -EINVAL; - goto err; + return -EINVAL; } } /* check if we deny everything we should deny */ - for (i = 0; i < (int) count; i++) { + for (i = 0; i < count; i++) { result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_DENY]; if (result != test_data[i].deny) { printf("Line %i: Error in deny results at %i " "(expected %"PRIu32" got %"PRIu32")!\n", __LINE__, i, test_data[i].deny, result); - ret = -EINVAL; - goto err; + return -EINVAL; } } } - /* make a quick check for scalar */ - ret = rte_acl_classify_alg(acx, data, results, - dim, RTE_ACL_MAX_CATEGORIES, - RTE_ACL_CLASSIFY_SCALAR); - if (ret != 0) { - printf("Line %i: scalar classify failed!\n", __LINE__); - goto err; - } + /* restore default classify alg */ + return rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_DEFAULT); +} - /* check if we allow everything we should allow */ - for (i = 0; i < (int) dim; i++) { - result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_ALLOW]; - if (result != test_data[i].allow) { - printf("Line %i: Error in allow results at %i " - "(expected %"PRIu32" got %"PRIu32")!\n", - __LINE__, i, test_data[i].allow, - result); - ret = -EINVAL; - goto err; - } - } +/* + * Test ACL lookup (all possible methods). 
+ */ +static int +test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[], + size_t dim) +{ + int32_t ret; + uint32_t i; + const uint8_t *data[dim]; - /* check if we deny everything we should deny */ - for (i = 0; i < (int) dim; i++) { - result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_DENY]; - if (result != test_data[i].deny) { - printf("Line %i: Error in deny results at %i " - "(expected %"PRIu32" got %"PRIu32")!\n", - __LINE__, i, test_data[i].deny, - result); - ret = -EINVAL; - goto err; - } - } + static const enum rte_acl_classify_alg alg[] = { + RTE_ACL_CLASSIFY_SCALAR, + RTE_ACL_CLASSIFY_SSE, + RTE_ACL_CLASSIFY_AVX2, + RTE_ACL_CLASSIFY_NEON, + RTE_ACL_CLASSIFY_ALTIVEC, + }; + + /* swap all bytes in the data to network order */ + bswap_test_data(test_data, dim, 1); + + /* store pointers to test data */ + for (i = 0; i < dim; i++) + data[i] = (uint8_t *)&test_data[i]; ret = 0; + for (i = 0; i != RTE_DIM(alg); i++) { + ret = test_classify_alg(acx, test_data, data, dim, alg[i]); + if (ret < 0) { + printf("Line %i: %s() for alg=%d failed, errno=%d\n", + __LINE__, __func__, alg[i], -ret); + break; + } + } -err: /* swap data back to cpu order so that next time tests don't fail */ bswap_test_data(test_data, dim, 0); return ret; From patchwork Tue Oct 6 15:03:09 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ananyev, Konstantin" X-Patchwork-Id: 79790 X-Patchwork-Delegate: david.marchand@redhat.com Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id 1D021A04BB; Tue, 6 Oct 2020 17:10:56 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 5AFC61BACC; Tue, 6 Oct 2020 17:08:30 +0200 (CEST) Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id 4A6251BA9F for ; Tue, 6 Oct 2020 17:08:28 +0200 (CEST) IronPort-SDR: mRuj84haM8j5V5uLw7Dkg749aJqIrGhWN3oUzunYqqDn9FoeebdHtnQMHGNBFEXOmCQGZiUXpJ xT1R9AZIuYGA== X-IronPort-AV: E=McAfee;i="6000,8403,9765"; a="143919602" X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="143919602" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Oct 2020 08:03:44 -0700 IronPort-SDR: xXOXFM/oloX39w6PIYP05vvmnCjPbmiFsiV6j/qNQtIQ8gkrcVXLX77CqPWJFGmyMdX2GUQ33w X2n0crH5yGOg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="518315419" Received: from sivswdev08.ir.intel.com ([10.237.217.47]) by fmsmga005.fm.intel.com with ESMTP; 06 Oct 2020 08:03:42 -0700 From: Konstantin Ananyev To: dev@dpdk.org Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev Date: Tue, 6 Oct 2020 16:03:09 +0100 Message-Id: <20201006150316.5776-8-konstantin.ananyev@intel.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com> References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com> Subject: [dpdk-dev] [PATCH v4 07/14] acl: add infrastructure to support AVX512 classify X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
Errors-To: dev-bounces@dpdk.org Sender: "dev" Add necessary changes to support new AVX512 specific ACL classify algorithm: - changes in meson.build to check that build tools (compiler, assembler, etc.) do properly support AVX512. - run-time checks to make sure target platform does support AVX512. - dummy rte_acl_classify_avx512() for targets where AVX512 implementation couldn't be properly supported. Signed-off-by: Konstantin Ananyev Acked-by: Bruce Richardson --- config/x86/meson.build | 3 ++- lib/librte_acl/acl.h | 8 ++++++ lib/librte_acl/acl_run_avx512.c | 29 ++++++++++++++++++++ lib/librte_acl/meson.build | 48 +++++++++++++++++++++++++++++++++ lib/librte_acl/rte_acl.c | 42 +++++++++++++++++++++++++++++ lib/librte_acl/rte_acl.h | 2 ++ 6 files changed, 131 insertions(+), 1 deletion(-) create mode 100644 lib/librte_acl/acl_run_avx512.c diff --git a/config/x86/meson.build b/config/x86/meson.build index fea4d54035..724e69f4c4 100644 --- a/config/x86/meson.build +++ b/config/x86/meson.build @@ -22,7 +22,8 @@ foreach f:base_flags endforeach optional_flags = ['AES', 'PCLMUL', - 'AVX', 'AVX2', 'AVX512F', + 'AVX', 'AVX2', + 'AVX512F', 'AVX512VL', 'AVX512CD', 'AVX512BW', 'RDRND', 'RDSEED'] foreach f:optional_flags if cc.get_define('__@0@__'.format(f), args: machine_args) == '1' diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h index 39d45a0c2b..543ce55659 100644 --- a/lib/librte_acl/acl.h +++ b/lib/librte_acl/acl.h @@ -201,6 +201,14 @@ int rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t num, uint32_t categories); +int +rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data, + uint32_t *results, uint32_t num, uint32_t categories); + +int +rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data, + uint32_t *results, uint32_t num, uint32_t categories); + int rte_acl_classify_neon(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t num, uint32_t categories); diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c new file mode 100644 index 0000000000..1817f88b29 --- /dev/null +++ b/lib/librte_acl/acl_run_avx512.c @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#include "acl_run_sse.h" + +int +rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data, + uint32_t *results, uint32_t num, uint32_t categories) +{ + if (num >= MAX_SEARCHES_SSE8) + return search_sse_8(ctx, data, results, num, categories); + if (num >= MAX_SEARCHES_SSE4) + return search_sse_4(ctx, data, results, num, categories); + + return rte_acl_classify_scalar(ctx, data, results, num, categories); +} + +int +rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data, + uint32_t *results, uint32_t num, uint32_t categories) +{ + if (num >= MAX_SEARCHES_SSE8) + return search_sse_8(ctx, data, results, num, categories); + if (num >= MAX_SEARCHES_SSE4) + return search_sse_4(ctx, data, results, num, categories); + + return rte_acl_classify_scalar(ctx, data, results, num, categories); +} diff --git a/lib/librte_acl/meson.build b/lib/librte_acl/meson.build index b31a3f798e..a3c7c398d0 100644 --- a/lib/librte_acl/meson.build +++ b/lib/librte_acl/meson.build @@ -27,6 +27,54 @@ if dpdk_conf.has('RTE_ARCH_X86') cflags += '-DCC_AVX2_SUPPORT' endif + # compile AVX512 version if: + # we are building 64-bit binary AND binutils can generate proper code + + if dpdk_conf.has('RTE_ARCH_X86_64') and 
binutils_ok.returncode() == 0 + + # compile AVX512 version if either: + # a. we have AVX512 supported in minimum instruction set + # baseline + # b. it's not minimum instruction set, but supported by + # compiler + # + # in former case, just add avx512 C file to files list + # in latter case, compile c file to static lib, using correct + # compiler flags, and then have the .o file from static lib + # linked into main lib. + + # check if all required flags already enabled (variant a). + acl_avx512_flags = ['__AVX512F__', '__AVX512VL__', + '__AVX512CD__', '__AVX512BW__'] + + acl_avx512_on = true + foreach f:acl_avx512_flags + + if cc.get_define(f, args: machine_args) == '' + acl_avx512_on = false + endif + endforeach + + if acl_avx512_on == true + + sources += files('acl_run_avx512.c') + cflags += '-DCC_AVX512_SUPPORT' + + elif cc.has_multi_arguments('-mavx512f', '-mavx512vl', + '-mavx512cd', '-mavx512bw') + + avx512_tmplib = static_library('avx512_tmp', + 'acl_run_avx512.c', + dependencies: static_rte_eal, + c_args: cflags + + ['-mavx512f', '-mavx512vl', + '-mavx512cd', '-mavx512bw']) + objs += avx512_tmplib.extract_objects( + 'acl_run_avx512.c') + cflags += '-DCC_AVX512_SUPPORT' + endif + endif + elif dpdk_conf.has('RTE_ARCH_ARM') or dpdk_conf.has('RTE_ARCH_ARM64') cflags += '-flax-vector-conversions' sources += files('acl_run_neon.c') diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c index 863549a38b..1154f35107 100644 --- a/lib/librte_acl/rte_acl.c +++ b/lib/librte_acl/rte_acl.c @@ -16,6 +16,32 @@ static struct rte_tailq_elem rte_acl_tailq = { }; EAL_REGISTER_TAILQ(rte_acl_tailq) +#ifndef CC_AVX512_SUPPORT +/* + * If the compiler doesn't support AVX512 instructions, + * then the dummy one would be used instead for AVX512 classify method. 
+ */ +int +rte_acl_classify_avx512x16(__rte_unused const struct rte_acl_ctx *ctx, + __rte_unused const uint8_t **data, + __rte_unused uint32_t *results, + __rte_unused uint32_t num, + __rte_unused uint32_t categories) +{ + return -ENOTSUP; +} + +int +rte_acl_classify_avx512x32(__rte_unused const struct rte_acl_ctx *ctx, + __rte_unused const uint8_t **data, + __rte_unused uint32_t *results, + __rte_unused uint32_t num, + __rte_unused uint32_t categories) +{ + return -ENOTSUP; +} +#endif + #ifndef CC_AVX2_SUPPORT /* * If the compiler doesn't support AVX2 instructions, @@ -77,6 +103,8 @@ static const rte_acl_classify_t classify_fns[] = { [RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2, [RTE_ACL_CLASSIFY_NEON] = rte_acl_classify_neon, [RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec, + [RTE_ACL_CLASSIFY_AVX512X16] = rte_acl_classify_avx512x16, + [RTE_ACL_CLASSIFY_AVX512X32] = rte_acl_classify_avx512x32, }; /* @@ -126,6 +154,18 @@ acl_check_alg_ppc(enum rte_acl_classify_alg alg) static int acl_check_alg_x86(enum rte_acl_classify_alg alg) { + if (alg == RTE_ACL_CLASSIFY_AVX512X16 || + alg == RTE_ACL_CLASSIFY_AVX512X32) { +#ifdef CC_AVX512_SUPPORT + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) && + rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) && + rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512CD) && + rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW)) + return 0; +#endif + return -ENOTSUP; + } + if (alg == RTE_ACL_CLASSIFY_AVX2) { #ifdef CC_AVX2_SUPPORT if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) @@ -159,6 +199,8 @@ acl_check_alg(enum rte_acl_classify_alg alg) return acl_check_alg_arm(alg); case RTE_ACL_CLASSIFY_ALTIVEC: return acl_check_alg_ppc(alg); + case RTE_ACL_CLASSIFY_AVX512X32: + case RTE_ACL_CLASSIFY_AVX512X16: case RTE_ACL_CLASSIFY_AVX2: case RTE_ACL_CLASSIFY_SSE: return acl_check_alg_x86(alg); diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h index 3999f15ded..1bfed00743 100644 --- a/lib/librte_acl/rte_acl.h +++ b/lib/librte_acl/rte_acl.h @@ -241,6 +241,8 @@ enum rte_acl_classify_alg { RTE_ACL_CLASSIFY_AVX2 = 3, /**< requires AVX2 support. */ RTE_ACL_CLASSIFY_NEON = 4, /**< requires NEON support. */ RTE_ACL_CLASSIFY_ALTIVEC = 5, /**< requires ALTIVEC support. */ + RTE_ACL_CLASSIFY_AVX512X16 = 6, /**< requires AVX512 support. */ + RTE_ACL_CLASSIFY_AVX512X32 = 7, /**< requires AVX512 support. 
*/ }; /** From patchwork Tue Oct 6 15:03:10 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ananyev, Konstantin" X-Patchwork-Id: 79791 X-Patchwork-Delegate: david.marchand@redhat.com Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id 1C229A04BB; Tue, 6 Oct 2020 17:11:20 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id C0A161BAE6; Tue, 6 Oct 2020 17:08:31 +0200 (CEST) Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id 934F01BAB5 for ; Tue, 6 Oct 2020 17:08:28 +0200 (CEST) IronPort-SDR: 3u8PEknonVrk3OFFQgVkwYGHtSNKfufWzH54UjsfdN80K9m5yz17OxHleVxux43o0duJtBGbb2 NtmQ8KLJpeGw== X-IronPort-AV: E=McAfee;i="6000,8403,9765"; a="143919630" X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="143919630" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Oct 2020 08:03:46 -0700 IronPort-SDR: +UMa7osX7JlMu/bNC6pl1ZswtJ9W8hQvK7cAps83cXR9LDHXxhDqDAo8QJnkE3jqbD24BI1CS3 ws+3xjUBXvcw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="518315428" Received: from sivswdev08.ir.intel.com ([10.237.217.47]) by fmsmga005.fm.intel.com with ESMTP; 06 Oct 2020 08:03:44 -0700 From: Konstantin Ananyev To: dev@dpdk.org Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev Date: Tue, 6 Oct 2020 16:03:10 +0100 Message-Id: <20201006150316.5776-9-konstantin.ananyev@intel.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com> References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com> Subject: [dpdk-dev] [PATCH v4 08/14] acl: introduce 256-bit width AVX512 classify implementation X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Introduce classify implementation that uses AVX512 specific ISA. rte_acl_classify_avx512x16() is able to process up to 16 flows in parallel. It uses 256-bit width registers/instructions only (to avoid frequency level change). Note that for now only 64-bit version is supported. Signed-off-by: Konstantin Ananyev --- .../prog_guide/packet_classif_access_ctrl.rst | 4 + doc/guides/rel_notes/release_20_11.rst | 5 + lib/librte_acl/acl.h | 7 + lib/librte_acl/acl_gen.c | 2 +- lib/librte_acl/acl_run_avx512.c | 129 ++++ lib/librte_acl/acl_run_avx512x8.h | 642 ++++++++++++++++++ 6 files changed, 788 insertions(+), 1 deletion(-) create mode 100644 lib/librte_acl/acl_run_avx512x8.h diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst index daf03e6d7a..11f4bc841b 100644 --- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst +++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst @@ -379,6 +379,10 @@ There are several implementations of classify algorithm: * **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8 flows in parallel. Requires ALTIVEC support. 
+* **RTE_ACL_CLASSIFY_AVX512X16**: vector implementation, can process up to 16 + flows in parallel. Uses 256-bit width SIMD registers. + Requires AVX512 support. + It is purely a runtime decision which method to choose, there is no build-time difference. All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel. At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation. diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst index e0de60c0c2..95d7bfd777 100644 --- a/doc/guides/rel_notes/release_20_11.rst +++ b/doc/guides/rel_notes/release_20_11.rst @@ -107,6 +107,11 @@ New Features * Extern objects and functions can be plugged into the pipeline. * Transaction-oriented table updates. +* **Add new AVX512 specific classify algorithms for ACL library.** + + * Added new ``RTE_ACL_CLASSIFY_AVX512X16`` vector implementation, + which can process up to 16 flows in parallel. Requires AVX512 support. + Removed Items ------------- diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h index 543ce55659..7ac0d12f08 100644 --- a/lib/librte_acl/acl.h +++ b/lib/librte_acl/acl.h @@ -76,6 +76,13 @@ struct rte_acl_bitset { * input_byte - ((uint8_t *)&transition)[4 + input_byte / 64]. */ +/* + * Each ACL RT contains an idle nomatch node: + * a SINGLE node at predefined position (RTE_ACL_DFA_SIZE) + * that points to itself. + */ +#define RTE_ACL_IDLE_NODE (RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE) + /* * Structure of a node is a set of ptrs and each ptr has a bit map * of values associated with this transition. 
diff --git a/lib/librte_acl/acl_gen.c b/lib/librte_acl/acl_gen.c index f1b9d12f1e..e759a2ca15 100644 --- a/lib/librte_acl/acl_gen.c +++ b/lib/librte_acl/acl_gen.c @@ -496,7 +496,7 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie, * highest index, that points to itself) */ - node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE; + node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_IDLE_NODE; for (n = 0; n < RTE_ACL_DFA_SIZE; n++) node_array[n] = no_match; diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c index 1817f88b29..f5bc628b7c 100644 --- a/lib/librte_acl/acl_run_avx512.c +++ b/lib/librte_acl/acl_run_avx512.c @@ -4,10 +4,126 @@ #include "acl_run_sse.h" +/*sizeof(uint32_t) << match_log == sizeof(struct rte_acl_match_results)*/ +static const uint32_t match_log = 5; + +struct acl_flow_avx512 { + uint32_t num_packets; /* number of packets processed */ + uint32_t total_packets; /* max number of packets to process */ + uint32_t root_index; /* current root index */ + const uint64_t *trans; /* transition table */ + const uint32_t *data_index; /* input data indexes */ + const uint8_t **idata; /* input data */ + uint32_t *matches; /* match indexes */ +}; + +static inline void +acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx, + uint32_t trie, const uint8_t *data[], uint32_t *matches, + uint32_t total_packets) +{ + flow->num_packets = 0; + flow->total_packets = total_packets; + flow->root_index = ctx->trie[trie].root_index; + flow->trans = ctx->trans_table; + flow->data_index = ctx->trie[trie].data_index; + flow->idata = data; + flow->matches = matches; +} + +/* + * Update flow and result masks based on the number of unprocessed flows. + */ +static inline uint32_t +update_flow_mask(const struct acl_flow_avx512 *flow, uint32_t *fmsk, + uint32_t *rmsk) +{ + uint32_t i, j, k, m, n; + + fmsk[0] ^= rmsk[0]; + m = rmsk[0]; + + k = __builtin_popcount(m); + n = flow->total_packets - flow->num_packets; + + if (n < k) { + /* reduce mask */ + for (i = k - n; i != 0; i--) { + j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m); + m ^= 1 << j; + } + } else + n = k; + + rmsk[0] = m; + fmsk[0] |= rmsk[0]; + + return n; +} + +/* + * Resolve matches for multiple categories (LE 8, use 128b instuctions/regs) + */ +static inline void +resolve_mcle8_avx512x1(uint32_t result[], + const struct rte_acl_match_results pr[], const uint32_t match[], + uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie) +{ + const int32_t *pri; + const uint32_t *pm, *res; + uint32_t i, j, k, mi, mn; + __mmask8 msk; + xmm_t cp, cr, np, nr; + + res = pr->results; + pri = pr->priority; + + for (k = 0; k != nb_pkt; k++, result += nb_cat) { + + mi = match[k] << match_log; + + for (j = 0; j != nb_cat; j += RTE_ACL_RESULTS_MULTIPLIER) { + + cr = _mm_loadu_si128((const xmm_t *)(res + mi + j)); + cp = _mm_loadu_si128((const xmm_t *)(pri + mi + j)); + + for (i = 1, pm = match + nb_pkt; i != nb_trie; + i++, pm += nb_pkt) { + + mn = j + (pm[k] << match_log); + + nr = _mm_loadu_si128((const xmm_t *)(res + mn)); + np = _mm_loadu_si128((const xmm_t *)(pri + mn)); + + msk = _mm_cmpgt_epi32_mask(cp, np); + cr = _mm_mask_mov_epi32(nr, msk, cr); + cp = _mm_mask_mov_epi32(np, msk, cp); + } + + _mm_storeu_si128((xmm_t *)(result + j), cr); + } + } +} + +#include "acl_run_avx512x8.h" + int rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t num, uint32_t categories) { + const uint32_t max_iter = MAX_SEARCHES_AVX16 * 
MAX_SEARCHES_AVX16; + + /* split huge lookup (gt 256) into series of fixed size ones */ + while (num > max_iter) { + search_avx512x8x2(ctx, data, results, max_iter, categories); + data += max_iter; + results += max_iter * categories; + num -= max_iter; + } + + /* select classify method based on number of remaining requests */ + if (num >= MAX_SEARCHES_AVX16) + return search_avx512x8x2(ctx, data, results, num, categories); if (num >= MAX_SEARCHES_SSE8) return search_sse_8(ctx, data, results, num, categories); if (num >= MAX_SEARCHES_SSE4) @@ -20,6 +136,19 @@ int rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t num, uint32_t categories) { + const uint32_t max_iter = MAX_SEARCHES_AVX16 * MAX_SEARCHES_AVX16; + + /* split huge lookup (gt 256) into series of fixed size ones */ + while (num > max_iter) { + search_avx512x8x2(ctx, data, results, max_iter, categories); + data += max_iter; + results += max_iter * categories; + num -= max_iter; + } + + /* select classify method based on number of remaining requests */ + if (num >= MAX_SEARCHES_AVX16) + return search_avx512x8x2(ctx, data, results, num, categories); if (num >= MAX_SEARCHES_SSE8) return search_sse_8(ctx, data, results, num, categories); if (num >= MAX_SEARCHES_SSE4) diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h new file mode 100644 index 0000000000..cfba0299ed --- /dev/null +++ b/lib/librte_acl/acl_run_avx512x8.h @@ -0,0 +1,642 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#define MASK8_BIT (sizeof(__mmask8) * CHAR_BIT) + +#define NUM_AVX512X8X2 (2 * MASK8_BIT) +#define MSK_AVX512X8X2 (NUM_AVX512X8X2 - 1) + +/* num/mask of pointers per SIMD regs */ +#define YMM_PTR_NUM (sizeof(__m256i) / sizeof(uintptr_t)) +#define YMM_PTR_MSK RTE_LEN2MASK(YMM_PTR_NUM, uint32_t) + +static const rte_ymm_t ymm_match_mask = { + .u32 = { + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + }, +}; + +static const rte_ymm_t ymm_index_mask = { + .u32 = { + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + }, +}; + +static const rte_ymm_t ymm_trlo_idle = { + .u32 = { + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + }, +}; + +static const rte_ymm_t ymm_trhi_idle = { + .u32 = { + 0, 0, 0, 0, + 0, 0, 0, 0, + }, +}; + +static const rte_ymm_t ymm_shuffle_input = { + .u32 = { + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + }, +}; + +static const rte_ymm_t ymm_four_32 = { + .u32 = { + 4, 4, 4, 4, + 4, 4, 4, 4, + }, +}; + +static const rte_ymm_t ymm_idx_add = { + .u32 = { + 0, 1, 2, 3, + 4, 5, 6, 7, + }, +}; + +static const rte_ymm_t ymm_range_base = { + .u32 = { + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + }, +}; + +static const rte_ymm_t ymm_pminp = { + .u32 = { + 0x00, 0x01, 0x02, 0x03, + 0x08, 0x09, 0x0a, 0x0b, + }, +}; + +static const __mmask16 ymm_pmidx_msk = 0x55; + +static const rte_ymm_t ymm_pmidx[2] = { + [0] = { + .u32 = { + 0, 0, 1, 0, 2, 0, 3, 0, + }, + }, + [1] = { + .u32 = { + 4, 0, 5, 0, 6, 0, 7, 0, + }, + }, +}; + +/* + * unfortunately 
current AVX512 ISA doesn't provide ability for + * gather load on a byte quantity. So we have to mimic it in SW, + * by doing 4x1B scalar loads. + */ +static inline __m128i +_m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask) +{ + rte_xmm_t v; + rte_ymm_t p; + + static const uint32_t zero; + + p.y = _mm256_mask_set1_epi64(pdata, mask ^ YMM_PTR_MSK, + (uintptr_t)&zero); + + v.u32[0] = *(uint8_t *)p.u64[0]; + v.u32[1] = *(uint8_t *)p.u64[1]; + v.u32[2] = *(uint8_t *)p.u64[2]; + v.u32[3] = *(uint8_t *)p.u64[3]; + + return v.x; +} + +/* + * Calculate the address of the next transition for + * all types of nodes. Note that only DFA nodes and range + * nodes actually transition to another node. Match + * nodes not supposed to be encountered here. + * For quad range nodes: + * Calculate number of range boundaries that are less than the + * input value. Range boundaries for each node are in signed 8 bit, + * ordered from -128 to 127. + * This is effectively a popcnt of bytes that are greater than the + * input byte. + * Single nodes are processed in the same ways as quad range nodes. + */ +static __rte_always_inline __m256i +calc_addr8(__m256i index_mask, __m256i next_input, __m256i shuffle_input, + __m256i four_32, __m256i range_base, __m256i tr_lo, __m256i tr_hi) +{ + __mmask32 qm; + __mmask8 dfa_msk; + __m256i addr, in, node_type, r, t; + __m256i dfa_ofs, quad_ofs; + + t = _mm256_xor_si256(index_mask, index_mask); + in = _mm256_shuffle_epi8(next_input, shuffle_input); + + /* Calc node type and node addr */ + node_type = _mm256_andnot_si256(index_mask, tr_lo); + addr = _mm256_and_si256(index_mask, tr_lo); + + /* mask for DFA type(0) nodes */ + dfa_msk = _mm256_cmpeq_epi32_mask(node_type, t); + + /* DFA calculations. */ + r = _mm256_srli_epi32(in, 30); + r = _mm256_add_epi8(r, range_base); + t = _mm256_srli_epi32(in, 24); + r = _mm256_shuffle_epi8(tr_hi, r); + + dfa_ofs = _mm256_sub_epi32(t, r); + + /* QUAD/SINGLE calculations. */ + qm = _mm256_cmpgt_epi8_mask(in, tr_hi); + t = _mm256_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX); + t = _mm256_lzcnt_epi32(t); + t = _mm256_srli_epi32(t, 3); + quad_ofs = _mm256_sub_epi32(four_32, t); + + /* blend DFA and QUAD/SINGLE. */ + t = _mm256_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs); + + /* calculate address for next transitions. */ + addr = _mm256_add_epi32(addr, t); + return addr; +} + +/* + * Process 16 transitions in parallel. + * tr_lo contains low 32 bits for 16 transition. + * tr_hi contains high 32 bits for 16 transition. + * next_input contains up to 4 input bytes for 16 flows. + */ +static __rte_always_inline __m256i +transition8(__m256i next_input, const uint64_t *trans, __m256i *tr_lo, + __m256i *tr_hi) +{ + const int32_t *tr; + __m256i addr; + + tr = (const int32_t *)(uintptr_t)trans; + + /* Calculate the address (array index) for all 8 transitions. */ + addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y, + ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi); + + /* load lower 32 bits of 16 transactions at once. */ + *tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0])); + + next_input = _mm256_srli_epi32(next_input, CHAR_BIT); + + /* load high 32 bits of 16 transactions at once. */ + *tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0])); + + return next_input; +} + +/* + * Execute first transition for up to 16 flows in parallel. + * next_input should contain one input byte for up to 16 flows. + * msk - mask of active flows. + * tr_lo contains low 32 bits for up to 16 transitions. 
+ * tr_hi contains high 32 bits for up to 16 transitions. + */ +static __rte_always_inline void +first_trans8(const struct acl_flow_avx512 *flow, __m256i next_input, + __mmask8 msk, __m256i *tr_lo, __m256i *tr_hi) +{ + const int32_t *tr; + __m256i addr, root; + + tr = (const int32_t *)(uintptr_t)flow->trans; + + addr = _mm256_set1_epi32(UINT8_MAX); + root = _mm256_set1_epi32(flow->root_index); + + addr = _mm256_and_si256(next_input, addr); + addr = _mm256_add_epi32(root, addr); + + /* load lower 32 bits of 16 transactions at once. */ + *tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr, + sizeof(flow->trans[0])); + + /* load high 32 bits of 16 transactions at once. */ + *tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1), + sizeof(flow->trans[0])); +} + +/* + * Load and return next 4 input bytes for up to 16 flows in parallel. + * pdata - 8x2 pointers to flow input data + * mask - mask of active flows. + * di - data indexes for these 16 flows. + */ +static inline __m256i +get_next_bytes_avx512x8(const struct acl_flow_avx512 *flow, __m256i pdata[2], + uint32_t msk, __m256i *di, uint32_t bnum) +{ + const int32_t *div; + uint32_t m[2]; + __m256i one, zero, t, p[2]; + __m128i inp[2]; + + div = (const int32_t *)flow->data_index; + + one = _mm256_set1_epi32(1); + zero = _mm256_xor_si256(one, one); + + /* load data offsets for given indexes */ + t = _mm256_mmask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0])); + + /* increment data indexes */ + *di = _mm256_mask_add_epi32(*di, msk, *di, one); + + /* + * unsigned expand 32-bit indexes to 64-bit + * (for later pointer arithmetic), i.e: + * for (i = 0; i != 16; i++) + * p[i/8].u64[i%8] = (uint64_t)t.u32[i]; + */ + p[0] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[0].y, t); + p[1] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[1].y, t); + + p[0] = _mm256_add_epi64(p[0], pdata[0]); + p[1] = _mm256_add_epi64(p[1], pdata[1]); + + /* load input byte(s), either one or four */ + + m[0] = msk & YMM_PTR_MSK; + m[1] = msk >> YMM_PTR_NUM; + + if (bnum == sizeof(uint8_t)) { + inp[0] = _m256_mask_gather_epi8x4(p[0], m[0]); + inp[1] = _m256_mask_gather_epi8x4(p[1], m[1]); + } else { + inp[0] = _mm256_mmask_i64gather_epi32( + _mm256_castsi256_si128(zero), m[0], p[0], + NULL, sizeof(uint8_t)); + inp[1] = _mm256_mmask_i64gather_epi32( + _mm256_castsi256_si128(zero), m[1], p[1], + NULL, sizeof(uint8_t)); + } + + /* squeeze input into one 512-bit register */ + return _mm256_permutex2var_epi32(_mm256_castsi128_si256(inp[0]), + ymm_pminp.y, _mm256_castsi128_si256(inp[1])); +} + +/* + * Start up to 16 new flows. + * num - number of flows to start + * msk - mask of new flows. + * pdata - pointers to flow input data + * idx - match indexed for given flows + * di - data indexes for these flows. 
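+ * Note: num is expected to equal the number of bits set in msk.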
+ */ +static inline void +start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk, + __m256i pdata[2], __m256i *idx, __m256i *di) +{ + uint32_t n, m[2], nm[2]; + __m256i ni, nd[2]; + + m[0] = msk & YMM_PTR_MSK; + m[1] = msk >> YMM_PTR_NUM; + + n = __builtin_popcount(m[0]); + nm[0] = (1 << n) - 1; + nm[1] = (1 << (num - n)) - 1; + + /* load input data pointers for new flows */ + nd[0] = _mm256_maskz_loadu_epi64(nm[0], + flow->idata + flow->num_packets); + nd[1] = _mm256_maskz_loadu_epi64(nm[1], + flow->idata + flow->num_packets + n); + + /* calculate match indexes of new flows */ + ni = _mm256_set1_epi32(flow->num_packets); + ni = _mm256_add_epi32(ni, ymm_idx_add.y); + + /* merge new and existing flows data */ + pdata[0] = _mm256_mask_expand_epi64(pdata[0], m[0], nd[0]); + pdata[1] = _mm256_mask_expand_epi64(pdata[1], m[1], nd[1]); + + /* update match and data indexes */ + *idx = _mm256_mask_expand_epi32(*idx, msk, ni); + *di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di); + + flow->num_packets += num; +} + +/* + * Process found matches for up to 16 flows. + * fmsk - mask of active flows + * rmsk - mask of found matches + * pdata - pointers to flow input data + * di - data indexes for these flows + * idx - match indexed for given flows + * tr_lo contains low 32 bits for up to 8 transitions. + * tr_hi contains high 32 bits for up to 8 transitions. + */ +static inline uint32_t +match_process_avx512x8(struct acl_flow_avx512 *flow, uint32_t *fmsk, + uint32_t *rmsk, __m256i pdata[2], __m256i *di, __m256i *idx, + __m256i *tr_lo, __m256i *tr_hi) +{ + uint32_t n; + __m256i res; + + if (rmsk[0] == 0) + return 0; + + /* extract match indexes */ + res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y); + + /* mask matched transitions to nop */ + tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y); + tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y); + + /* save found match indexes */ + _mm256_mask_i32scatter_epi32(flow->matches, rmsk[0], + idx[0], res, sizeof(flow->matches[0])); + + /* update masks and start new flows for matches */ + n = update_flow_mask(flow, fmsk, rmsk); + start_flow8(flow, n, rmsk[0], pdata, idx, di); + + return n; +} + +/* + * Test for matches ut to 32 (2x16) flows at once, + * if matches exist - process them and start new flows. 
+ */ +static inline void +match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2], + __m256i pdata[4], __m256i di[2], __m256i idx[2], __m256i inp[2], + __m256i tr_lo[2], __m256i tr_hi[2]) +{ + uint32_t n[2]; + uint32_t rm[2]; + + /* check for matches */ + rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y); + rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y); + + /* till unprocessed matches exist */ + while ((rm[0] | rm[1]) != 0) { + + /* process matches and start new flows */ + n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0], + &di[0], &idx[0], &tr_lo[0], &tr_hi[0]); + n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[2], + &di[1], &idx[1], &tr_lo[1], &tr_hi[1]); + + /* execute first transition for new flows, if any */ + + if (n[0] != 0) { + inp[0] = get_next_bytes_avx512x8(flow, &pdata[0], + rm[0], &di[0], sizeof(uint8_t)); + first_trans8(flow, inp[0], rm[0], &tr_lo[0], + &tr_hi[0]); + rm[0] = _mm256_test_epi32_mask(tr_lo[0], + ymm_match_mask.y); + } + + if (n[1] != 0) { + inp[1] = get_next_bytes_avx512x8(flow, &pdata[2], + rm[1], &di[1], sizeof(uint8_t)); + first_trans8(flow, inp[1], rm[1], &tr_lo[1], + &tr_hi[1]); + rm[1] = _mm256_test_epi32_mask(tr_lo[1], + ymm_match_mask.y); + } + } +} + +/* + * Perform search for up to 32 flows in parallel. + * Use two sets of metadata, each serves 16 flows max. + * So in fact we perform search for 2x16 flows. + */ +static inline void +search_trie_avx512x8x2(struct acl_flow_avx512 *flow) +{ + uint32_t fm[2]; + __m256i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2]; + + /* first 1B load */ + start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]); + start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[2], &idx[1], &di[1]); + + in[0] = get_next_bytes_avx512x8(flow, &pdata[0], UINT8_MAX, &di[0], + sizeof(uint8_t)); + in[1] = get_next_bytes_avx512x8(flow, &pdata[2], UINT8_MAX, &di[1], + sizeof(uint8_t)); + + first_trans8(flow, in[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]); + first_trans8(flow, in[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]); + + fm[0] = UINT8_MAX; + fm[1] = UINT8_MAX; + + /* match check */ + match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in, + tr_lo, tr_hi); + + while ((fm[0] | fm[1]) != 0) { + + /* load next 4B */ + + in[0] = get_next_bytes_avx512x8(flow, &pdata[0], fm[0], + &di[0], sizeof(uint32_t)); + in[1] = get_next_bytes_avx512x8(flow, &pdata[2], fm[1], + &di[1], sizeof(uint32_t)); + + /* main 4B loop */ + + in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + /* check for matches */ + match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in, + tr_lo, tr_hi); + } +} + +/* + * resolve match index to actual result/priority offset. + */ +static inline __m256i +resolve_match_idx_avx512x8(__m256i mi) +{ + RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) != + 1 << (match_log + 2)); + return _mm256_slli_epi32(mi, match_log); +} + +/* + * Resolve multiple matches for the same flow based on priority. 
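+ * When more than one trie reports a match for the same flow, the result with the greater priority value is kept.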
+ */ +static inline __m256i +resolve_pri_avx512x8(const int32_t res[], const int32_t pri[], + const uint32_t match[], __mmask8 msk, uint32_t nb_trie, + uint32_t nb_skip) +{ + uint32_t i; + const uint32_t *pm; + __mmask16 m; + __m256i cp, cr, np, nr, mch; + + const __m256i zero = _mm256_set1_epi32(0); + + /* get match indexes */ + mch = _mm256_maskz_loadu_epi32(msk, match); + mch = resolve_match_idx_avx512x8(mch); + + /* read result and priority values for first trie */ + cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0])); + cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0])); + + /* + * read result and priority values for next tries and select one + * with highest priority. + */ + for (i = 1, pm = match + nb_skip; i != nb_trie; + i++, pm += nb_skip) { + + mch = _mm256_maskz_loadu_epi32(msk, pm); + mch = resolve_match_idx_avx512x8(mch); + + nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, + sizeof(res[0])); + np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, + sizeof(pri[0])); + + m = _mm256_cmpgt_epi32_mask(cp, np); + cr = _mm256_mask_mov_epi32(nr, m, cr); + cp = _mm256_mask_mov_epi32(np, m, cp); + } + + return cr; +} + +/* + * Resolve num (<= 8) matches for single category + */ +static inline void +resolve_sc_avx512x8(uint32_t result[], const int32_t res[], + const int32_t pri[], const uint32_t match[], uint32_t nb_pkt, + uint32_t nb_trie, uint32_t nb_skip) +{ + __mmask8 msk; + __m256i cr; + + msk = (1 << nb_pkt) - 1; + cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip); + _mm256_mask_storeu_epi32(result, msk, cr); +} + +/* + * Resolve matches for single category + */ +static inline void +resolve_sc_avx512x8x2(uint32_t result[], + const struct rte_acl_match_results pr[], const uint32_t match[], + uint32_t nb_pkt, uint32_t nb_trie) +{ + uint32_t j, k, n; + const int32_t *res, *pri; + __m256i cr[2]; + + res = (const int32_t *)pr->results; + pri = pr->priority; + + for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) { + + j = k + MASK8_BIT; + + cr[0] = resolve_pri_avx512x8(res, pri, match + k, UINT8_MAX, + nb_trie, nb_pkt); + cr[1] = resolve_pri_avx512x8(res, pri, match + j, UINT8_MAX, + nb_trie, nb_pkt); + + _mm256_storeu_si256((void *)(result + k), cr[0]); + _mm256_storeu_si256((void *)(result + j), cr[1]); + } + + n = nb_pkt - k; + if (n != 0) { + if (n > MASK8_BIT) { + resolve_sc_avx512x8(result + k, res, pri, match + k, + MASK8_BIT, nb_trie, nb_pkt); + k += MASK8_BIT; + n -= MASK8_BIT; + } + resolve_sc_avx512x8(result + k, res, pri, match + k, n, + nb_trie, nb_pkt); + } +} + +static inline int +search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data, + uint32_t *results, uint32_t total_packets, uint32_t categories) +{ + uint32_t i, *pm; + const struct rte_acl_match_results *pr; + struct acl_flow_avx512 flow; + uint32_t match[ctx->num_tries * total_packets]; + + for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) { + + /* setup for next trie */ + acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets); + + /* process the trie */ + search_trie_avx512x8x2(&flow); + } + + /* resolve matches */ + pr = (const struct rte_acl_match_results *) + (ctx->trans_table + ctx->match_index); + + if (categories == 1) + resolve_sc_avx512x8x2(results, pr, match, total_packets, + ctx->num_tries); + else + resolve_mcle8_avx512x1(results, pr, match, total_packets, + categories, ctx->num_tries); + + return 0; +} From patchwork Tue Oct 6 15:03:11 2020 Content-Type: text/plain; charset="utf-8" 
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 79792
X-Patchwork-Delegate: david.marchand@redhat.com
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev
Date: Tue, 6 Oct 2020 16:03:11 +0100
Message-Id: <20201006150316.5776-10-konstantin.ananyev@intel.com>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com>
References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH v4 09/14] acl: update default classify algorithm selection
List-Id: DPDK patches and discussions
Sender: "dev"

On supported platforms, set RTE_ACL_CLASSIFY_AVX512X16 as the default ACL classify algorithm. Note that the AVX512X16 implementation uses 256-bit registers/intrinsics only, to avoid the possibility of a frequency drop.
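For illustration only (editor's note, not part of the patch): even with a new default, an application can still override the per-context classify method chosen at startup. The sketch below uses the existing rte_acl_set_ctx_classify() API and assumes it reports an error for methods the running platform cannot use; the helper name is hypothetical.

#include <rte_acl.h>

/* hypothetical helper: request a specific classify method, fall back
 * to the library-selected default if the request is rejected. */
static int
app_set_acl_alg(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
{
	int rc;

	rc = rte_acl_set_ctx_classify(ctx, alg);
	if (rc != 0)
		rc = rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_DEFAULT);
	return rc;
}

/* e.g. app_set_acl_alg(ctx, RTE_ACL_CLASSIFY_AVX512X16); */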
Signed-off-by: Konstantin Ananyev
---
 lib/librte_acl/rte_acl.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 1154f35107..245af672ee 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -228,6 +228,7 @@ acl_get_best_alg(void)
 #elif defined(RTE_ARCH_PPC_64)
 		RTE_ACL_CLASSIFY_ALTIVEC,
 #elif defined(RTE_ARCH_X86)
+		RTE_ACL_CLASSIFY_AVX512X16,
 		RTE_ACL_CLASSIFY_AVX2,
 		RTE_ACL_CLASSIFY_SSE,
 #endif
From patchwork Tue Oct 6 15:03:12 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 79793
X-Patchwork-Delegate: david.marchand@redhat.com
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev
Date: Tue, 6 Oct 2020 16:03:12 +0100
Message-Id: <20201006150316.5776-11-konstantin.ananyev@intel.com>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com>
References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH v4 10/14] acl: introduce 512-bit width AVX512 classify implementation
List-Id: DPDK patches and discussions
Sender: "dev"

Introduce a classify implementation that uses AVX512-specific ISA. rte_acl_classify_avx512x32() is able to process up to 32 flows in parallel. It uses 512-bit width registers/instructions and provides higher performance than rte_acl_classify_avx512x16(), but can cause a frequency level change. Note that for now only the 64-bit version is supported.
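Usage note (editor's addition, not part of the patch): as the documentation change below points out, this 512-bit method is not selected by default, so an application has to request it explicitly, either per context or per call. A minimal per-call sketch using the existing rte_acl_classify_alg() API, assuming the caller has already verified AVX512 support on the running platform; the wrapper name is hypothetical.

#include <rte_acl.h>

/* hypothetical wrapper: classify one burst with the 512-bit method
 * without changing the context's default algorithm. */
static int
app_classify_x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return rte_acl_classify_alg(ctx, data, results, num, categories,
			RTE_ACL_CLASSIFY_AVX512X32);
}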
Signed-off-by: Konstantin Ananyev --- Depends-on: patch-79310 ("eal/x86: introduce AVX 512-bit type") .../prog_guide/packet_classif_access_ctrl.rst | 10 + doc/guides/rel_notes/release_20_11.rst | 3 + lib/librte_acl/acl_run_avx512.c | 6 +- lib/librte_acl/acl_run_avx512x16.h | 732 ++++++++++++++++++ 4 files changed, 750 insertions(+), 1 deletion(-) create mode 100644 lib/librte_acl/acl_run_avx512x16.h diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst index 11f4bc841b..7659af8eb5 100644 --- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst +++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst @@ -383,10 +383,20 @@ There are several implementations of classify algorithm: flows in parallel. Uses 256-bit width SIMD registers. Requires AVX512 support. +* **RTE_ACL_CLASSIFY_AVX512X32**: vector implementation, can process up to 32 + flows in parallel. Uses 512-bit width SIMD registers. + Requires AVX512 support. + It is purely a runtime decision which method to choose, there is no build-time difference. All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel. At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation. +.. note:: + + Right now ``RTE_ACL_CLASSIFY_AVX512X32`` is not selected by default + (due to possible frequency level change), but it can be selected at + runtime by apps through the use of ACL API: ``rte_acl_set_ctx_classify``. + Application Programming Interface (API) Usage --------------------------------------------- diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst index 95d7bfd777..001e46f595 100644 --- a/doc/guides/rel_notes/release_20_11.rst +++ b/doc/guides/rel_notes/release_20_11.rst @@ -112,6 +112,9 @@ New Features * Added new ``RTE_ACL_CLASSIFY_AVX512X16`` vector implementation, which can process up to 16 flows in parallel. Requires AVX512 support. + * Added new ``RTE_ACL_CLASSIFY_AVX512X32`` vector implementation, + which can process up to 32 flows in parallel. Requires AVX512 support. 
+ Removed Items ------------- diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c index f5bc628b7c..74698fa2ea 100644 --- a/lib/librte_acl/acl_run_avx512.c +++ b/lib/librte_acl/acl_run_avx512.c @@ -132,6 +132,8 @@ rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data, return rte_acl_classify_scalar(ctx, data, results, num, categories); } +#include "acl_run_avx512x16.h" + int rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t num, uint32_t categories) @@ -140,13 +142,15 @@ rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data, /* split huge lookup (gt 256) into series of fixed size ones */ while (num > max_iter) { - search_avx512x8x2(ctx, data, results, max_iter, categories); + search_avx512x16x2(ctx, data, results, max_iter, categories); data += max_iter; results += max_iter * categories; num -= max_iter; } /* select classify method based on number of remaining requests */ + if (num >= 2 * MAX_SEARCHES_AVX16) + return search_avx512x16x2(ctx, data, results, num, categories); if (num >= MAX_SEARCHES_AVX16) return search_avx512x8x2(ctx, data, results, num, categories); if (num >= MAX_SEARCHES_SSE8) diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h new file mode 100644 index 0000000000..981f8d16da --- /dev/null +++ b/lib/librte_acl/acl_run_avx512x16.h @@ -0,0 +1,732 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#define MASK16_BIT (sizeof(__mmask16) * CHAR_BIT) + +#define NUM_AVX512X16X2 (2 * MASK16_BIT) +#define MSK_AVX512X16X2 (NUM_AVX512X16X2 - 1) + +/* num/mask of pointers per SIMD regs */ +#define ZMM_PTR_NUM (sizeof(__m512i) / sizeof(uintptr_t)) +#define ZMM_PTR_MSK RTE_LEN2MASK(ZMM_PTR_NUM, uint32_t) + +static const __rte_x86_zmm_t zmm_match_mask = { + .u32 = { + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + }, +}; + +static const __rte_x86_zmm_t zmm_index_mask = { + .u32 = { + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + }, +}; + +static const __rte_x86_zmm_t zmm_trlo_idle = { + .u32 = { + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + RTE_ACL_IDLE_NODE, + }, +}; + +static const __rte_x86_zmm_t zmm_trhi_idle = { + .u32 = { + 0, 0, 0, 0, + 0, 0, 0, 0, + 0, 0, 0, 0, + 0, 0, 0, 0, + }, +}; + +static const __rte_x86_zmm_t zmm_shuffle_input = { + .u32 = { + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + }, +}; + +static const __rte_x86_zmm_t zmm_four_32 = { + .u32 = 
{ + 4, 4, 4, 4, + 4, 4, 4, 4, + 4, 4, 4, 4, + 4, 4, 4, 4, + }, +}; + +static const __rte_x86_zmm_t zmm_idx_add = { + .u32 = { + 0, 1, 2, 3, + 4, 5, 6, 7, + 8, 9, 10, 11, + 12, 13, 14, 15, + }, +}; + +static const __rte_x86_zmm_t zmm_range_base = { + .u32 = { + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + }, +}; + +static const __rte_x86_zmm_t zmm_pminp = { + .u32 = { + 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, + 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, + }, +}; + +static const __mmask16 zmm_pmidx_msk = 0x5555; + +static const __rte_x86_zmm_t zmm_pmidx[2] = { + [0] = { + .u32 = { + 0, 0, 1, 0, 2, 0, 3, 0, + 4, 0, 5, 0, 6, 0, 7, 0, + }, + }, + [1] = { + .u32 = { + 8, 0, 9, 0, 10, 0, 11, 0, + 12, 0, 13, 0, 14, 0, 15, 0, + }, + }, +}; + +/* + * unfortunately current AVX512 ISA doesn't provide ability for + * gather load on a byte quantity. So we have to mimic it in SW, + * by doing 8x1B scalar loads. + */ +static inline ymm_t +_m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask) +{ + rte_ymm_t v; + __rte_x86_zmm_t p; + + static const uint32_t zero; + + p.z = _mm512_mask_set1_epi64(pdata, mask ^ ZMM_PTR_MSK, + (uintptr_t)&zero); + + v.u32[0] = *(uint8_t *)p.u64[0]; + v.u32[1] = *(uint8_t *)p.u64[1]; + v.u32[2] = *(uint8_t *)p.u64[2]; + v.u32[3] = *(uint8_t *)p.u64[3]; + v.u32[4] = *(uint8_t *)p.u64[4]; + v.u32[5] = *(uint8_t *)p.u64[5]; + v.u32[6] = *(uint8_t *)p.u64[6]; + v.u32[7] = *(uint8_t *)p.u64[7]; + + return v.y; +} + +/* + * Calculate the address of the next transition for + * all types of nodes. Note that only DFA nodes and range + * nodes actually transition to another node. Match + * nodes not supposed to be encountered here. + * For quad range nodes: + * Calculate number of range boundaries that are less than the + * input value. Range boundaries for each node are in signed 8 bit, + * ordered from -128 to 127. + * This is effectively a popcnt of bytes that are greater than the + * input byte. + * Single nodes are processed in the same ways as quad range nodes. + */ +static __rte_always_inline __m512i +calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input, + __m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi) +{ + __mmask64 qm; + __mmask16 dfa_msk; + __m512i addr, in, node_type, r, t; + __m512i dfa_ofs, quad_ofs; + + t = _mm512_xor_si512(index_mask, index_mask); + in = _mm512_shuffle_epi8(next_input, shuffle_input); + + /* Calc node type and node addr */ + node_type = _mm512_andnot_si512(index_mask, tr_lo); + addr = _mm512_and_si512(index_mask, tr_lo); + + /* mask for DFA type(0) nodes */ + dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t); + + /* DFA calculations. */ + r = _mm512_srli_epi32(in, 30); + r = _mm512_add_epi8(r, range_base); + t = _mm512_srli_epi32(in, 24); + r = _mm512_shuffle_epi8(tr_hi, r); + + dfa_ofs = _mm512_sub_epi32(t, r); + + /* QUAD/SINGLE calculations. */ + qm = _mm512_cmpgt_epi8_mask(in, tr_hi); + t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX); + t = _mm512_lzcnt_epi32(t); + t = _mm512_srli_epi32(t, 3); + quad_ofs = _mm512_sub_epi32(four_32, t); + + /* blend DFA and QUAD/SINGLE. */ + t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs); + + /* calculate address for next transitions. */ + addr = _mm512_add_epi32(addr, t); + return addr; +} + +/* + * Process 16 transitions in parallel. + * tr_lo contains low 32 bits for 16 transition. 
+ * tr_hi contains high 32 bits for 16 transition. + * next_input contains up to 4 input bytes for 16 flows. + */ +static __rte_always_inline __m512i +transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo, + __m512i *tr_hi) +{ + const int32_t *tr; + __m512i addr; + + tr = (const int32_t *)(uintptr_t)trans; + + /* Calculate the address (array index) for all 16 transitions. */ + addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z, + zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi); + + /* load lower 32 bits of 16 transactions at once. */ + *tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0])); + + next_input = _mm512_srli_epi32(next_input, CHAR_BIT); + + /* load high 32 bits of 16 transactions at once. */ + *tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0])); + + return next_input; +} + +/* + * Execute first transition for up to 16 flows in parallel. + * next_input should contain one input byte for up to 16 flows. + * msk - mask of active flows. + * tr_lo contains low 32 bits for up to 16 transitions. + * tr_hi contains high 32 bits for up to 16 transitions. + */ +static __rte_always_inline void +first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input, + __mmask16 msk, __m512i *tr_lo, __m512i *tr_hi) +{ + const int32_t *tr; + __m512i addr, root; + + tr = (const int32_t *)(uintptr_t)flow->trans; + + addr = _mm512_set1_epi32(UINT8_MAX); + root = _mm512_set1_epi32(flow->root_index); + + addr = _mm512_and_si512(next_input, addr); + addr = _mm512_add_epi32(root, addr); + + /* load lower 32 bits of 16 transactions at once. */ + *tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr, + sizeof(flow->trans[0])); + + /* load high 32 bits of 16 transactions at once. */ + *tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1), + sizeof(flow->trans[0])); +} + +/* + * Load and return next 4 input bytes for up to 16 flows in parallel. + * pdata - 8x2 pointers to flow input data + * mask - mask of active flows. + * di - data indexes for these 16 flows. 
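+ * bnum - number of input bytes to load per flow (either one or four).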
+ */ +static inline __m512i +get_next_bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2], + uint32_t msk, __m512i *di, uint32_t bnum) +{ + const int32_t *div; + uint32_t m[2]; + __m512i one, zero, t, p[2]; + ymm_t inp[2]; + + div = (const int32_t *)flow->data_index; + + one = _mm512_set1_epi32(1); + zero = _mm512_xor_si512(one, one); + + /* load data offsets for given indexes */ + t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0])); + + /* increment data indexes */ + *di = _mm512_mask_add_epi32(*di, msk, *di, one); + + /* + * unsigned expand 32-bit indexes to 64-bit + * (for later pointer arithmetic), i.e: + * for (i = 0; i != 16; i++) + * p[i/8].u64[i%8] = (uint64_t)t.u32[i]; + */ + p[0] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[0].z, t); + p[1] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[1].z, t); + + p[0] = _mm512_add_epi64(p[0], pdata[0]); + p[1] = _mm512_add_epi64(p[1], pdata[1]); + + /* load input byte(s), either one or four */ + + m[0] = msk & ZMM_PTR_MSK; + m[1] = msk >> ZMM_PTR_NUM; + + if (bnum == sizeof(uint8_t)) { + inp[0] = _m512_mask_gather_epi8x8(p[0], m[0]); + inp[1] = _m512_mask_gather_epi8x8(p[1], m[1]); + } else { + inp[0] = _mm512_mask_i64gather_epi32( + _mm512_castsi512_si256(zero), m[0], p[0], + NULL, sizeof(uint8_t)); + inp[1] = _mm512_mask_i64gather_epi32( + _mm512_castsi512_si256(zero), m[1], p[1], + NULL, sizeof(uint8_t)); + } + + /* squeeze input into one 512-bit register */ + return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]), + zmm_pminp.z, _mm512_castsi256_si512(inp[1])); +} + +/* + * Start up to 16 new flows. + * num - number of flows to start + * msk - mask of new flows. + * pdata - pointers to flow input data + * idx - match indexed for given flows + * di - data indexes for these flows. + */ +static inline void +start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk, + __m512i pdata[2], __m512i *idx, __m512i *di) +{ + uint32_t n, m[2], nm[2]; + __m512i ni, nd[2]; + + /* split mask into two - one for each pdata[] */ + m[0] = msk & ZMM_PTR_MSK; + m[1] = msk >> ZMM_PTR_NUM; + + /* calculate masks for new flows */ + n = __builtin_popcount(m[0]); + nm[0] = (1 << n) - 1; + nm[1] = (1 << (num - n)) - 1; + + /* load input data pointers for new flows */ + nd[0] = _mm512_maskz_loadu_epi64(nm[0], + flow->idata + flow->num_packets); + nd[1] = _mm512_maskz_loadu_epi64(nm[1], + flow->idata + flow->num_packets + n); + + /* calculate match indexes of new flows */ + ni = _mm512_set1_epi32(flow->num_packets); + ni = _mm512_add_epi32(ni, zmm_idx_add.z); + + /* merge new and existing flows data */ + pdata[0] = _mm512_mask_expand_epi64(pdata[0], m[0], nd[0]); + pdata[1] = _mm512_mask_expand_epi64(pdata[1], m[1], nd[1]); + + /* update match and data indexes */ + *idx = _mm512_mask_expand_epi32(*idx, msk, ni); + *di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di); + + flow->num_packets += num; +} + +/* + * Process found matches for up to 16 flows. + * fmsk - mask of active flows + * rmsk - mask of found matches + * pdata - pointers to flow input data + * di - data indexes for these flows + * idx - match indexed for given flows + * tr_lo contains low 32 bits for up to 8 transitions. + * tr_hi contains high 32 bits for up to 8 transitions. 
+ */ +static inline uint32_t +match_process_avx512x16(struct acl_flow_avx512 *flow, uint32_t *fmsk, + uint32_t *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx, + __m512i *tr_lo, __m512i *tr_hi) +{ + uint32_t n; + __m512i res; + + if (rmsk[0] == 0) + return 0; + + /* extract match indexes */ + res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z); + + /* mask matched transitions to nop */ + tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z); + tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z); + + /* save found match indexes */ + _mm512_mask_i32scatter_epi32(flow->matches, rmsk[0], + idx[0], res, sizeof(flow->matches[0])); + + /* update masks and start new flows for matches */ + n = update_flow_mask(flow, fmsk, rmsk); + start_flow16(flow, n, rmsk[0], pdata, idx, di); + + return n; +} + +/* + * Test for matches ut to 32 (2x16) flows at once, + * if matches exist - process them and start new flows. + */ +static inline void +match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2], + __m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2], + __m512i tr_lo[2], __m512i tr_hi[2]) +{ + uint32_t n[2]; + uint32_t rm[2]; + + /* check for matches */ + rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z); + rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z); + + /* till unprocessed matches exist */ + while ((rm[0] | rm[1]) != 0) { + + /* process matches and start new flows */ + n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0], + &di[0], &idx[0], &tr_lo[0], &tr_hi[0]); + n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2], + &di[1], &idx[1], &tr_lo[1], &tr_hi[1]); + + /* execute first transition for new flows, if any */ + + if (n[0] != 0) { + inp[0] = get_next_bytes_avx512x16(flow, &pdata[0], + rm[0], &di[0], sizeof(uint8_t)); + first_trans16(flow, inp[0], rm[0], &tr_lo[0], + &tr_hi[0]); + rm[0] = _mm512_test_epi32_mask(tr_lo[0], + zmm_match_mask.z); + } + + if (n[1] != 0) { + inp[1] = get_next_bytes_avx512x16(flow, &pdata[2], + rm[1], &di[1], sizeof(uint8_t)); + first_trans16(flow, inp[1], rm[1], &tr_lo[1], + &tr_hi[1]); + rm[1] = _mm512_test_epi32_mask(tr_lo[1], + zmm_match_mask.z); + } + } +} + +/* + * Perform search for up to 32 flows in parallel. + * Use two sets of metadata, each serves 16 flows max. + * So in fact we perform search for 2x16 flows. 
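+ * Each iteration of the main loop loads 4 input bytes per flow, hence the four back-to-back transition16() calls (one byte is consumed per call).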
+ */ +static inline void +search_trie_avx512x16x2(struct acl_flow_avx512 *flow) +{ + uint32_t fm[2]; + __m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2]; + + /* first 1B load */ + start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]); + start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]); + + in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0], + sizeof(uint8_t)); + in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1], + sizeof(uint8_t)); + + first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]); + first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]); + + fm[0] = UINT16_MAX; + fm[1] = UINT16_MAX; + + /* match check */ + match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in, + tr_lo, tr_hi); + + while ((fm[0] | fm[1]) != 0) { + + /* load next 4B */ + + in[0] = get_next_bytes_avx512x16(flow, &pdata[0], fm[0], + &di[0], sizeof(uint32_t)); + in[1] = get_next_bytes_avx512x16(flow, &pdata[2], fm[1], + &di[1], sizeof(uint32_t)); + + /* main 4B loop */ + + in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + /* check for matches */ + match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in, + tr_lo, tr_hi); + } +} + +/* + * Resolve matches for multiple categories (GT 8, use 512b instuctions/regs) + */ +static inline void +resolve_mcgt8_avx512x1(uint32_t result[], + const struct rte_acl_match_results pr[], const uint32_t match[], + uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie) +{ + const int32_t *pri; + const uint32_t *pm, *res; + uint32_t i, k, mi; + __mmask16 cm, sm; + __m512i cp, cr, np, nr; + + const uint32_t match_log = 5; + + res = pr->results; + pri = pr->priority; + + cm = (1 << nb_cat) - 1; + + for (k = 0; k != nb_pkt; k++, result += nb_cat) { + + mi = match[k] << match_log; + + cr = _mm512_maskz_loadu_epi32(cm, res + mi); + cp = _mm512_maskz_loadu_epi32(cm, pri + mi); + + for (i = 1, pm = match + nb_pkt; i != nb_trie; + i++, pm += nb_pkt) { + + mi = pm[k] << match_log; + + nr = _mm512_maskz_loadu_epi32(cm, res + mi); + np = _mm512_maskz_loadu_epi32(cm, pri + mi); + + sm = _mm512_cmpgt_epi32_mask(cp, np); + cr = _mm512_mask_mov_epi32(nr, sm, cr); + cp = _mm512_mask_mov_epi32(np, sm, cp); + } + + _mm512_mask_storeu_epi32(result, cm, cr); + } +} + +/* + * resolve match index to actual result/priority offset. + */ +static inline __m512i +resolve_match_idx_avx512x16(__m512i mi) +{ + RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) != + 1 << (match_log + 2)); + return _mm512_slli_epi32(mi, match_log); +} + +/* + * Resolve multiple matches for the same flow based on priority. 
+ */ +static inline __m512i +resolve_pri_avx512x16(const int32_t res[], const int32_t pri[], + const uint32_t match[], __mmask16 msk, uint32_t nb_trie, + uint32_t nb_skip) +{ + uint32_t i; + const uint32_t *pm; + __mmask16 m; + __m512i cp, cr, np, nr, mch; + + const __m512i zero = _mm512_set1_epi32(0); + + /* get match indexes */ + mch = _mm512_maskz_loadu_epi32(msk, match); + mch = resolve_match_idx_avx512x16(mch); + + /* read result and priority values for first trie */ + cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0])); + cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0])); + + /* + * read result and priority values for next tries and select one + * with highest priority. + */ + for (i = 1, pm = match + nb_skip; i != nb_trie; + i++, pm += nb_skip) { + + mch = _mm512_maskz_loadu_epi32(msk, pm); + mch = resolve_match_idx_avx512x16(mch); + + nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, + sizeof(res[0])); + np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, + sizeof(pri[0])); + + m = _mm512_cmpgt_epi32_mask(cp, np); + cr = _mm512_mask_mov_epi32(nr, m, cr); + cp = _mm512_mask_mov_epi32(np, m, cp); + } + + return cr; +} + +/* + * Resolve num (<= 16) matches for single category + */ +static inline void +resolve_sc_avx512x16(uint32_t result[], const int32_t res[], + const int32_t pri[], const uint32_t match[], uint32_t nb_pkt, + uint32_t nb_trie, uint32_t nb_skip) +{ + __mmask16 msk; + __m512i cr; + + msk = (1 << nb_pkt) - 1; + cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip); + _mm512_mask_storeu_epi32(result, msk, cr); +} + +/* + * Resolve matches for single category + */ +static inline void +resolve_sc_avx512x16x2(uint32_t result[], + const struct rte_acl_match_results pr[], const uint32_t match[], + uint32_t nb_pkt, uint32_t nb_trie) +{ + uint32_t j, k, n; + const int32_t *res, *pri; + __m512i cr[2]; + + res = (const int32_t *)pr->results; + pri = pr->priority; + + for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) { + + j = k + MASK16_BIT; + + cr[0] = resolve_pri_avx512x16(res, pri, match + k, UINT16_MAX, + nb_trie, nb_pkt); + cr[1] = resolve_pri_avx512x16(res, pri, match + j, UINT16_MAX, + nb_trie, nb_pkt); + + _mm512_storeu_si512(result + k, cr[0]); + _mm512_storeu_si512(result + j, cr[1]); + } + + n = nb_pkt - k; + if (n != 0) { + if (n > MASK16_BIT) { + resolve_sc_avx512x16(result + k, res, pri, match + k, + MASK16_BIT, nb_trie, nb_pkt); + k += MASK16_BIT; + n -= MASK16_BIT; + } + resolve_sc_avx512x16(result + k, res, pri, match + k, n, + nb_trie, nb_pkt); + } +} + +static inline int +search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data, + uint32_t *results, uint32_t total_packets, uint32_t categories) +{ + uint32_t i, *pm; + const struct rte_acl_match_results *pr; + struct acl_flow_avx512 flow; + uint32_t match[ctx->num_tries * total_packets]; + + for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) { + + /* setup for next trie */ + acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets); + + /* process the trie */ + search_trie_avx512x16x2(&flow); + } + + /* resolve matches */ + pr = (const struct rte_acl_match_results *) + (ctx->trans_table + ctx->match_index); + + if (categories == 1) + resolve_sc_avx512x16x2(results, pr, match, total_packets, + ctx->num_tries); + else if (categories <= RTE_ACL_MAX_CATEGORIES / 2) + resolve_mcle8_avx512x1(results, pr, match, total_packets, + categories, ctx->num_tries); + else + resolve_mcgt8_avx512x1(results, pr, match, 
total_packets, + categories, ctx->num_tries); + + return 0; +} From patchwork Tue Oct 6 15:03:13 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ananyev, Konstantin" X-Patchwork-Id: 79794 X-Patchwork-Delegate: david.marchand@redhat.com Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id A9F98A04BB; Tue, 6 Oct 2020 17:12:38 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 067891BBE6; Tue, 6 Oct 2020 17:08:40 +0200 (CEST) Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id 0DAF71BB9A for ; Tue, 6 Oct 2020 17:08:37 +0200 (CEST) IronPort-SDR: YHDhOKq+jaOoYHVkKx4OyiENGvSlDKXgWB+Mt4YcuuFHe8YzhAa3I0D/0RnTdJ/aEaOP7NZPhX elXzHqmzGR/w== X-IronPort-AV: E=McAfee;i="6000,8403,9765"; a="143919677" X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="143919677" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Oct 2020 08:03:52 -0700 IronPort-SDR: Ck/dHGNw8Vfi2XcBwh7odJJKJFe5+QylH59Bl7ZhHjMYhbO7/9d0MVWsDLlcTyIlr2YA2eyomI /Z8urss2bQSA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="518315459" Received: from sivswdev08.ir.intel.com ([10.237.217.47]) by fmsmga005.fm.intel.com with ESMTP; 06 Oct 2020 08:03:49 -0700 From: Konstantin Ananyev To: dev@dpdk.org Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev Date: Tue, 6 Oct 2020 16:03:13 +0100 Message-Id: <20201006150316.5776-12-konstantin.ananyev@intel.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com> References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com> Subject: [dpdk-dev] [PATCH v4 11/14] acl: for AVX512 classify use 4B load whenever possible X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" With current ACL implementation first field in the rule definition has always to be one byte long. Though for optimising classify implementation it might be useful to do 4B reads (as we do for rest of the fields). So at build phase, check user provided field definitions to determine is it safe to do 4B loads for first ACL field. Then at run-time this information can be used to choose classify behavior. Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl.h | 1 + lib/librte_acl/acl_bld.c | 34 ++++++++++++++++++++++++++++++ lib/librte_acl/acl_run_avx512.c | 2 ++ lib/librte_acl/acl_run_avx512x16.h | 8 +++---- lib/librte_acl/acl_run_avx512x8.h | 8 +++---- lib/librte_acl/rte_acl.c | 1 + 6 files changed, 46 insertions(+), 8 deletions(-) diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h index 7ac0d12f08..4089ab2a04 100644 --- a/lib/librte_acl/acl.h +++ b/lib/librte_acl/acl.h @@ -169,6 +169,7 @@ struct rte_acl_ctx { int32_t socket_id; /** Socket ID to allocate memory from. 
*/ enum rte_acl_classify_alg alg; + uint32_t first_load_sz; void *rules; uint32_t max_rules; uint32_t rule_sz; diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c index d1f920b09c..da10864cd8 100644 --- a/lib/librte_acl/acl_bld.c +++ b/lib/librte_acl/acl_bld.c @@ -1581,6 +1581,37 @@ acl_check_bld_param(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg) return 0; } +/* + * With current ACL implementation first field in the rule definition + * has always to be one byte long. Though for optimising *classify* + * implementation it might be useful to be able to use 4B reads + * (as we do for rest of the fields). + * This function checks input config to determine is it safe to do 4B + * loads for first ACL field. For that we need to make sure that + * first field in our rule definition doesn't have the biggest offset, + * i.e. we still do have other fields located after the first one. + * Contrary if first field has the largest offset, then it means + * first field can occupy the very last byte in the input data buffer, + * and we have to do single byte load for it. + */ +static uint32_t +get_first_load_size(const struct rte_acl_config *cfg) +{ + uint32_t i, max_ofs, ofs; + + ofs = 0; + max_ofs = 0; + + for (i = 0; i != cfg->num_fields; i++) { + if (cfg->defs[i].field_index == 0) + ofs = cfg->defs[i].offset; + else if (max_ofs < cfg->defs[i].offset) + max_ofs = cfg->defs[i].offset; + } + + return (ofs < max_ofs) ? sizeof(uint32_t) : sizeof(uint8_t); +} + int rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg) { @@ -1618,6 +1649,9 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg) /* set data indexes. */ acl_set_data_indexes(ctx); + /* determine can we always do 4B load */ + ctx->first_load_sz = get_first_load_size(cfg); + /* copy in build config. 
*/ ctx->config = *cfg; } diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c index 74698fa2ea..3fd1e33c3f 100644 --- a/lib/librte_acl/acl_run_avx512.c +++ b/lib/librte_acl/acl_run_avx512.c @@ -11,6 +11,7 @@ struct acl_flow_avx512 { uint32_t num_packets; /* number of packets processed */ uint32_t total_packets; /* max number of packets to process */ uint32_t root_index; /* current root index */ + uint32_t first_load_sz; /* first load size for new packet */ const uint64_t *trans; /* transition table */ const uint32_t *data_index; /* input data indexes */ const uint8_t **idata; /* input data */ @@ -24,6 +25,7 @@ acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx, { flow->num_packets = 0; flow->total_packets = total_packets; + flow->first_load_sz = ctx->first_load_sz; flow->root_index = ctx->trie[trie].root_index; flow->trans = ctx->trans_table; flow->data_index = ctx->trie[trie].data_index; diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h index 981f8d16da..a39df8f3c0 100644 --- a/lib/librte_acl/acl_run_avx512x16.h +++ b/lib/librte_acl/acl_run_avx512x16.h @@ -460,7 +460,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2], if (n[0] != 0) { inp[0] = get_next_bytes_avx512x16(flow, &pdata[0], - rm[0], &di[0], sizeof(uint8_t)); + rm[0], &di[0], flow->first_load_sz); first_trans16(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]); rm[0] = _mm512_test_epi32_mask(tr_lo[0], @@ -469,7 +469,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2], if (n[1] != 0) { inp[1] = get_next_bytes_avx512x16(flow, &pdata[2], - rm[1], &di[1], sizeof(uint8_t)); + rm[1], &di[1], flow->first_load_sz); first_trans16(flow, inp[1], rm[1], &tr_lo[1], &tr_hi[1]); rm[1] = _mm512_test_epi32_mask(tr_lo[1], @@ -494,9 +494,9 @@ search_trie_avx512x16x2(struct acl_flow_avx512 *flow) start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]); in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0], - sizeof(uint8_t)); + flow->first_load_sz); in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1], - sizeof(uint8_t)); + flow->first_load_sz); first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]); first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]); diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h index cfba0299ed..fedd79b9ae 100644 --- a/lib/librte_acl/acl_run_avx512x8.h +++ b/lib/librte_acl/acl_run_avx512x8.h @@ -418,7 +418,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2], if (n[0] != 0) { inp[0] = get_next_bytes_avx512x8(flow, &pdata[0], - rm[0], &di[0], sizeof(uint8_t)); + rm[0], &di[0], flow->first_load_sz); first_trans8(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]); rm[0] = _mm256_test_epi32_mask(tr_lo[0], @@ -427,7 +427,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2], if (n[1] != 0) { inp[1] = get_next_bytes_avx512x8(flow, &pdata[2], - rm[1], &di[1], sizeof(uint8_t)); + rm[1], &di[1], flow->first_load_sz); first_trans8(flow, inp[1], rm[1], &tr_lo[1], &tr_hi[1]); rm[1] = _mm256_test_epi32_mask(tr_lo[1], @@ -452,9 +452,9 @@ search_trie_avx512x8x2(struct acl_flow_avx512 *flow) start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[2], &idx[1], &di[1]); in[0] = get_next_bytes_avx512x8(flow, &pdata[0], UINT8_MAX, &di[0], - sizeof(uint8_t)); + flow->first_load_sz); in[1] = get_next_bytes_avx512x8(flow, &pdata[2], UINT8_MAX, &di[1], - sizeof(uint8_t)); + 
flow->first_load_sz);

 	first_trans8(flow, in[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
 	first_trans8(flow, in[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 245af672ee..f1474038e5 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -500,6 +500,7 @@ rte_acl_dump(const struct rte_acl_ctx *ctx)
 	printf("acl context <%s>@%p\n", ctx->name, ctx);
 	printf(" socket_id=%"PRId32"\n", ctx->socket_id);
 	printf(" alg=%"PRId32"\n", ctx->alg);
+	printf(" first_load_sz=%"PRIu32"\n", ctx->first_load_sz);
 	printf(" max_rules=%"PRIu32"\n", ctx->max_rules);
 	printf(" rule_size=%"PRIu32"\n", ctx->rule_sz);
 	printf(" num_rules=%"PRIu32"\n", ctx->num_rules);

From patchwork Tue Oct 6 15:03:14 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 79795
X-Patchwork-Delegate: david.marchand@redhat.com
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev
Date: Tue, 6 Oct 2020 16:03:14 +0100
Message-Id: <20201006150316.5776-13-konstantin.ananyev@intel.com>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com>
References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH v4 12/14] acl: deduplicate AVX512 code paths
List-Id: DPDK patches and discussions
Sender: "dev"

The current rte_acl_classify_avx512x32() and rte_acl_classify_avx512x16() code paths are very similar. The only differences are due to the 256/512-bit register/intrinsics naming conventions.
So to deduplicate the code: - Move common code into “acl_run_avx512_common.h” - Use macros to hide difference in naming conventions Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl_run_avx512_common.h | 477 +++++++++++++++++++++ lib/librte_acl/acl_run_avx512x16.h | 569 ++++--------------------- lib/librte_acl/acl_run_avx512x8.h | 565 ++++-------------------- 3 files changed, 654 insertions(+), 957 deletions(-) create mode 100644 lib/librte_acl/acl_run_avx512_common.h diff --git a/lib/librte_acl/acl_run_avx512_common.h b/lib/librte_acl/acl_run_avx512_common.h new file mode 100644 index 0000000000..1baf79b7ae --- /dev/null +++ b/lib/librte_acl/acl_run_avx512_common.h @@ -0,0 +1,477 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +/* + * WARNING: It is not recommended to include this file directly. + * Please include "acl_run_avx512x*.h" instead. + * To make this file to generate proper code an includer has to + * define several macros, refer to "acl_run_avx512x*.h" for more details. + */ + +/* + * Calculate the address of the next transition for + * all types of nodes. Note that only DFA nodes and range + * nodes actually transition to another node. Match + * nodes not supposed to be encountered here. + * For quad range nodes: + * Calculate number of range boundaries that are less than the + * input value. Range boundaries for each node are in signed 8 bit, + * ordered from -128 to 127. + * This is effectively a popcnt of bytes that are greater than the + * input byte. + * Single nodes are processed in the same ways as quad range nodes. + */ +static __rte_always_inline _T_simd +_F_(calc_addr)(_T_simd index_mask, _T_simd next_input, _T_simd shuffle_input, + _T_simd four_32, _T_simd range_base, _T_simd tr_lo, _T_simd tr_hi) +{ + __mmask64 qm; + _T_mask dfa_msk; + _T_simd addr, in, node_type, r, t; + _T_simd dfa_ofs, quad_ofs; + + t = _M_SI_(xor)(index_mask, index_mask); + in = _M_I_(shuffle_epi8)(next_input, shuffle_input); + + /* Calc node type and node addr */ + node_type = _M_SI_(andnot)(index_mask, tr_lo); + addr = _M_SI_(and)(index_mask, tr_lo); + + /* mask for DFA type(0) nodes */ + dfa_msk = _M_I_(cmpeq_epi32_mask)(node_type, t); + + /* DFA calculations. */ + r = _M_I_(srli_epi32)(in, 30); + r = _M_I_(add_epi8)(r, range_base); + t = _M_I_(srli_epi32)(in, 24); + r = _M_I_(shuffle_epi8)(tr_hi, r); + + dfa_ofs = _M_I_(sub_epi32)(t, r); + + /* QUAD/SINGLE calculations. */ + qm = _M_I_(cmpgt_epi8_mask)(in, tr_hi); + t = _M_I_(maskz_set1_epi8)(qm, (uint8_t)UINT8_MAX); + t = _M_I_(lzcnt_epi32)(t); + t = _M_I_(srli_epi32)(t, 3); + quad_ofs = _M_I_(sub_epi32)(four_32, t); + + /* blend DFA and QUAD/SINGLE. */ + t = _M_I_(mask_mov_epi32)(quad_ofs, dfa_msk, dfa_ofs); + + /* calculate address for next transitions. */ + addr = _M_I_(add_epi32)(addr, t); + return addr; +} + +/* + * Process _N_ transitions in parallel. + * tr_lo contains low 32 bits for _N_ transition. + * tr_hi contains high 32 bits for _N_ transition. + * next_input contains up to 4 input bytes for _N_ flows. + */ +static __rte_always_inline _T_simd +_F_(trans)(_T_simd next_input, const uint64_t *trans, _T_simd *tr_lo, + _T_simd *tr_hi) +{ + const int32_t *tr; + _T_simd addr; + + tr = (const int32_t *)(uintptr_t)trans; + + /* Calculate the address (array index) for all _N_ transitions. */ + addr = _F_(calc_addr)(_SV_(index_mask), next_input, _SV_(shuffle_input), + _SV_(four_32), _SV_(range_base), *tr_lo, *tr_hi); + + /* load lower 32 bits of _N_ transactions at once. 
*/ + *tr_lo = _M_GI_(i32gather_epi32, addr, tr, sizeof(trans[0])); + + next_input = _M_I_(srli_epi32)(next_input, CHAR_BIT); + + /* load high 32 bits of _N_ transactions at once. */ + *tr_hi = _M_GI_(i32gather_epi32, addr, (tr + 1), sizeof(trans[0])); + + return next_input; +} + +/* + * Execute first transition for up to _N_ flows in parallel. + * next_input should contain one input byte for up to _N_ flows. + * msk - mask of active flows. + * tr_lo contains low 32 bits for up to _N_ transitions. + * tr_hi contains high 32 bits for up to _N_ transitions. + */ +static __rte_always_inline void +_F_(first_trans)(const struct acl_flow_avx512 *flow, _T_simd next_input, + _T_mask msk, _T_simd *tr_lo, _T_simd *tr_hi) +{ + const int32_t *tr; + _T_simd addr, root; + + tr = (const int32_t *)(uintptr_t)flow->trans; + + addr = _M_I_(set1_epi32)(UINT8_MAX); + root = _M_I_(set1_epi32)(flow->root_index); + + addr = _M_SI_(and)(next_input, addr); + addr = _M_I_(add_epi32)(root, addr); + + /* load lower 32 bits of _N_ transactions at once. */ + *tr_lo = _M_MGI_(mask_i32gather_epi32)(*tr_lo, msk, addr, tr, + sizeof(flow->trans[0])); + + /* load high 32 bits of _N_ transactions at once. */ + *tr_hi = _M_MGI_(mask_i32gather_epi32)(*tr_hi, msk, addr, (tr + 1), + sizeof(flow->trans[0])); +} + +/* + * Load and return next 4 input bytes for up to _N_ flows in parallel. + * pdata - 8x2 pointers to flow input data + * mask - mask of active flows. + * di - data indexes for these _N_ flows. + */ +static inline _T_simd +_F_(get_next_bytes)(const struct acl_flow_avx512 *flow, _T_simd pdata[2], + uint32_t msk, _T_simd *di, uint32_t bnum) +{ + const int32_t *div; + uint32_t m[2]; + _T_simd one, zero, t, p[2]; + + div = (const int32_t *)flow->data_index; + + one = _M_I_(set1_epi32)(1); + zero = _M_SI_(xor)(one, one); + + /* load data offsets for given indexes */ + t = _M_MGI_(mask_i32gather_epi32)(zero, msk, *di, div, sizeof(div[0])); + + /* increment data indexes */ + *di = _M_I_(mask_add_epi32)(*di, msk, *di, one); + + /* + * unsigned expand 32-bit indexes to 64-bit + * (for later pointer arithmetic), i.e: + * for (i = 0; i != _N_; i++) + * p[i/8].u64[i%8] = (uint64_t)t.u32[i]; + */ + p[0] = _M_I_(maskz_permutexvar_epi32)(_SC_(pmidx_msk), _SV_(pmidx[0]), + t); + p[1] = _M_I_(maskz_permutexvar_epi32)(_SC_(pmidx_msk), _SV_(pmidx[1]), + t); + + p[0] = _M_I_(add_epi64)(p[0], pdata[0]); + p[1] = _M_I_(add_epi64)(p[1], pdata[1]); + + /* load input byte(s), either one or four */ + + m[0] = msk & _SIMD_PTR_MSK_; + m[1] = msk >> _SIMD_PTR_NUM_; + + return _F_(gather_bytes)(zero, p, m, bnum); +} + +/* + * Start up to _N_ new flows. + * num - number of flows to start + * msk - mask of new flows. + * pdata - pointers to flow input data + * idx - match indexed for given flows + * di - data indexes for these flows. 
+ */ +static inline void +_F_(start_flow)(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk, + _T_simd pdata[2], _T_simd *idx, _T_simd *di) +{ + uint32_t n, m[2], nm[2]; + _T_simd ni, nd[2]; + + /* split mask into two - one for each pdata[] */ + m[0] = msk & _SIMD_PTR_MSK_; + m[1] = msk >> _SIMD_PTR_NUM_; + + /* calculate masks for new flows */ + n = __builtin_popcount(m[0]); + nm[0] = (1 << n) - 1; + nm[1] = (1 << (num - n)) - 1; + + /* load input data pointers for new flows */ + nd[0] = _M_I_(maskz_loadu_epi64)(nm[0], + flow->idata + flow->num_packets); + nd[1] = _M_I_(maskz_loadu_epi64)(nm[1], + flow->idata + flow->num_packets + n); + + /* calculate match indexes of new flows */ + ni = _M_I_(set1_epi32)(flow->num_packets); + ni = _M_I_(add_epi32)(ni, _SV_(idx_add)); + + /* merge new and existing flows data */ + pdata[0] = _M_I_(mask_expand_epi64)(pdata[0], m[0], nd[0]); + pdata[1] = _M_I_(mask_expand_epi64)(pdata[1], m[1], nd[1]); + + /* update match and data indexes */ + *idx = _M_I_(mask_expand_epi32)(*idx, msk, ni); + *di = _M_I_(maskz_mov_epi32)(msk ^ _SIMD_MASK_MAX_, *di); + + flow->num_packets += num; +} + +/* + * Process found matches for up to _N_ flows. + * fmsk - mask of active flows + * rmsk - mask of found matches + * pdata - pointers to flow input data + * di - data indexes for these flows + * idx - match indexed for given flows + * tr_lo contains low 32 bits for up to _N_ transitions. + * tr_hi contains high 32 bits for up to _N_ transitions. + */ +static inline uint32_t +_F_(match_process)(struct acl_flow_avx512 *flow, uint32_t *fmsk, + uint32_t *rmsk, _T_simd pdata[2], _T_simd *di, _T_simd *idx, + _T_simd *tr_lo, _T_simd *tr_hi) +{ + uint32_t n; + _T_simd res; + + if (rmsk[0] == 0) + return 0; + + /* extract match indexes */ + res = _M_SI_(and)(tr_lo[0], _SV_(index_mask)); + + /* mask matched transitions to nop */ + tr_lo[0] = _M_I_(mask_mov_epi32)(tr_lo[0], rmsk[0], _SV_(trlo_idle)); + tr_hi[0] = _M_I_(mask_mov_epi32)(tr_hi[0], rmsk[0], _SV_(trhi_idle)); + + /* save found match indexes */ + _M_I_(mask_i32scatter_epi32)(flow->matches, rmsk[0], idx[0], res, + sizeof(flow->matches[0])); + + /* update masks and start new flows for matches */ + n = update_flow_mask(flow, fmsk, rmsk); + _F_(start_flow)(flow, n, rmsk[0], pdata, idx, di); + + return n; +} + +/* + * Test for matches ut to (2 * _N_) flows at once, + * if matches exist - process them and start new flows. 
+ */ +static inline void +_F_(match_check_process)(struct acl_flow_avx512 *flow, uint32_t fm[2], + _T_simd pdata[4], _T_simd di[2], _T_simd idx[2], _T_simd inp[2], + _T_simd tr_lo[2], _T_simd tr_hi[2]) +{ + uint32_t n[2]; + uint32_t rm[2]; + + /* check for matches */ + rm[0] = _M_I_(test_epi32_mask)(tr_lo[0], _SV_(match_mask)); + rm[1] = _M_I_(test_epi32_mask)(tr_lo[1], _SV_(match_mask)); + + /* till unprocessed matches exist */ + while ((rm[0] | rm[1]) != 0) { + + /* process matches and start new flows */ + n[0] = _F_(match_process)(flow, &fm[0], &rm[0], &pdata[0], + &di[0], &idx[0], &tr_lo[0], &tr_hi[0]); + n[1] = _F_(match_process)(flow, &fm[1], &rm[1], &pdata[2], + &di[1], &idx[1], &tr_lo[1], &tr_hi[1]); + + /* execute first transition for new flows, if any */ + + if (n[0] != 0) { + inp[0] = _F_(get_next_bytes)(flow, &pdata[0], + rm[0], &di[0], flow->first_load_sz); + _F_(first_trans)(flow, inp[0], rm[0], &tr_lo[0], + &tr_hi[0]); + rm[0] = _M_I_(test_epi32_mask)(tr_lo[0], + _SV_(match_mask)); + } + + if (n[1] != 0) { + inp[1] = _F_(get_next_bytes)(flow, &pdata[2], + rm[1], &di[1], flow->first_load_sz); + _F_(first_trans)(flow, inp[1], rm[1], &tr_lo[1], + &tr_hi[1]); + rm[1] = _M_I_(test_epi32_mask)(tr_lo[1], + _SV_(match_mask)); + } + } +} + +/* + * Perform search for up to (2 * _N_) flows in parallel. + * Use two sets of metadata, each serves _N_ flows max. + */ +static inline void +_F_(search_trie)(struct acl_flow_avx512 *flow) +{ + uint32_t fm[2]; + _T_simd di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2]; + + /* first 1B load */ + _F_(start_flow)(flow, _SIMD_MASK_BIT_, _SIMD_MASK_MAX_, + &pdata[0], &idx[0], &di[0]); + _F_(start_flow)(flow, _SIMD_MASK_BIT_, _SIMD_MASK_MAX_, + &pdata[2], &idx[1], &di[1]); + + in[0] = _F_(get_next_bytes)(flow, &pdata[0], _SIMD_MASK_MAX_, &di[0], + flow->first_load_sz); + in[1] = _F_(get_next_bytes)(flow, &pdata[2], _SIMD_MASK_MAX_, &di[1], + flow->first_load_sz); + + _F_(first_trans)(flow, in[0], _SIMD_MASK_MAX_, &tr_lo[0], &tr_hi[0]); + _F_(first_trans)(flow, in[1], _SIMD_MASK_MAX_, &tr_lo[1], &tr_hi[1]); + + fm[0] = _SIMD_MASK_MAX_; + fm[1] = _SIMD_MASK_MAX_; + + /* match check */ + _F_(match_check_process)(flow, fm, pdata, di, idx, in, tr_lo, tr_hi); + + while ((fm[0] | fm[1]) != 0) { + + /* load next 4B */ + + in[0] = _F_(get_next_bytes)(flow, &pdata[0], fm[0], + &di[0], sizeof(uint32_t)); + in[1] = _F_(get_next_bytes)(flow, &pdata[2], fm[1], + &di[1], sizeof(uint32_t)); + + /* main 4B loop */ + + in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); + in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + /* check for matches */ + _F_(match_check_process)(flow, fm, pdata, di, idx, in, + tr_lo, tr_hi); + } +} + +/* + * resolve match index to actual result/priority offset. + */ +static inline _T_simd +_F_(resolve_match_idx)(_T_simd mi) +{ + RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) != + 1 << (match_log + 2)); + return _M_I_(slli_epi32)(mi, match_log); +} + +/* + * Resolve multiple matches for the same flow based on priority. 
+ */ +static inline _T_simd +_F_(resolve_pri)(const int32_t res[], const int32_t pri[], + const uint32_t match[], _T_mask msk, uint32_t nb_trie, + uint32_t nb_skip) +{ + uint32_t i; + const uint32_t *pm; + _T_mask m; + _T_simd cp, cr, np, nr, mch; + + const _T_simd zero = _M_I_(set1_epi32)(0); + + /* get match indexes */ + mch = _M_I_(maskz_loadu_epi32)(msk, match); + mch = _F_(resolve_match_idx)(mch); + + /* read result and priority values for first trie */ + cr = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, res, sizeof(res[0])); + cp = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, pri, sizeof(pri[0])); + + /* + * read result and priority values for next tries and select one + * with highest priority. + */ + for (i = 1, pm = match + nb_skip; i != nb_trie; + i++, pm += nb_skip) { + + mch = _M_I_(maskz_loadu_epi32)(msk, pm); + mch = _F_(resolve_match_idx)(mch); + + nr = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, res, + sizeof(res[0])); + np = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, pri, + sizeof(pri[0])); + + m = _M_I_(cmpgt_epi32_mask)(cp, np); + cr = _M_I_(mask_mov_epi32)(nr, m, cr); + cp = _M_I_(mask_mov_epi32)(np, m, cp); + } + + return cr; +} + +/* + * Resolve num (<= _N_) matches for single category + */ +static inline void +_F_(resolve_sc)(uint32_t result[], const int32_t res[], + const int32_t pri[], const uint32_t match[], uint32_t nb_pkt, + uint32_t nb_trie, uint32_t nb_skip) +{ + _T_mask msk; + _T_simd cr; + + msk = (1 << nb_pkt) - 1; + cr = _F_(resolve_pri)(res, pri, match, msk, nb_trie, nb_skip); + _M_I_(mask_storeu_epi32)(result, msk, cr); +} + +/* + * Resolve matches for single category + */ +static inline void +_F_(resolve_single_cat)(uint32_t result[], + const struct rte_acl_match_results pr[], const uint32_t match[], + uint32_t nb_pkt, uint32_t nb_trie) +{ + uint32_t j, k, n; + const int32_t *res, *pri; + _T_simd cr[2]; + + res = (const int32_t *)pr->results; + pri = pr->priority; + + for (k = 0; k != (nb_pkt & ~_SIMD_FLOW_MSK_); k += _SIMD_FLOW_NUM_) { + + j = k + _SIMD_MASK_BIT_; + + cr[0] = _F_(resolve_pri)(res, pri, match + k, _SIMD_MASK_MAX_, + nb_trie, nb_pkt); + cr[1] = _F_(resolve_pri)(res, pri, match + j, _SIMD_MASK_MAX_, + nb_trie, nb_pkt); + + _M_SI_(storeu)((void *)(result + k), cr[0]); + _M_SI_(storeu)((void *)(result + j), cr[1]); + } + + n = nb_pkt - k; + if (n != 0) { + if (n > _SIMD_MASK_BIT_) { + _F_(resolve_sc)(result + k, res, pri, match + k, + _SIMD_MASK_BIT_, nb_trie, nb_pkt); + k += _SIMD_MASK_BIT_; + n -= _SIMD_MASK_BIT_; + } + _F_(resolve_sc)(result + k, res, pri, match + k, n, + nb_trie, nb_pkt); + } +} diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h index a39df8f3c0..da244bc257 100644 --- a/lib/librte_acl/acl_run_avx512x16.h +++ b/lib/librte_acl/acl_run_avx512x16.h @@ -2,16 +2,57 @@ * Copyright(c) 2020 Intel Corporation */ -#define MASK16_BIT (sizeof(__mmask16) * CHAR_BIT) +/* + * Defines required by "acl_run_avx512_common.h". + * Note that all of them has to be undefined by the end + * of this file, as "acl_run_avx512_common.h" can be included several + * times from different *.h files for the same *.c. + */ + +/* + * This implementation uses 512-bit registers(zmm) and instrincts. + * So our main SIMD type is 512-bit width and each such variable can + * process sizeof(__m512i) / sizeof(uint32_t) == 16 entries in parallel. + */ +#define _T_simd __m512i +#define _T_mask __mmask16 + +/* Naming convention for static const variables. 
*/ +#define _SC_(x) zmm_##x +#define _SV_(x) (zmm_##x.z) + +/* Naming convention for internal functions. */ +#define _F_(x) x##_avx512x16 + +/* + * Same instrincts have different syntaxis (depending on the bit-width), + * so to overcome that few macros need to be defined. + */ + +/* Naming convention for generic epi(packed integers) type instrincts. */ +#define _M_I_(x) _mm512_##x + +/* Naming convention for si(whole simd integer) type instrincts. */ +#define _M_SI_(x) _mm512_##x##_si512 + +/* Naming convention for masked gather type instrincts. */ +#define _M_MGI_(x) _mm512_##x + +/* Naming convention for gather type instrincts. */ +#define _M_GI_(name, idx, base, scale) _mm512_##name(idx, base, scale) -#define NUM_AVX512X16X2 (2 * MASK16_BIT) -#define MSK_AVX512X16X2 (NUM_AVX512X16X2 - 1) +/* num/mask of transitions per SIMD regs */ +#define _SIMD_MASK_BIT_ (sizeof(_T_simd) / sizeof(uint32_t)) +#define _SIMD_MASK_MAX_ RTE_LEN2MASK(_SIMD_MASK_BIT_, uint32_t) + +#define _SIMD_FLOW_NUM_ (2 * _SIMD_MASK_BIT_) +#define _SIMD_FLOW_MSK_ (_SIMD_FLOW_NUM_ - 1) /* num/mask of pointers per SIMD regs */ -#define ZMM_PTR_NUM (sizeof(__m512i) / sizeof(uintptr_t)) -#define ZMM_PTR_MSK RTE_LEN2MASK(ZMM_PTR_NUM, uint32_t) +#define _SIMD_PTR_NUM_ (sizeof(_T_simd) / sizeof(uintptr_t)) +#define _SIMD_PTR_MSK_ RTE_LEN2MASK(_SIMD_PTR_NUM_, uint32_t) -static const __rte_x86_zmm_t zmm_match_mask = { +static const __rte_x86_zmm_t _SC_(match_mask) = { .u32 = { RTE_ACL_NODE_MATCH, RTE_ACL_NODE_MATCH, @@ -32,7 +73,7 @@ static const __rte_x86_zmm_t zmm_match_mask = { }, }; -static const __rte_x86_zmm_t zmm_index_mask = { +static const __rte_x86_zmm_t _SC_(index_mask) = { .u32 = { RTE_ACL_NODE_INDEX, RTE_ACL_NODE_INDEX, @@ -53,7 +94,7 @@ static const __rte_x86_zmm_t zmm_index_mask = { }, }; -static const __rte_x86_zmm_t zmm_trlo_idle = { +static const __rte_x86_zmm_t _SC_(trlo_idle) = { .u32 = { RTE_ACL_IDLE_NODE, RTE_ACL_IDLE_NODE, @@ -74,7 +115,7 @@ static const __rte_x86_zmm_t zmm_trlo_idle = { }, }; -static const __rte_x86_zmm_t zmm_trhi_idle = { +static const __rte_x86_zmm_t _SC_(trhi_idle) = { .u32 = { 0, 0, 0, 0, 0, 0, 0, 0, @@ -83,7 +124,7 @@ static const __rte_x86_zmm_t zmm_trhi_idle = { }, }; -static const __rte_x86_zmm_t zmm_shuffle_input = { +static const __rte_x86_zmm_t _SC_(shuffle_input) = { .u32 = { 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, @@ -92,7 +133,7 @@ static const __rte_x86_zmm_t zmm_shuffle_input = { }, }; -static const __rte_x86_zmm_t zmm_four_32 = { +static const __rte_x86_zmm_t _SC_(four_32) = { .u32 = { 4, 4, 4, 4, 4, 4, 4, 4, @@ -101,7 +142,7 @@ static const __rte_x86_zmm_t zmm_four_32 = { }, }; -static const __rte_x86_zmm_t zmm_idx_add = { +static const __rte_x86_zmm_t _SC_(idx_add) = { .u32 = { 0, 1, 2, 3, 4, 5, 6, 7, @@ -110,7 +151,7 @@ static const __rte_x86_zmm_t zmm_idx_add = { }, }; -static const __rte_x86_zmm_t zmm_range_base = { +static const __rte_x86_zmm_t _SC_(range_base) = { .u32 = { 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, @@ -119,16 +160,16 @@ static const __rte_x86_zmm_t zmm_range_base = { }, }; -static const __rte_x86_zmm_t zmm_pminp = { +static const __rte_x86_zmm_t _SC_(pminp) = { .u32 = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }, }; -static const __mmask16 zmm_pmidx_msk = 0x5555; +static const _T_mask _SC_(pmidx_msk) = 0x5555; -static const __rte_x86_zmm_t zmm_pmidx[2] = { +static const __rte_x86_zmm_t 
_SC_(pmidx[2]) = { [0] = { .u32 = { 0, 0, 1, 0, 2, 0, 3, 0, @@ -148,7 +189,7 @@ static const __rte_x86_zmm_t zmm_pmidx[2] = { * gather load on a byte quantity. So we have to mimic it in SW, * by doing 8x1B scalar loads. */ -static inline ymm_t +static inline __m256i _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask) { rte_ymm_t v; @@ -156,7 +197,7 @@ _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask) static const uint32_t zero; - p.z = _mm512_mask_set1_epi64(pdata, mask ^ ZMM_PTR_MSK, + p.z = _mm512_mask_set1_epi64(pdata, mask ^ _SIMD_PTR_MSK_, (uintptr_t)&zero); v.u32[0] = *(uint8_t *)p.u64[0]; @@ -172,369 +213,29 @@ _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask) } /* - * Calculate the address of the next transition for - * all types of nodes. Note that only DFA nodes and range - * nodes actually transition to another node. Match - * nodes not supposed to be encountered here. - * For quad range nodes: - * Calculate number of range boundaries that are less than the - * input value. Range boundaries for each node are in signed 8 bit, - * ordered from -128 to 127. - * This is effectively a popcnt of bytes that are greater than the - * input byte. - * Single nodes are processed in the same ways as quad range nodes. - */ -static __rte_always_inline __m512i -calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input, - __m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi) -{ - __mmask64 qm; - __mmask16 dfa_msk; - __m512i addr, in, node_type, r, t; - __m512i dfa_ofs, quad_ofs; - - t = _mm512_xor_si512(index_mask, index_mask); - in = _mm512_shuffle_epi8(next_input, shuffle_input); - - /* Calc node type and node addr */ - node_type = _mm512_andnot_si512(index_mask, tr_lo); - addr = _mm512_and_si512(index_mask, tr_lo); - - /* mask for DFA type(0) nodes */ - dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t); - - /* DFA calculations. */ - r = _mm512_srli_epi32(in, 30); - r = _mm512_add_epi8(r, range_base); - t = _mm512_srli_epi32(in, 24); - r = _mm512_shuffle_epi8(tr_hi, r); - - dfa_ofs = _mm512_sub_epi32(t, r); - - /* QUAD/SINGLE calculations. */ - qm = _mm512_cmpgt_epi8_mask(in, tr_hi); - t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX); - t = _mm512_lzcnt_epi32(t); - t = _mm512_srli_epi32(t, 3); - quad_ofs = _mm512_sub_epi32(four_32, t); - - /* blend DFA and QUAD/SINGLE. */ - t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs); - - /* calculate address for next transitions. */ - addr = _mm512_add_epi32(addr, t); - return addr; -} - -/* - * Process 16 transitions in parallel. - * tr_lo contains low 32 bits for 16 transition. - * tr_hi contains high 32 bits for 16 transition. - * next_input contains up to 4 input bytes for 16 flows. + * Gather 4/1 input bytes for up to 16 (2*8) locations in parallel. */ static __rte_always_inline __m512i -transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo, - __m512i *tr_hi) -{ - const int32_t *tr; - __m512i addr; - - tr = (const int32_t *)(uintptr_t)trans; - - /* Calculate the address (array index) for all 16 transitions. */ - addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z, - zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi); - - /* load lower 32 bits of 16 transactions at once. */ - *tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0])); - - next_input = _mm512_srli_epi32(next_input, CHAR_BIT); - - /* load high 32 bits of 16 transactions at once. 
*/ - *tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0])); - - return next_input; -} - -/* - * Execute first transition for up to 16 flows in parallel. - * next_input should contain one input byte for up to 16 flows. - * msk - mask of active flows. - * tr_lo contains low 32 bits for up to 16 transitions. - * tr_hi contains high 32 bits for up to 16 transitions. - */ -static __rte_always_inline void -first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input, - __mmask16 msk, __m512i *tr_lo, __m512i *tr_hi) +_F_(gather_bytes)(__m512i zero, const __m512i p[2], const uint32_t m[2], + uint32_t bnum) { - const int32_t *tr; - __m512i addr, root; - - tr = (const int32_t *)(uintptr_t)flow->trans; - - addr = _mm512_set1_epi32(UINT8_MAX); - root = _mm512_set1_epi32(flow->root_index); - - addr = _mm512_and_si512(next_input, addr); - addr = _mm512_add_epi32(root, addr); - - /* load lower 32 bits of 16 transactions at once. */ - *tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr, - sizeof(flow->trans[0])); - - /* load high 32 bits of 16 transactions at once. */ - *tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1), - sizeof(flow->trans[0])); -} - -/* - * Load and return next 4 input bytes for up to 16 flows in parallel. - * pdata - 8x2 pointers to flow input data - * mask - mask of active flows. - * di - data indexes for these 16 flows. - */ -static inline __m512i -get_next_bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2], - uint32_t msk, __m512i *di, uint32_t bnum) -{ - const int32_t *div; - uint32_t m[2]; - __m512i one, zero, t, p[2]; - ymm_t inp[2]; - - div = (const int32_t *)flow->data_index; - - one = _mm512_set1_epi32(1); - zero = _mm512_xor_si512(one, one); - - /* load data offsets for given indexes */ - t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0])); - - /* increment data indexes */ - *di = _mm512_mask_add_epi32(*di, msk, *di, one); - - /* - * unsigned expand 32-bit indexes to 64-bit - * (for later pointer arithmetic), i.e: - * for (i = 0; i != 16; i++) - * p[i/8].u64[i%8] = (uint64_t)t.u32[i]; - */ - p[0] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[0].z, t); - p[1] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[1].z, t); - - p[0] = _mm512_add_epi64(p[0], pdata[0]); - p[1] = _mm512_add_epi64(p[1], pdata[1]); - - /* load input byte(s), either one or four */ - - m[0] = msk & ZMM_PTR_MSK; - m[1] = msk >> ZMM_PTR_NUM; + __m256i inp[2]; if (bnum == sizeof(uint8_t)) { inp[0] = _m512_mask_gather_epi8x8(p[0], m[0]); inp[1] = _m512_mask_gather_epi8x8(p[1], m[1]); } else { inp[0] = _mm512_mask_i64gather_epi32( - _mm512_castsi512_si256(zero), m[0], p[0], - NULL, sizeof(uint8_t)); + _mm512_castsi512_si256(zero), + m[0], p[0], NULL, sizeof(uint8_t)); inp[1] = _mm512_mask_i64gather_epi32( - _mm512_castsi512_si256(zero), m[1], p[1], - NULL, sizeof(uint8_t)); + _mm512_castsi512_si256(zero), + m[1], p[1], NULL, sizeof(uint8_t)); } /* squeeze input into one 512-bit register */ return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]), - zmm_pminp.z, _mm512_castsi256_si512(inp[1])); -} - -/* - * Start up to 16 new flows. - * num - number of flows to start - * msk - mask of new flows. - * pdata - pointers to flow input data - * idx - match indexed for given flows - * di - data indexes for these flows. 
- */ -static inline void -start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk, - __m512i pdata[2], __m512i *idx, __m512i *di) -{ - uint32_t n, m[2], nm[2]; - __m512i ni, nd[2]; - - /* split mask into two - one for each pdata[] */ - m[0] = msk & ZMM_PTR_MSK; - m[1] = msk >> ZMM_PTR_NUM; - - /* calculate masks for new flows */ - n = __builtin_popcount(m[0]); - nm[0] = (1 << n) - 1; - nm[1] = (1 << (num - n)) - 1; - - /* load input data pointers for new flows */ - nd[0] = _mm512_maskz_loadu_epi64(nm[0], - flow->idata + flow->num_packets); - nd[1] = _mm512_maskz_loadu_epi64(nm[1], - flow->idata + flow->num_packets + n); - - /* calculate match indexes of new flows */ - ni = _mm512_set1_epi32(flow->num_packets); - ni = _mm512_add_epi32(ni, zmm_idx_add.z); - - /* merge new and existing flows data */ - pdata[0] = _mm512_mask_expand_epi64(pdata[0], m[0], nd[0]); - pdata[1] = _mm512_mask_expand_epi64(pdata[1], m[1], nd[1]); - - /* update match and data indexes */ - *idx = _mm512_mask_expand_epi32(*idx, msk, ni); - *di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di); - - flow->num_packets += num; -} - -/* - * Process found matches for up to 16 flows. - * fmsk - mask of active flows - * rmsk - mask of found matches - * pdata - pointers to flow input data - * di - data indexes for these flows - * idx - match indexed for given flows - * tr_lo contains low 32 bits for up to 8 transitions. - * tr_hi contains high 32 bits for up to 8 transitions. - */ -static inline uint32_t -match_process_avx512x16(struct acl_flow_avx512 *flow, uint32_t *fmsk, - uint32_t *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx, - __m512i *tr_lo, __m512i *tr_hi) -{ - uint32_t n; - __m512i res; - - if (rmsk[0] == 0) - return 0; - - /* extract match indexes */ - res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z); - - /* mask matched transitions to nop */ - tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z); - tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z); - - /* save found match indexes */ - _mm512_mask_i32scatter_epi32(flow->matches, rmsk[0], - idx[0], res, sizeof(flow->matches[0])); - - /* update masks and start new flows for matches */ - n = update_flow_mask(flow, fmsk, rmsk); - start_flow16(flow, n, rmsk[0], pdata, idx, di); - - return n; -} - -/* - * Test for matches ut to 32 (2x16) flows at once, - * if matches exist - process them and start new flows. 
- */ -static inline void -match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2], - __m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2], - __m512i tr_lo[2], __m512i tr_hi[2]) -{ - uint32_t n[2]; - uint32_t rm[2]; - - /* check for matches */ - rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z); - rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z); - - /* till unprocessed matches exist */ - while ((rm[0] | rm[1]) != 0) { - - /* process matches and start new flows */ - n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0], - &di[0], &idx[0], &tr_lo[0], &tr_hi[0]); - n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2], - &di[1], &idx[1], &tr_lo[1], &tr_hi[1]); - - /* execute first transition for new flows, if any */ - - if (n[0] != 0) { - inp[0] = get_next_bytes_avx512x16(flow, &pdata[0], - rm[0], &di[0], flow->first_load_sz); - first_trans16(flow, inp[0], rm[0], &tr_lo[0], - &tr_hi[0]); - rm[0] = _mm512_test_epi32_mask(tr_lo[0], - zmm_match_mask.z); - } - - if (n[1] != 0) { - inp[1] = get_next_bytes_avx512x16(flow, &pdata[2], - rm[1], &di[1], flow->first_load_sz); - first_trans16(flow, inp[1], rm[1], &tr_lo[1], - &tr_hi[1]); - rm[1] = _mm512_test_epi32_mask(tr_lo[1], - zmm_match_mask.z); - } - } -} - -/* - * Perform search for up to 32 flows in parallel. - * Use two sets of metadata, each serves 16 flows max. - * So in fact we perform search for 2x16 flows. - */ -static inline void -search_trie_avx512x16x2(struct acl_flow_avx512 *flow) -{ - uint32_t fm[2]; - __m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2]; - - /* first 1B load */ - start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]); - start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]); - - in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0], - flow->first_load_sz); - in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1], - flow->first_load_sz); - - first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]); - first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]); - - fm[0] = UINT16_MAX; - fm[1] = UINT16_MAX; - - /* match check */ - match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in, - tr_lo, tr_hi); - - while ((fm[0] | fm[1]) != 0) { - - /* load next 4B */ - - in[0] = get_next_bytes_avx512x16(flow, &pdata[0], fm[0], - &di[0], sizeof(uint32_t)); - in[1] = get_next_bytes_avx512x16(flow, &pdata[2], fm[1], - &di[1], sizeof(uint32_t)); - - /* main 4B loop */ - - in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); - in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); - - in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); - in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); - - in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); - in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); - - in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); - in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); - - /* check for matches */ - match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in, - tr_lo, tr_hi); - } + _SV_(pminp), _mm512_castsi256_si512(inp[1])); } /* @@ -582,120 +283,12 @@ resolve_mcgt8_avx512x1(uint32_t result[], } } -/* - * resolve match index to actual result/priority offset. 
- */ -static inline __m512i -resolve_match_idx_avx512x16(__m512i mi) -{ - RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) != - 1 << (match_log + 2)); - return _mm512_slli_epi32(mi, match_log); -} - -/* - * Resolve multiple matches for the same flow based on priority. - */ -static inline __m512i -resolve_pri_avx512x16(const int32_t res[], const int32_t pri[], - const uint32_t match[], __mmask16 msk, uint32_t nb_trie, - uint32_t nb_skip) -{ - uint32_t i; - const uint32_t *pm; - __mmask16 m; - __m512i cp, cr, np, nr, mch; - - const __m512i zero = _mm512_set1_epi32(0); - - /* get match indexes */ - mch = _mm512_maskz_loadu_epi32(msk, match); - mch = resolve_match_idx_avx512x16(mch); - - /* read result and priority values for first trie */ - cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0])); - cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0])); - - /* - * read result and priority values for next tries and select one - * with highest priority. - */ - for (i = 1, pm = match + nb_skip; i != nb_trie; - i++, pm += nb_skip) { - - mch = _mm512_maskz_loadu_epi32(msk, pm); - mch = resolve_match_idx_avx512x16(mch); - - nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, - sizeof(res[0])); - np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, - sizeof(pri[0])); - - m = _mm512_cmpgt_epi32_mask(cp, np); - cr = _mm512_mask_mov_epi32(nr, m, cr); - cp = _mm512_mask_mov_epi32(np, m, cp); - } - - return cr; -} - -/* - * Resolve num (<= 16) matches for single category - */ -static inline void -resolve_sc_avx512x16(uint32_t result[], const int32_t res[], - const int32_t pri[], const uint32_t match[], uint32_t nb_pkt, - uint32_t nb_trie, uint32_t nb_skip) -{ - __mmask16 msk; - __m512i cr; - - msk = (1 << nb_pkt) - 1; - cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip); - _mm512_mask_storeu_epi32(result, msk, cr); -} +#include "acl_run_avx512_common.h" /* - * Resolve matches for single category + * Perform search for up to (2 * 16) flows in parallel. + * Use two sets of metadata, each serves 16 flows max. 
*/ -static inline void -resolve_sc_avx512x16x2(uint32_t result[], - const struct rte_acl_match_results pr[], const uint32_t match[], - uint32_t nb_pkt, uint32_t nb_trie) -{ - uint32_t j, k, n; - const int32_t *res, *pri; - __m512i cr[2]; - - res = (const int32_t *)pr->results; - pri = pr->priority; - - for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) { - - j = k + MASK16_BIT; - - cr[0] = resolve_pri_avx512x16(res, pri, match + k, UINT16_MAX, - nb_trie, nb_pkt); - cr[1] = resolve_pri_avx512x16(res, pri, match + j, UINT16_MAX, - nb_trie, nb_pkt); - - _mm512_storeu_si512(result + k, cr[0]); - _mm512_storeu_si512(result + j, cr[1]); - } - - n = nb_pkt - k; - if (n != 0) { - if (n > MASK16_BIT) { - resolve_sc_avx512x16(result + k, res, pri, match + k, - MASK16_BIT, nb_trie, nb_pkt); - k += MASK16_BIT; - n -= MASK16_BIT; - } - resolve_sc_avx512x16(result + k, res, pri, match + k, n, - nb_trie, nb_pkt); - } -} - static inline int search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t total_packets, uint32_t categories) @@ -711,7 +304,7 @@ search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data, acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets); /* process the trie */ - search_trie_avx512x16x2(&flow); + _F_(search_trie)(&flow); } /* resolve matches */ @@ -719,7 +312,7 @@ search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data, (ctx->trans_table + ctx->match_index); if (categories == 1) - resolve_sc_avx512x16x2(results, pr, match, total_packets, + _F_(resolve_single_cat)(results, pr, match, total_packets, ctx->num_tries); else if (categories <= RTE_ACL_MAX_CATEGORIES / 2) resolve_mcle8_avx512x1(results, pr, match, total_packets, @@ -730,3 +323,19 @@ search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data, return 0; } + +#undef _SIMD_PTR_MSK_ +#undef _SIMD_PTR_NUM_ +#undef _SIMD_FLOW_MSK_ +#undef _SIMD_FLOW_NUM_ +#undef _SIMD_MASK_MAX_ +#undef _SIMD_MASK_BIT_ +#undef _M_GI_ +#undef _M_MGI_ +#undef _M_SI_ +#undef _M_I_ +#undef _F_ +#undef _SV_ +#undef _SC_ +#undef _T_mask +#undef _T_simd diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h index fedd79b9ae..61ac9d1b47 100644 --- a/lib/librte_acl/acl_run_avx512x8.h +++ b/lib/librte_acl/acl_run_avx512x8.h @@ -2,16 +2,57 @@ * Copyright(c) 2020 Intel Corporation */ -#define MASK8_BIT (sizeof(__mmask8) * CHAR_BIT) +/* + * Defines required by "acl_run_avx512_common.h". + * Note that all of them has to be undefined by the end + * of this file, as "acl_run_avx512_common.h" can be included several + * times from different *.h files for the same *.c. + */ + +/* + * This implementation uses 256-bit registers(ymm) and instrincts. + * So our main SIMD type is 256-bit width and each such variable can + * process sizeof(__m256i) / sizeof(uint32_t) == 8 entries in parallel. + */ +#define _T_simd __m256i +#define _T_mask __mmask8 + +/* Naming convention for static const variables. */ +#define _SC_(x) ymm_##x +#define _SV_(x) (ymm_##x.y) + +/* Naming convention for internal functions. */ +#define _F_(x) x##_avx512x8 + +/* + * Same instrincts have different syntaxis (depending on the bit-width), + * so to overcome that few macros need to be defined. + */ + +/* Naming convention for generic epi(packed integers) type instrincts. */ +#define _M_I_(x) _mm256_##x + +/* Naming convention for si(whole simd integer) type instrincts. 
*/ +#define _M_SI_(x) _mm256_##x##_si256 -#define NUM_AVX512X8X2 (2 * MASK8_BIT) -#define MSK_AVX512X8X2 (NUM_AVX512X8X2 - 1) +/* Naming convention for masked gather type instrincts. */ +#define _M_MGI_(x) _mm256_m##x + +/* Naming convention for gather type instrincts. */ +#define _M_GI_(name, idx, base, scale) _mm256_##name(base, idx, scale) + +/* num/mask of transitions per SIMD regs */ +#define _SIMD_MASK_BIT_ (sizeof(_T_simd) / sizeof(uint32_t)) +#define _SIMD_MASK_MAX_ RTE_LEN2MASK(_SIMD_MASK_BIT_, uint32_t) + +#define _SIMD_FLOW_NUM_ (2 * _SIMD_MASK_BIT_) +#define _SIMD_FLOW_MSK_ (_SIMD_FLOW_NUM_ - 1) /* num/mask of pointers per SIMD regs */ -#define YMM_PTR_NUM (sizeof(__m256i) / sizeof(uintptr_t)) -#define YMM_PTR_MSK RTE_LEN2MASK(YMM_PTR_NUM, uint32_t) +#define _SIMD_PTR_NUM_ (sizeof(_T_simd) / sizeof(uintptr_t)) +#define _SIMD_PTR_MSK_ RTE_LEN2MASK(_SIMD_PTR_NUM_, uint32_t) -static const rte_ymm_t ymm_match_mask = { +static const rte_ymm_t _SC_(match_mask) = { .u32 = { RTE_ACL_NODE_MATCH, RTE_ACL_NODE_MATCH, @@ -24,7 +65,7 @@ static const rte_ymm_t ymm_match_mask = { }, }; -static const rte_ymm_t ymm_index_mask = { +static const rte_ymm_t _SC_(index_mask) = { .u32 = { RTE_ACL_NODE_INDEX, RTE_ACL_NODE_INDEX, @@ -37,7 +78,7 @@ static const rte_ymm_t ymm_index_mask = { }, }; -static const rte_ymm_t ymm_trlo_idle = { +static const rte_ymm_t _SC_(trlo_idle) = { .u32 = { RTE_ACL_IDLE_NODE, RTE_ACL_IDLE_NODE, @@ -50,51 +91,51 @@ static const rte_ymm_t ymm_trlo_idle = { }, }; -static const rte_ymm_t ymm_trhi_idle = { +static const rte_ymm_t _SC_(trhi_idle) = { .u32 = { 0, 0, 0, 0, 0, 0, 0, 0, }, }; -static const rte_ymm_t ymm_shuffle_input = { +static const rte_ymm_t _SC_(shuffle_input) = { .u32 = { 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, }, }; -static const rte_ymm_t ymm_four_32 = { +static const rte_ymm_t _SC_(four_32) = { .u32 = { 4, 4, 4, 4, 4, 4, 4, 4, }, }; -static const rte_ymm_t ymm_idx_add = { +static const rte_ymm_t _SC_(idx_add) = { .u32 = { 0, 1, 2, 3, 4, 5, 6, 7, }, }; -static const rte_ymm_t ymm_range_base = { +static const rte_ymm_t _SC_(range_base) = { .u32 = { 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, }, }; -static const rte_ymm_t ymm_pminp = { +static const rte_ymm_t _SC_(pminp) = { .u32 = { 0x00, 0x01, 0x02, 0x03, 0x08, 0x09, 0x0a, 0x0b, }, }; -static const __mmask16 ymm_pmidx_msk = 0x55; +static const __mmask16 _SC_(pmidx_msk) = 0x55; -static const rte_ymm_t ymm_pmidx[2] = { +static const rte_ymm_t _SC_(pmidx[2]) = { [0] = { .u32 = { 0, 0, 1, 0, 2, 0, 3, 0, @@ -120,7 +161,7 @@ _m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask) static const uint32_t zero; - p.y = _mm256_mask_set1_epi64(pdata, mask ^ YMM_PTR_MSK, + p.y = _mm256_mask_set1_epi64(pdata, mask ^ _SIMD_PTR_MSK_, (uintptr_t)&zero); v.u32[0] = *(uint8_t *)p.u64[0]; @@ -132,483 +173,37 @@ _m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask) } /* - * Calculate the address of the next transition for - * all types of nodes. Note that only DFA nodes and range - * nodes actually transition to another node. Match - * nodes not supposed to be encountered here. - * For quad range nodes: - * Calculate number of range boundaries that are less than the - * input value. Range boundaries for each node are in signed 8 bit, - * ordered from -128 to 127. - * This is effectively a popcnt of bytes that are greater than the - * input byte. - * Single nodes are processed in the same ways as quad range nodes. 
- */ -static __rte_always_inline __m256i -calc_addr8(__m256i index_mask, __m256i next_input, __m256i shuffle_input, - __m256i four_32, __m256i range_base, __m256i tr_lo, __m256i tr_hi) -{ - __mmask32 qm; - __mmask8 dfa_msk; - __m256i addr, in, node_type, r, t; - __m256i dfa_ofs, quad_ofs; - - t = _mm256_xor_si256(index_mask, index_mask); - in = _mm256_shuffle_epi8(next_input, shuffle_input); - - /* Calc node type and node addr */ - node_type = _mm256_andnot_si256(index_mask, tr_lo); - addr = _mm256_and_si256(index_mask, tr_lo); - - /* mask for DFA type(0) nodes */ - dfa_msk = _mm256_cmpeq_epi32_mask(node_type, t); - - /* DFA calculations. */ - r = _mm256_srli_epi32(in, 30); - r = _mm256_add_epi8(r, range_base); - t = _mm256_srli_epi32(in, 24); - r = _mm256_shuffle_epi8(tr_hi, r); - - dfa_ofs = _mm256_sub_epi32(t, r); - - /* QUAD/SINGLE calculations. */ - qm = _mm256_cmpgt_epi8_mask(in, tr_hi); - t = _mm256_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX); - t = _mm256_lzcnt_epi32(t); - t = _mm256_srli_epi32(t, 3); - quad_ofs = _mm256_sub_epi32(four_32, t); - - /* blend DFA and QUAD/SINGLE. */ - t = _mm256_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs); - - /* calculate address for next transitions. */ - addr = _mm256_add_epi32(addr, t); - return addr; -} - -/* - * Process 16 transitions in parallel. - * tr_lo contains low 32 bits for 16 transition. - * tr_hi contains high 32 bits for 16 transition. - * next_input contains up to 4 input bytes for 16 flows. + * Gather 4/1 input bytes for up to 8 (2*8) locations in parallel. */ static __rte_always_inline __m256i -transition8(__m256i next_input, const uint64_t *trans, __m256i *tr_lo, - __m256i *tr_hi) -{ - const int32_t *tr; - __m256i addr; - - tr = (const int32_t *)(uintptr_t)trans; - - /* Calculate the address (array index) for all 8 transitions. */ - addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y, - ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi); - - /* load lower 32 bits of 16 transactions at once. */ - *tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0])); - - next_input = _mm256_srli_epi32(next_input, CHAR_BIT); - - /* load high 32 bits of 16 transactions at once. */ - *tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0])); - - return next_input; -} - -/* - * Execute first transition for up to 16 flows in parallel. - * next_input should contain one input byte for up to 16 flows. - * msk - mask of active flows. - * tr_lo contains low 32 bits for up to 16 transitions. - * tr_hi contains high 32 bits for up to 16 transitions. - */ -static __rte_always_inline void -first_trans8(const struct acl_flow_avx512 *flow, __m256i next_input, - __mmask8 msk, __m256i *tr_lo, __m256i *tr_hi) +_F_(gather_bytes)(__m256i zero, const __m256i p[2], const uint32_t m[2], + uint32_t bnum) { - const int32_t *tr; - __m256i addr, root; - - tr = (const int32_t *)(uintptr_t)flow->trans; - - addr = _mm256_set1_epi32(UINT8_MAX); - root = _mm256_set1_epi32(flow->root_index); - - addr = _mm256_and_si256(next_input, addr); - addr = _mm256_add_epi32(root, addr); - - /* load lower 32 bits of 16 transactions at once. */ - *tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr, - sizeof(flow->trans[0])); - - /* load high 32 bits of 16 transactions at once. */ - *tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1), - sizeof(flow->trans[0])); -} - -/* - * Load and return next 4 input bytes for up to 16 flows in parallel. - * pdata - 8x2 pointers to flow input data - * mask - mask of active flows. 
- * di - data indexes for these 16 flows. - */ -static inline __m256i -get_next_bytes_avx512x8(const struct acl_flow_avx512 *flow, __m256i pdata[2], - uint32_t msk, __m256i *di, uint32_t bnum) -{ - const int32_t *div; - uint32_t m[2]; - __m256i one, zero, t, p[2]; __m128i inp[2]; - div = (const int32_t *)flow->data_index; - - one = _mm256_set1_epi32(1); - zero = _mm256_xor_si256(one, one); - - /* load data offsets for given indexes */ - t = _mm256_mmask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0])); - - /* increment data indexes */ - *di = _mm256_mask_add_epi32(*di, msk, *di, one); - - /* - * unsigned expand 32-bit indexes to 64-bit - * (for later pointer arithmetic), i.e: - * for (i = 0; i != 16; i++) - * p[i/8].u64[i%8] = (uint64_t)t.u32[i]; - */ - p[0] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[0].y, t); - p[1] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[1].y, t); - - p[0] = _mm256_add_epi64(p[0], pdata[0]); - p[1] = _mm256_add_epi64(p[1], pdata[1]); - - /* load input byte(s), either one or four */ - - m[0] = msk & YMM_PTR_MSK; - m[1] = msk >> YMM_PTR_NUM; - if (bnum == sizeof(uint8_t)) { inp[0] = _m256_mask_gather_epi8x4(p[0], m[0]); inp[1] = _m256_mask_gather_epi8x4(p[1], m[1]); } else { inp[0] = _mm256_mmask_i64gather_epi32( - _mm256_castsi256_si128(zero), m[0], p[0], - NULL, sizeof(uint8_t)); + _mm256_castsi256_si128(zero), + m[0], p[0], NULL, sizeof(uint8_t)); inp[1] = _mm256_mmask_i64gather_epi32( - _mm256_castsi256_si128(zero), m[1], p[1], - NULL, sizeof(uint8_t)); + _mm256_castsi256_si128(zero), + m[1], p[1], NULL, sizeof(uint8_t)); } - /* squeeze input into one 512-bit register */ + /* squeeze input into one 256-bit register */ return _mm256_permutex2var_epi32(_mm256_castsi128_si256(inp[0]), - ymm_pminp.y, _mm256_castsi128_si256(inp[1])); -} - -/* - * Start up to 16 new flows. - * num - number of flows to start - * msk - mask of new flows. - * pdata - pointers to flow input data - * idx - match indexed for given flows - * di - data indexes for these flows. - */ -static inline void -start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk, - __m256i pdata[2], __m256i *idx, __m256i *di) -{ - uint32_t n, m[2], nm[2]; - __m256i ni, nd[2]; - - m[0] = msk & YMM_PTR_MSK; - m[1] = msk >> YMM_PTR_NUM; - - n = __builtin_popcount(m[0]); - nm[0] = (1 << n) - 1; - nm[1] = (1 << (num - n)) - 1; - - /* load input data pointers for new flows */ - nd[0] = _mm256_maskz_loadu_epi64(nm[0], - flow->idata + flow->num_packets); - nd[1] = _mm256_maskz_loadu_epi64(nm[1], - flow->idata + flow->num_packets + n); - - /* calculate match indexes of new flows */ - ni = _mm256_set1_epi32(flow->num_packets); - ni = _mm256_add_epi32(ni, ymm_idx_add.y); - - /* merge new and existing flows data */ - pdata[0] = _mm256_mask_expand_epi64(pdata[0], m[0], nd[0]); - pdata[1] = _mm256_mask_expand_epi64(pdata[1], m[1], nd[1]); - - /* update match and data indexes */ - *idx = _mm256_mask_expand_epi32(*idx, msk, ni); - *di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di); - - flow->num_packets += num; -} - -/* - * Process found matches for up to 16 flows. - * fmsk - mask of active flows - * rmsk - mask of found matches - * pdata - pointers to flow input data - * di - data indexes for these flows - * idx - match indexed for given flows - * tr_lo contains low 32 bits for up to 8 transitions. - * tr_hi contains high 32 bits for up to 8 transitions. 
- */ -static inline uint32_t -match_process_avx512x8(struct acl_flow_avx512 *flow, uint32_t *fmsk, - uint32_t *rmsk, __m256i pdata[2], __m256i *di, __m256i *idx, - __m256i *tr_lo, __m256i *tr_hi) -{ - uint32_t n; - __m256i res; - - if (rmsk[0] == 0) - return 0; - - /* extract match indexes */ - res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y); - - /* mask matched transitions to nop */ - tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y); - tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y); - - /* save found match indexes */ - _mm256_mask_i32scatter_epi32(flow->matches, rmsk[0], - idx[0], res, sizeof(flow->matches[0])); - - /* update masks and start new flows for matches */ - n = update_flow_mask(flow, fmsk, rmsk); - start_flow8(flow, n, rmsk[0], pdata, idx, di); - - return n; -} - -/* - * Test for matches ut to 32 (2x16) flows at once, - * if matches exist - process them and start new flows. - */ -static inline void -match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2], - __m256i pdata[4], __m256i di[2], __m256i idx[2], __m256i inp[2], - __m256i tr_lo[2], __m256i tr_hi[2]) -{ - uint32_t n[2]; - uint32_t rm[2]; - - /* check for matches */ - rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y); - rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y); - - /* till unprocessed matches exist */ - while ((rm[0] | rm[1]) != 0) { - - /* process matches and start new flows */ - n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0], - &di[0], &idx[0], &tr_lo[0], &tr_hi[0]); - n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[2], - &di[1], &idx[1], &tr_lo[1], &tr_hi[1]); - - /* execute first transition for new flows, if any */ - - if (n[0] != 0) { - inp[0] = get_next_bytes_avx512x8(flow, &pdata[0], - rm[0], &di[0], flow->first_load_sz); - first_trans8(flow, inp[0], rm[0], &tr_lo[0], - &tr_hi[0]); - rm[0] = _mm256_test_epi32_mask(tr_lo[0], - ymm_match_mask.y); - } - - if (n[1] != 0) { - inp[1] = get_next_bytes_avx512x8(flow, &pdata[2], - rm[1], &di[1], flow->first_load_sz); - first_trans8(flow, inp[1], rm[1], &tr_lo[1], - &tr_hi[1]); - rm[1] = _mm256_test_epi32_mask(tr_lo[1], - ymm_match_mask.y); - } - } -} - -/* - * Perform search for up to 32 flows in parallel. - * Use two sets of metadata, each serves 16 flows max. - * So in fact we perform search for 2x16 flows. 
- */ -static inline void -search_trie_avx512x8x2(struct acl_flow_avx512 *flow) -{ - uint32_t fm[2]; - __m256i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2]; - - /* first 1B load */ - start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]); - start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[2], &idx[1], &di[1]); - - in[0] = get_next_bytes_avx512x8(flow, &pdata[0], UINT8_MAX, &di[0], - flow->first_load_sz); - in[1] = get_next_bytes_avx512x8(flow, &pdata[2], UINT8_MAX, &di[1], - flow->first_load_sz); - - first_trans8(flow, in[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]); - first_trans8(flow, in[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]); - - fm[0] = UINT8_MAX; - fm[1] = UINT8_MAX; - - /* match check */ - match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in, - tr_lo, tr_hi); - - while ((fm[0] | fm[1]) != 0) { - - /* load next 4B */ - - in[0] = get_next_bytes_avx512x8(flow, &pdata[0], fm[0], - &di[0], sizeof(uint32_t)); - in[1] = get_next_bytes_avx512x8(flow, &pdata[2], fm[1], - &di[1], sizeof(uint32_t)); - - /* main 4B loop */ - - in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); - in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); - - in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); - in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); - - in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); - in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); - - in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]); - in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]); - - /* check for matches */ - match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in, - tr_lo, tr_hi); - } -} - -/* - * resolve match index to actual result/priority offset. - */ -static inline __m256i -resolve_match_idx_avx512x8(__m256i mi) -{ - RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) != - 1 << (match_log + 2)); - return _mm256_slli_epi32(mi, match_log); + _SV_(pminp), _mm256_castsi128_si256(inp[1])); } -/* - * Resolve multiple matches for the same flow based on priority. - */ -static inline __m256i -resolve_pri_avx512x8(const int32_t res[], const int32_t pri[], - const uint32_t match[], __mmask8 msk, uint32_t nb_trie, - uint32_t nb_skip) -{ - uint32_t i; - const uint32_t *pm; - __mmask16 m; - __m256i cp, cr, np, nr, mch; - - const __m256i zero = _mm256_set1_epi32(0); - - /* get match indexes */ - mch = _mm256_maskz_loadu_epi32(msk, match); - mch = resolve_match_idx_avx512x8(mch); - - /* read result and priority values for first trie */ - cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0])); - cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0])); - - /* - * read result and priority values for next tries and select one - * with highest priority. 
- */ - for (i = 1, pm = match + nb_skip; i != nb_trie; - i++, pm += nb_skip) { - - mch = _mm256_maskz_loadu_epi32(msk, pm); - mch = resolve_match_idx_avx512x8(mch); - - nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, - sizeof(res[0])); - np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, - sizeof(pri[0])); - - m = _mm256_cmpgt_epi32_mask(cp, np); - cr = _mm256_mask_mov_epi32(nr, m, cr); - cp = _mm256_mask_mov_epi32(np, m, cp); - } - - return cr; -} - -/* - * Resolve num (<= 8) matches for single category - */ -static inline void -resolve_sc_avx512x8(uint32_t result[], const int32_t res[], - const int32_t pri[], const uint32_t match[], uint32_t nb_pkt, - uint32_t nb_trie, uint32_t nb_skip) -{ - __mmask8 msk; - __m256i cr; - - msk = (1 << nb_pkt) - 1; - cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip); - _mm256_mask_storeu_epi32(result, msk, cr); -} +#include "acl_run_avx512_common.h" /* - * Resolve matches for single category + * Perform search for up to (2 * 8) flows in parallel. + * Use two sets of metadata, each serves 8 flows max. */ -static inline void -resolve_sc_avx512x8x2(uint32_t result[], - const struct rte_acl_match_results pr[], const uint32_t match[], - uint32_t nb_pkt, uint32_t nb_trie) -{ - uint32_t j, k, n; - const int32_t *res, *pri; - __m256i cr[2]; - - res = (const int32_t *)pr->results; - pri = pr->priority; - - for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) { - - j = k + MASK8_BIT; - - cr[0] = resolve_pri_avx512x8(res, pri, match + k, UINT8_MAX, - nb_trie, nb_pkt); - cr[1] = resolve_pri_avx512x8(res, pri, match + j, UINT8_MAX, - nb_trie, nb_pkt); - - _mm256_storeu_si256((void *)(result + k), cr[0]); - _mm256_storeu_si256((void *)(result + j), cr[1]); - } - - n = nb_pkt - k; - if (n != 0) { - if (n > MASK8_BIT) { - resolve_sc_avx512x8(result + k, res, pri, match + k, - MASK8_BIT, nb_trie, nb_pkt); - k += MASK8_BIT; - n -= MASK8_BIT; - } - resolve_sc_avx512x8(result + k, res, pri, match + k, n, - nb_trie, nb_pkt); - } -} - static inline int search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t total_packets, uint32_t categories) @@ -624,7 +219,7 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data, acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets); /* process the trie */ - search_trie_avx512x8x2(&flow); + _F_(search_trie)(&flow); } /* resolve matches */ @@ -632,7 +227,7 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data, (ctx->trans_table + ctx->match_index); if (categories == 1) - resolve_sc_avx512x8x2(results, pr, match, total_packets, + _F_(resolve_single_cat)(results, pr, match, total_packets, ctx->num_tries); else resolve_mcle8_avx512x1(results, pr, match, total_packets, @@ -640,3 +235,19 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data, return 0; } + +#undef _SIMD_PTR_MSK_ +#undef _SIMD_PTR_NUM_ +#undef _SIMD_FLOW_MSK_ +#undef _SIMD_FLOW_NUM_ +#undef _SIMD_MASK_MAX_ +#undef _SIMD_MASK_BIT_ +#undef _M_GI_ +#undef _M_MGI_ +#undef _M_SI_ +#undef _M_I_ +#undef _F_ +#undef _SV_ +#undef _SC_ +#undef _T_mask +#undef _T_simd From patchwork Tue Oct 6 15:03:15 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ananyev, Konstantin" X-Patchwork-Id: 79796 X-Patchwork-Delegate: david.marchand@redhat.com Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from dpdk.org (dpdk.org [92.243.14.124]) by 
inbox.dpdk.org (Postfix) with ESMTP id 6A4E8A04BB; Tue, 6 Oct 2020 17:13:32 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 4C26B1BC86; Tue, 6 Oct 2020 17:08:47 +0200 (CEST) Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id 6F3F81BC86 for ; Tue, 6 Oct 2020 17:08:45 +0200 (CEST) IronPort-SDR: 2qi5rg6mwg2TZEvCKnALyEVTKIQOtpyG1n/mv6TOlfzynCY4T6rRN+r/ga15Dh4m0VjSPHeGrq JMX0pttjrgWg== X-IronPort-AV: E=McAfee;i="6000,8403,9765"; a="143919720" X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="143919720" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Oct 2020 08:03:54 -0700 IronPort-SDR: 11gweS4OZgQKC91+TxMJSFMrnrEFuYxoTlLXl2xv+iQEmDQDw4wEg/bTTmi0GTvRWc09RkrP+o d0NABzXifqug== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="518315488" Received: from sivswdev08.ir.intel.com ([10.237.217.47]) by fmsmga005.fm.intel.com with ESMTP; 06 Oct 2020 08:03:52 -0700 From: Konstantin Ananyev To: dev@dpdk.org Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev Date: Tue, 6 Oct 2020 16:03:15 +0100 Message-Id: <20201006150316.5776-14-konstantin.ananyev@intel.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com> References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com> Subject: [dpdk-dev] [PATCH v4 13/14] test/acl: add AVX512 classify support X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Add AVX512 classify to the test coverage. Signed-off-by: Konstantin Ananyev --- app/test/test_acl.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/app/test/test_acl.c b/app/test/test_acl.c index 333b347579..5b32347954 100644 --- a/app/test/test_acl.c +++ b/app/test/test_acl.c @@ -278,8 +278,8 @@ test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[], /* set given classify alg, skip test if alg is not supported */ ret = rte_acl_set_ctx_classify(acx, alg); - if (ret == -ENOTSUP) - return 0; + if (ret != 0) + return (ret == -ENOTSUP) ? 
0 : ret; /** * these will run quite a few times, it's necessary to test code paths @@ -341,6 +341,8 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[], RTE_ACL_CLASSIFY_AVX2, RTE_ACL_CLASSIFY_NEON, RTE_ACL_CLASSIFY_ALTIVEC, + RTE_ACL_CLASSIFY_AVX512X16, + RTE_ACL_CLASSIFY_AVX512X32, }; /* swap all bytes in the data to network order */ From patchwork Tue Oct 6 15:03:16 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ananyev, Konstantin" X-Patchwork-Id: 79797 X-Patchwork-Delegate: david.marchand@redhat.com Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id F2BB1A04BB; Tue, 6 Oct 2020 17:13:54 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 978861BB4F; Tue, 6 Oct 2020 17:08:54 +0200 (CEST) Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id BA3CB1BA45 for ; Tue, 6 Oct 2020 17:08:52 +0200 (CEST) IronPort-SDR: z4fiwkR+JB2MUfzeGiNz5WbRiUWXZ8F2oVZKmtzyEtDIyzqP/uIUrfmp8c6/uVdyyGSDHT2WKk iOEYZ28tRkmw== X-IronPort-AV: E=McAfee;i="6000,8403,9765"; a="143919739" X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="143919739" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Oct 2020 08:03:55 -0700 IronPort-SDR: e0vyCU4jZDN311zIK8rzE30vhgIuRgE52UL/zcLnB+V+lK0k+EqsZphUVwlCdeaC76k+gEVsHG 5gQdXEJ9gSKQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,343,1596524400"; d="scan'208";a="518315498" Received: from sivswdev08.ir.intel.com ([10.237.217.47]) by fmsmga005.fm.intel.com with ESMTP; 06 Oct 2020 08:03:54 -0700 From: Konstantin Ananyev To: dev@dpdk.org Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev Date: Tue, 6 Oct 2020 16:03:16 +0100 Message-Id: <20201006150316.5776-15-konstantin.ananyev@intel.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20201006150316.5776-1-konstantin.ananyev@intel.com> References: <20201005184526.7465-1-konstantin.ananyev@intel.com> <20201006150316.5776-1-konstantin.ananyev@intel.com> Subject: [dpdk-dev] [PATCH v4 14/14] app/acl: add AVX512 classify support X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Add ability to use AVX512 classify method. Signed-off-by: Konstantin Ananyev --- app/test-acl/main.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/app/test-acl/main.c b/app/test-acl/main.c index d9b65517cb..2a3a35a054 100644 --- a/app/test-acl/main.c +++ b/app/test-acl/main.c @@ -81,6 +81,14 @@ static const struct acl_alg acl_alg[] = { .name = "altivec", .alg = RTE_ACL_CLASSIFY_ALTIVEC, }, + { + .name = "avx512x16", + .alg = RTE_ACL_CLASSIFY_AVX512X16, + }, + { + .name = "avx512x32", + .alg = RTE_ACL_CLASSIFY_AVX512X32, + }, }; static struct {