From patchwork Fri Aug 7 16:28:23 2020
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 75311
X-Patchwork-Delegate: thomas@monjalon.net
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev, stable@dpdk.org
Date: Fri, 7 Aug 2020 17:28:23 +0100
Message-Id: <20200807162829.11690-2-konstantin.ananyev@intel.com>
In-Reply-To: <20200807162829.11690-1-konstantin.ananyev@intel.com>
References: <20200807162829.11690-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2

Right now the dummy version of rte_acl_classify_avx2() is defined only when neither x86 nor AVX2 support is detected, though it should depend on the absence of AVX2 support alone.
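In other words, after this change the two stubs in rte_acl.c are guarded as sketched below (a simplified excerpt mirroring the diff that follows: the AVX2 stub now depends only on compiler AVX2 support, while the SSE stub stays under the architecture check):

#ifndef CC_AVX2_SUPPORT
/* compiler cannot emit AVX2: provide a stub on any architecture */
int
rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
	__rte_unused const uint8_t **data,
	__rte_unused uint32_t *results,
	__rte_unused uint32_t num,
	__rte_unused uint32_t categories)
{
	return -ENOTSUP;
}
#endif

#ifndef RTE_ARCH_X86
/* non-x86 build: the SSE stub remains tied to the architecture check */
int
rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx,
	__rte_unused const uint8_t **data,
	__rte_unused uint32_t *results,
	__rte_unused uint32_t num,
	__rte_unused uint32_t categories)
{
	return -ENOTSUP;
}
#endif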
Fixes: e53ce4e41379 ("acl: remove use of weak functions") Cc: stable@dpdk.org Signed-off-by: Konstantin Ananyev --- lib/librte_acl/rte_acl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c index 777ec4d34..715b02359 100644 --- a/lib/librte_acl/rte_acl.c +++ b/lib/librte_acl/rte_acl.c @@ -16,7 +16,6 @@ static struct rte_tailq_elem rte_acl_tailq = { }; EAL_REGISTER_TAILQ(rte_acl_tailq) -#ifndef RTE_ARCH_X86 #ifndef CC_AVX2_SUPPORT /* * If the compiler doesn't support AVX2 instructions, @@ -33,6 +32,7 @@ rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx, } #endif +#ifndef RTE_ARCH_X86 int rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx, __rte_unused const uint8_t **data,

From patchwork Fri Aug 7 16:28:24 2020
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 75312
X-Patchwork-Delegate: thomas@monjalon.net
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev
Date: Fri, 7 Aug 2020 17:28:24 +0100
Message-Id: <20200807162829.11690-3-konstantin.ananyev@intel.com>
In-Reply-To: <20200807162829.11690-1-konstantin.ananyev@intel.com>
References: <20200807162829.11690-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH 20.11 2/7] app/acl: few small improvements

- enhance output to print extra stats
- use rte_rdtsc_precise() for cycle measurements

Signed-off-by: Konstantin Ananyev --- app/test-acl/main.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/app/test-acl/main.c b/app/test-acl/main.c index 0a5dfb621..d9b65517c 100644 --- a/app/test-acl/main.c +++ b/app/test-acl/main.c @@ -862,9 +862,10 @@ search_ip5tuples(__rte_unused void *arg) { uint64_t pkt, start, tm; uint32_t i, lcore; + long double st; lcore = rte_lcore_id(); - start =
rte_rdtsc(); + start = rte_rdtsc_precise(); pkt = 0; for (i = 0; i != config.iter_num; i++) { @@ -872,12 +873,16 @@ search_ip5tuples(__rte_unused void *arg) config.trace_step, config.alg.name); } - tm = rte_rdtsc() - start; + tm = rte_rdtsc_precise() - start; + + st = (long double)tm / rte_get_timer_hz(); dump_verbose(DUMP_NONE, stdout, "%s @lcore %u: %" PRIu32 " iterations, %" PRIu64 " pkts, %" - PRIu32 " categories, %" PRIu64 " cycles, %#Lf cycles/pkt\n", - __func__, lcore, i, pkt, config.run_categories, - tm, (pkt == 0) ? 0 : (long double)tm / pkt); + PRIu32 " categories, %" PRIu64 " cycles (%.2Lf sec), " + "%.2Lf cycles/pkt, %.2Lf pkt/sec\n", + __func__, lcore, i, pkt, + config.run_categories, tm, st, + (pkt == 0) ? 0 : (long double)tm / pkt, pkt / st); return 0; }

From patchwork Fri Aug 7 16:28:25 2020
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 75313
X-Patchwork-Delegate: thomas@monjalon.net
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev
Date: Fri, 7 Aug 2020 17:28:25 +0100
Message-Id: <20200807162829.11690-4-konstantin.ananyev@intel.com>
In-Reply-To: <20200807162829.11690-1-konstantin.ananyev@intel.com>
References: <20200807162829.11690-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH 20.11 3/7] acl: remove unused enum value

Remove the unused enum value RTE_ACL_CLASSIFY_NUM. This value is not used inside DPDK, while it prevents adding new classify algorithms without causing an ABI breakage. Note that this change introduces a formal ABI incompatibility with previous versions of the ACL library.
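For context, a sketch of the enum as it now stands in rte_acl.h (values reproduced from the header, comments abridged). With the terminal counter gone, a new method such as the AVX512 one added later in this series can simply be appended without shifting a value that applications may have compiled in:

enum rte_acl_classify_alg {
	RTE_ACL_CLASSIFY_DEFAULT = 0,
	RTE_ACL_CLASSIFY_SCALAR = 1,  /* always available */
	RTE_ACL_CLASSIFY_SSE = 2,     /* requires SSE4.1 support */
	RTE_ACL_CLASSIFY_AVX2 = 3,    /* requires AVX2 support */
	RTE_ACL_CLASSIFY_NEON = 4,    /* requires NEON support */
	RTE_ACL_CLASSIFY_ALTIVEC = 5, /* requires ALTIVEC support */
	/* new classify methods can now be appended here without an ABI break */
};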
Signed-off-by: Konstantin Ananyev --- lib/librte_acl/rte_acl.h | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h index aa22e70c6..b814423a6 100644 --- a/lib/librte_acl/rte_acl.h +++ b/lib/librte_acl/rte_acl.h @@ -241,7 +241,6 @@ enum rte_acl_classify_alg { RTE_ACL_CLASSIFY_AVX2 = 3, /**< requires AVX2 support. */ RTE_ACL_CLASSIFY_NEON = 4, /**< requires NEON support. */ RTE_ACL_CLASSIFY_ALTIVEC = 5, /**< requires ALTIVEC support. */ - RTE_ACL_CLASSIFY_NUM /* should always be the last one. */ }; /**

From patchwork Fri Aug 7 16:28:26 2020
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 75314
X-Patchwork-Delegate: thomas@monjalon.net
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev
Date: Fri, 7 Aug 2020 17:28:26 +0100
Message-Id: <20200807162829.11690-5-konstantin.ananyev@intel.com>
In-Reply-To: <20200807162829.11690-1-konstantin.ananyev@intel.com>
References: <20200807162829.11690-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH 20.11 4/7] acl: add infrastructure to support AVX512 classify

Add the necessary changes to support the new AVX512-specific ACL classify algorithm:
- changes in meson.build and Makefile to check that the build tools (compiler, assembler, etc.) properly support AVX512.
- a dummy rte_acl_classify_avx512() for targets where the AVX512 implementation cannot be properly supported.
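As a usage illustration (not part of this patch): once this infrastructure is in, an application can opt in to the new method explicitly and keep the library default otherwise. select_classify_method() below is a hypothetical helper built only on existing DPDK APIs (rte_cpu_get_flag_enabled() and rte_acl_set_ctx_classify()):

#include <errno.h>
#include <rte_acl.h>
#include <rte_cpuflags.h>

/* Hypothetical helper: switch the context to the AVX512 classify method
 * when the running CPU advertises the ISA extensions this series builds
 * against; otherwise leave the current method untouched. */
static int
select_classify_method(struct rte_acl_ctx *ctx)
{
	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
			!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) ||
			!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512CD) ||
			!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW))
		return -ENOTSUP;

	/* a binary built without CC_AVX512_SUPPORT still links: the dummy
	 * rte_acl_classify_avx512() added by this patch simply returns
	 * -ENOTSUP at classify time, so callers should handle that as well */
	return rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX512);
}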
Signed-off-by: Konstantin Ananyev --- config/x86/meson.build | 3 ++- lib/librte_acl/Makefile | 26 ++++++++++++++++++++++ lib/librte_acl/acl.h | 4 ++++ lib/librte_acl/acl_run_avx512.c | 17 ++++++++++++++ lib/librte_acl/meson.build | 39 +++++++++++++++++++++++++++++++++ lib/librte_acl/rte_acl.c | 17 ++++++++++++++ lib/librte_acl/rte_acl.h | 1 + 7 files changed, 106 insertions(+), 1 deletion(-) create mode 100644 lib/librte_acl/acl_run_avx512.c diff --git a/config/x86/meson.build b/config/x86/meson.build index 6ec020ef6..c5626e914 100644 --- a/config/x86/meson.build +++ b/config/x86/meson.build @@ -23,7 +23,8 @@ foreach f:base_flags endforeach optional_flags = ['AES', 'PCLMUL', - 'AVX', 'AVX2', 'AVX512F', + 'AVX', 'AVX2', + 'AVX512F', 'AVX512VL', 'AVX512CD', 'AVX512BW', 'RDRND', 'RDSEED'] foreach f:optional_flags if cc.get_define('__@0@__'.format(f), args: machine_args) == '1' diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile index f4332b044..8bd469c2b 100644 --- a/lib/librte_acl/Makefile +++ b/lib/librte_acl/Makefile @@ -58,6 +58,32 @@ ifeq ($(CC_AVX2_SUPPORT), 1) CFLAGS_rte_acl.o += -DCC_AVX2_SUPPORT endif +# compile AVX512 version if: +# we are building 64-bit binary AND binutils can generate proper code +ifeq ($(CONFIG_RTE_ARCH_X86_64),y) + + BINUTIL_OK=$(shell AS=as; \ + $(RTE_SDK)/buildtools/binutils-avx512-check.sh && \ + echo 1) + ifeq ($(BINUTIL_OK), 1) + + # If the compiler supports AVX512 instructions, + # then add support for AVX512 classify method. + + CC_AVX512_FLAGS=$(shell $(CC) \ + -mavx512f -mavx512vl -mavx512cd -mavx512bw \ + -dM -E - &1 | grep AVX512 | wc -l) + ifeq ($(CC_AVX512_FLAGS), 4) + CFLAGS_acl_run_avx512.o += -mavx512f + CFLAGS_acl_run_avx512.o += -mavx512vl + CFLAGS_acl_run_avx512.o += -mavx512cd + CFLAGS_acl_run_avx512.o += -mavx512bw + SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx512.c + CFLAGS_rte_acl.o += -DCC_AVX512_SUPPORT + endif + endif +endif + # install this header file SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include += rte_acl.h diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h index 39d45a0c2..2022cf253 100644 --- a/lib/librte_acl/acl.h +++ b/lib/librte_acl/acl.h @@ -201,6 +201,10 @@ int rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t num, uint32_t categories); +int +rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data, + uint32_t *results, uint32_t num, uint32_t categories); + int rte_acl_classify_neon(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t num, uint32_t categories); diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c new file mode 100644 index 000000000..67274989d --- /dev/null +++ b/lib/librte_acl/acl_run_avx512.c @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#include "acl_run_sse.h" + +int +rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data, + uint32_t *results, uint32_t num, uint32_t categories) +{ + if (num >= MAX_SEARCHES_SSE8) + return search_sse_8(ctx, data, results, num, categories); + if (num >= MAX_SEARCHES_SSE4) + return search_sse_4(ctx, data, results, num, categories); + + return rte_acl_classify_scalar(ctx, data, results, num, categories); +} diff --git a/lib/librte_acl/meson.build b/lib/librte_acl/meson.build index d1e2c184c..b2fd61cad 100644 --- a/lib/librte_acl/meson.build +++ b/lib/librte_acl/meson.build @@ -27,6 +27,45 @@ if 
dpdk_conf.has('RTE_ARCH_X86') cflags += '-DCC_AVX2_SUPPORT' endif + # compile AVX512 version if: + # we are building 64-bit binary AND binutils can generate proper code + + if dpdk_conf.has('RTE_ARCH_X86_64') and binutils_ok.returncode() == 0 + + # compile AVX512 version if either: + # a. we have AVX512 supported in minimum instruction set + # baseline + # b. it's not minimum instruction set, but supported by + # compiler + # + # in former case, just add avx512 C file to files list + # in latter case, compile c file to static lib, using correct + # compiler flags, and then have the .o file from static lib + # linked into main lib. + + if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') and \ + dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512VL') and \ + dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512CD') and \ + dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512BW') + + sources += files('acl_run_avx512.c') + cflags += '-DCC_AVX512_SUPPORT' + + elif cc.has_multi_arguments('-mavx512f', '-mavx512vl', + '-mavx512cd', '-mavx512bw') + + avx512_tmplib = static_library('avx512_tmp', + 'acl_run_avx512.c', + dependencies: static_rte_eal, + c_args: cflags + + ['-mavx512f', '-mavx512vl', + '-mavx512cd', '-mavx512bw']) + objs += avx512_tmplib.extract_objects( + 'acl_run_avx512.c') + cflags += '-DCC_AVX512_SUPPORT' + endif + endif + elif dpdk_conf.has('RTE_ARCH_ARM') or dpdk_conf.has('RTE_ARCH_ARM64') cflags += '-flax-vector-conversions' sources += files('acl_run_neon.c') diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c index 715b02359..71b4afb08 100644 --- a/lib/librte_acl/rte_acl.c +++ b/lib/librte_acl/rte_acl.c @@ -16,6 +16,22 @@ static struct rte_tailq_elem rte_acl_tailq = { }; EAL_REGISTER_TAILQ(rte_acl_tailq) +#ifndef CC_AVX512_SUPPORT +/* + * If the compiler doesn't support AVX512 instructions, + * then the dummy one would be used instead for AVX512 classify method. + */ +int +rte_acl_classify_avx512(__rte_unused const struct rte_acl_ctx *ctx, + __rte_unused const uint8_t **data, + __rte_unused uint32_t *results, + __rte_unused uint32_t num, + __rte_unused uint32_t categories) +{ + return -ENOTSUP; +} +#endif + #ifndef CC_AVX2_SUPPORT /* * If the compiler doesn't support AVX2 instructions, @@ -77,6 +93,7 @@ static const rte_acl_classify_t classify_fns[] = { [RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2, [RTE_ACL_CLASSIFY_NEON] = rte_acl_classify_neon, [RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec, + [RTE_ACL_CLASSIFY_AVX512] = rte_acl_classify_avx512, }; /* by default, use always available scalar code path. */ diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h index b814423a6..6f39042fc 100644 --- a/lib/librte_acl/rte_acl.h +++ b/lib/librte_acl/rte_acl.h @@ -241,6 +241,7 @@ enum rte_acl_classify_alg { RTE_ACL_CLASSIFY_AVX2 = 3, /**< requires AVX2 support. */ RTE_ACL_CLASSIFY_NEON = 4, /**< requires NEON support. */ RTE_ACL_CLASSIFY_ALTIVEC = 5, /**< requires ALTIVEC support. */ + RTE_ACL_CLASSIFY_AVX512 = 6, /**< requires AVX512 support. 
*/ }; /**

From patchwork Fri Aug 7 16:28:27 2020
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 75315
X-Patchwork-Delegate: thomas@monjalon.net
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev
Date: Fri, 7 Aug 2020 17:28:27 +0100
Message-Id: <20200807162829.11690-6-konstantin.ananyev@intel.com>
In-Reply-To: <20200807162829.11690-1-konstantin.ananyev@intel.com>
References: <20200807162829.11690-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH 20.11 5/7] app/acl: add AVX512 classify support

Add ability to use AVX512 classify method.
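For illustration, the way an application (or the test app) could map the new algorithm name onto a context might look like the hypothetical helper below; set_alg_by_name() is not part of the patch, it only walks the acl_alg[] table that the diff below extends and calls the existing rte_acl_set_ctx_classify() API:

#include <errno.h>
#include <string.h>
#include <rte_acl.h>
#include <rte_common.h>

/* Hypothetical helper: translate a name such as "avx512" into the matching
 * enum rte_acl_classify_alg entry and apply it to the given context. */
static int
set_alg_by_name(struct rte_acl_ctx *ctx, const char *name)
{
	uint32_t i;

	for (i = 0; i != RTE_DIM(acl_alg); i++) {
		if (strcmp(name, acl_alg[i].name) == 0)
			return rte_acl_set_ctx_classify(ctx, acl_alg[i].alg);
	}
	return -EINVAL;
}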
Signed-off-by: Konstantin Ananyev --- app/test-acl/main.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/app/test-acl/main.c b/app/test-acl/main.c index d9b65517c..19b714335 100644 --- a/app/test-acl/main.c +++ b/app/test-acl/main.c @@ -81,6 +81,10 @@ static const struct acl_alg acl_alg[] = { .name = "altivec", .alg = RTE_ACL_CLASSIFY_ALTIVEC, }, + { + .name = "avx512", + .alg = RTE_ACL_CLASSIFY_AVX512, + }, }; static struct {

From patchwork Fri Aug 7 16:28:28 2020
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 75316
X-Patchwork-Delegate: thomas@monjalon.net
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev
Date: Fri, 7 Aug 2020 17:28:28 +0100
Message-Id: <20200807162829.11690-7-konstantin.ananyev@intel.com>
In-Reply-To: <20200807162829.11690-1-konstantin.ananyev@intel.com>
References: <20200807162829.11690-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH 20.11 6/7] acl: introduce AVX512 classify implementation

Add search_avx512x8x2() which uses mostly 256-bit width registers/instructions and is able to process up to 16 flows in parallel.
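To place this in the dispatch chain: rte_acl_classify_avx512() (extended by the diff below) hands bursts of at least MAX_SEARCHES_AVX16 packets to the new search_avx512x8x2(), while smaller bursts keep falling back to the SSE and scalar paths. A minimal caller-side sketch, assuming a context already built with rte_acl_build() and switched to RTE_ACL_CLASSIFY_AVX512:

#include <rte_acl.h>

/* Classify a burst against a single category; results[] must provide
 * num * categories entries in the general case. With a large enough
 * burst this ends up in search_avx512x8x2(). */
static int
classify_burst(const struct rte_acl_ctx *ctx, const uint8_t *data[],
	uint32_t results[], uint32_t num)
{
	return rte_acl_classify(ctx, data, results, num, 1);
}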
Signed-off-by: Konstantin Ananyev --- lib/librte_acl/acl_run_avx512.c | 120 ++++++ lib/librte_acl/acl_run_avx512x8.h | 614 ++++++++++++++++++++++++++++++ 2 files changed, 734 insertions(+) create mode 100644 lib/librte_acl/acl_run_avx512x8.h diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c index 67274989d..8ee996679 100644 --- a/lib/librte_acl/acl_run_avx512.c +++ b/lib/librte_acl/acl_run_avx512.c @@ -4,10 +4,130 @@ #include "acl_run_sse.h" +/*sizeof(uint32_t) << match_log == sizeof(struct rte_acl_match_results)*/ +static const uint32_t match_log = 5; + +struct acl_flow_avx512 { + uint32_t num_packets; /* number of packets processed */ + uint32_t total_packets; /* max number of packets to process */ + uint32_t root_index; /* current root index */ + const uint64_t *trans; /* transition table */ + const uint32_t *data_index; /* input data indexes */ + const uint8_t **idata; /* input data */ + uint32_t *matches; /* match indexes */ +}; + +static inline void +acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx, + uint32_t trie, const uint8_t *data[], uint32_t *matches, + uint32_t total_packets) +{ + flow->num_packets = 0; + flow->total_packets = total_packets; + flow->root_index = ctx->trie[trie].root_index; + flow->trans = ctx->trans_table; + flow->data_index = ctx->trie[trie].data_index; + flow->idata = data; + flow->matches = matches; +} + +/* + * Resolve matches for multiple categories (LE 8, use 128b instuctions/regs) + */ +static inline void +resolve_mcle8_avx512x1(uint32_t result[], + const struct rte_acl_match_results pr[], const uint32_t match[], + uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie) +{ + const int32_t *pri; + const uint32_t *pm, *res; + uint32_t i, j, k, mi, mn; + __mmask8 msk; + xmm_t cp, cr, np, nr; + + res = pr->results; + pri = pr->priority; + + for (k = 0; k != nb_pkt; k++, result += nb_cat) { + + mi = match[k] << match_log; + + for (j = 0; j != nb_cat; j += RTE_ACL_RESULTS_MULTIPLIER) { + + cr = _mm_loadu_si128((const xmm_t *)(res + mi + j)); + cp = _mm_loadu_si128((const xmm_t *)(pri + mi + j)); + + for (i = 1, pm = match + nb_pkt; i != nb_trie; + i++, pm += nb_pkt) { + + mn = j + (pm[k] << match_log); + + nr = _mm_loadu_si128((const xmm_t *)(res + mn)); + np = _mm_loadu_si128((const xmm_t *)(pri + mn)); + + msk = _mm_cmpgt_epi32_mask(cp, np); + cr = _mm_mask_mov_epi32(nr, msk, cr); + cp = _mm_mask_mov_epi32(np, msk, cp); + } + + _mm_storeu_si128((xmm_t *)(result + j), cr); + } + } +} + +/* + * Resolve matches for multiple categories (GT 8, use 512b instuctions/regs) + */ +static inline void +resolve_mcgt8_avx512x1(uint32_t result[], + const struct rte_acl_match_results pr[], const uint32_t match[], + uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie) +{ + const int32_t *pri; + const uint32_t *pm, *res; + uint32_t i, k, mi; + __mmask16 cm, sm; + __m512i cp, cr, np, nr; + + const uint32_t match_log = 5; + + res = pr->results; + pri = pr->priority; + + cm = (1 << nb_cat) - 1; + + for (k = 0; k != nb_pkt; k++, result += nb_cat) { + + mi = match[k] << match_log; + + cr = _mm512_maskz_loadu_epi32(cm, res + mi); + cp = _mm512_maskz_loadu_epi32(cm, pri + mi); + + for (i = 1, pm = match + nb_pkt; i != nb_trie; + i++, pm += nb_pkt) { + + mi = pm[k] << match_log; + + nr = _mm512_maskz_loadu_epi32(cm, res + mi); + np = _mm512_maskz_loadu_epi32(cm, pri + mi); + + sm = _mm512_cmpgt_epi32_mask(cp, np); + cr = _mm512_mask_mov_epi32(nr, sm, cr); + cp = _mm512_mask_mov_epi32(np, sm, cp); + } + + 
_mm512_mask_storeu_epi32(result, cm, cr); + } +} + +#include "acl_run_avx512x8.h" + int rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t num, uint32_t categories) { + if (num >= MAX_SEARCHES_AVX16) + return search_avx512x8x2(ctx, data, results, num, categories); if (num >= MAX_SEARCHES_SSE8) return search_sse_8(ctx, data, results, num, categories); if (num >= MAX_SEARCHES_SSE4) diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h new file mode 100644 index 000000000..63b1d872f --- /dev/null +++ b/lib/librte_acl/acl_run_avx512x8.h @@ -0,0 +1,614 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#define NUM_AVX512X8X2 (2 * CHAR_BIT) +#define MSK_AVX512X8X2 (NUM_AVX512X8X2 - 1) + +static const rte_ymm_t ymm_match_mask = { + .u32 = { + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + }, +}; + +static const rte_ymm_t ymm_index_mask = { + .u32 = { + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + }, +}; + +static const rte_ymm_t ymm_trlo_idle = { + .u32 = { + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + }, +}; + +static const rte_ymm_t ymm_trhi_idle = { + .u32 = { + 0, 0, 0, 0, + 0, 0, 0, 0, + }, +}; + +static const rte_ymm_t ymm_shuffle_input = { + .u32 = { + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + }, +}; + +static const rte_ymm_t ymm_four_32 = { + .u32 = { + 4, 4, 4, 4, + 4, 4, 4, 4, + }, +}; + +static const rte_ymm_t ymm_idx_add = { + .u32 = { + 0, 1, 2, 3, + 4, 5, 6, 7, + }, +}; + +static const rte_ymm_t ymm_range_base = { + .u32 = { + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + }, +}; + +/* + * Calculate the address of the next transition for + * all types of nodes. Note that only DFA nodes and range + * nodes actually transition to another node. Match + * nodes not supposed to be encountered here. + * For quad range nodes: + * Calculate number of range boundaries that are less than the + * input value. Range boundaries for each node are in signed 8 bit, + * ordered from -128 to 127. + * This is effectively a popcnt of bytes that are greater than the + * input byte. + * Single nodes are processed in the same ways as quad range nodes. + */ +static __rte_always_inline ymm_t +calc_addr8(ymm_t index_mask, ymm_t next_input, ymm_t shuffle_input, + ymm_t four_32, ymm_t range_base, ymm_t tr_lo, ymm_t tr_hi) +{ + ymm_t addr, in, node_type, r, t; + ymm_t dfa_msk, dfa_ofs, quad_ofs; + + t = _mm256_xor_si256(index_mask, index_mask); + in = _mm256_shuffle_epi8(next_input, shuffle_input); + + /* Calc node type and node addr */ + node_type = _mm256_andnot_si256(index_mask, tr_lo); + addr = _mm256_and_si256(index_mask, tr_lo); + + /* mask for DFA type(0) nodes */ + dfa_msk = _mm256_cmpeq_epi32(node_type, t); + + /* DFA calculations. 
*/ + r = _mm256_srli_epi32(in, 30); + r = _mm256_add_epi8(r, range_base); + t = _mm256_srli_epi32(in, 24); + r = _mm256_shuffle_epi8(tr_hi, r); + + dfa_ofs = _mm256_sub_epi32(t, r); + + /* QUAD/SINGLE calculations. */ + t = _mm256_cmpgt_epi8(in, tr_hi); + t = _mm256_lzcnt_epi32(t); + t = _mm256_srli_epi32(t, 3); + quad_ofs = _mm256_sub_epi32(four_32, t); + + /* blend DFA and QUAD/SINGLE. */ + t = _mm256_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk); + + /* calculate address for next transitions. */ + addr = _mm256_add_epi32(addr, t); + return addr; +} + +/* + * Process 8 transitions in parallel. + * tr_lo contains low 32 bits for 8 transitions. + * tr_hi contains high 32 bits for 8 transitions. + * next_input contains up to 4 input bytes for 8 flows. + */ +static __rte_always_inline ymm_t +transition8(ymm_t next_input, const uint64_t *trans, ymm_t *tr_lo, ymm_t *tr_hi) +{ + const int32_t *tr; + ymm_t addr; + + tr = (const int32_t *)(uintptr_t)trans; + + /* Calculate the address (array index) for all 8 transitions. */ + addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y, + ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi); + + /* load lower 32 bits of 8 transactions at once. */ + *tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0])); + + next_input = _mm256_srli_epi32(next_input, CHAR_BIT); + + /* load high 32 bits of 8 transactions at once. */ + *tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0])); + + return next_input; +} + +/* + * Execute first transition for up to 8 flows in parallel. + * next_input should contain one input byte for up to 8 flows. + * msk - mask of active flows. + * tr_lo contains low 32 bits for up to 8 transitions. + * tr_hi contains high 32 bits for up to 8 transitions. + */ +static __rte_always_inline void +first_trans8(const struct acl_flow_avx512 *flow, ymm_t next_input, + __mmask8 msk, ymm_t *tr_lo, ymm_t *tr_hi) +{ + const int32_t *tr; + ymm_t addr, root; + + tr = (const int32_t *)(uintptr_t)flow->trans; + + addr = _mm256_set1_epi32(UINT8_MAX); + root = _mm256_set1_epi32(flow->root_index); + + addr = _mm256_and_si256(next_input, addr); + addr = _mm256_add_epi32(root, addr); + + /* load lower 32 bits of 8 transactions at once. */ + *tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr, + sizeof(flow->trans[0])); + + /* load high 32 bits of 8 transactions at once. */ + *tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1), + sizeof(flow->trans[0])); +} + +/* + * Load and return next 4 input bytes for up to 8 flows in parallel. + * pdata - 8 pointers to flow input data + * mask - mask of active flows. + * di - data indexes for these 8 flows. + */ +static inline ymm_t +get_next_4bytes_avx512x8(const struct acl_flow_avx512 *flow, __m512i pdata, + __mmask8 mask, ymm_t *di) +{ + const int32_t *div; + ymm_t one, zero; + ymm_t inp, t; + __m512i p; + + div = (const int32_t *)flow->data_index; + + one = _mm256_set1_epi32(1); + zero = _mm256_xor_si256(one, one); + + /* load data offsets for given indexes */ + t = _mm256_mmask_i32gather_epi32(zero, mask, *di, div, sizeof(div[0])); + + /* increment data indexes */ + *di = _mm256_mask_add_epi32(*di, mask, *di, one); + + p = _mm512_cvtepu32_epi64(t); + p = _mm512_add_epi64(p, pdata); + + /* load input bytes */ + inp = _mm512_mask_i64gather_epi32(zero, mask, p, NULL, sizeof(uint8_t)); + return inp; +} + +/* + * Start up to 8 new flows. + * num - number of flows to start + * msk - mask of new flows. + * pdata - pointers to flow input data + * di - data indexes for these flows. 
+ */ +static inline void +start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk, + __m512i *pdata, ymm_t *idx, ymm_t *di) +{ + uint32_t nm; + ymm_t ni; + __m512i nd; + + /* load input data pointers for new flows */ + nm = (1 << num) - 1; + nd = _mm512_maskz_loadu_epi64(nm, flow->idata + flow->num_packets); + + /* calculate match indexes of new flows */ + ni = _mm256_set1_epi32(flow->num_packets); + ni = _mm256_add_epi32(ni, ymm_idx_add.y); + + /* merge new and existing flows data */ + *pdata = _mm512_mask_expand_epi64(*pdata, msk, nd); + *idx = _mm256_mask_expand_epi32(*idx, msk, ni); + *di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di); + + flow->num_packets += num; +} + +/* + * Update flow and result masks based on the number of unprocessed flows. + */ +static inline uint32_t +update_flow_mask8(const struct acl_flow_avx512 *flow, __mmask8 *fmsk, + __mmask8 *rmsk) +{ + uint32_t i, j, k, m, n; + + fmsk[0] ^= rmsk[0]; + m = rmsk[0]; + + k = __builtin_popcount(m); + n = flow->total_packets - flow->num_packets; + + if (n < k) { + /* reduce mask */ + for (i = k - n; i != 0; i--) { + j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m); + m ^= 1 << j; + } + } else + n = k; + + rmsk[0] = m; + fmsk[0] |= rmsk[0]; + + return n; +} + +/* + * Process found matches for up to 8 flows. + * fmsk - mask of active flows + * rmsk - maks of found matches + * pdata - pointers to flow input data + * di - data indexes for these flows + * idx - match indexed for given flows + * tr_lo contains low 32 bits for up to 8 transitions. + * tr_hi contains high 32 bits for up to 8 transitions. + */ +static inline uint32_t +match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk, + __mmask8 *rmsk, __m512i *pdata, ymm_t *di, ymm_t *idx, + ymm_t *tr_lo, ymm_t *tr_hi) +{ + uint32_t n; + ymm_t res; + + if (rmsk[0] == 0) + return 0; + + /* extract match indexes */ + res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y); + + /* mask matched transitions to nop */ + tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y); + tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y); + + /* save found match indexes */ + _mm256_mask_i32scatter_epi32(flow->matches, rmsk[0], + idx[0], res, sizeof(flow->matches[0])); + + /* update masks and start new flows for matches */ + n = update_flow_mask8(flow, fmsk, rmsk); + start_flow8(flow, n, rmsk[0], pdata, idx, di); + + return n; +} + + +static inline void +match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, __mmask8 fm[2], + __m512i pdata[2], ymm_t di[2], ymm_t idx[2], ymm_t inp[2], + ymm_t tr_lo[2], ymm_t tr_hi[2]) +{ + uint32_t n[2]; + __mmask8 rm[2]; + + /* check for matches */ + rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y); + rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y); + + /* till unprocessed matches exist */ + while ((rm[0] | rm[1]) != 0) { + + /* process matches and start new flows */ + n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0], + &di[0], &idx[0], &tr_lo[0], &tr_hi[0]); + n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[1], + &di[1], &idx[1], &tr_lo[1], &tr_hi[1]); + + /* execute first transition for new flows, if any */ + + if (n[0] != 0) { + inp[0] = get_next_4bytes_avx512x8(flow, pdata[0], rm[0], + &di[0]); + first_trans8(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]); + + rm[0] = _mm256_test_epi32_mask(tr_lo[0], + ymm_match_mask.y); + } + + if (n[1] != 0) { + inp[1] = get_next_4bytes_avx512x8(flow, pdata[1], rm[1], + &di[1]); + first_trans8(flow, inp[1], rm[1], &tr_lo[1], 
&tr_hi[1]); + + rm[1] = _mm256_test_epi32_mask(tr_lo[1], + ymm_match_mask.y); + } + } +} + +/* + * Perform search for up to 16 flows in parallel. + * Use two sets of metadata, each serves 8 flows max. + * So in fact we perform search for 2x8 flows. + */ +static inline void +search_trie_avx512x8x2(struct acl_flow_avx512 *flow) +{ + __mmask8 fm[2]; + __m512i pdata[2]; + ymm_t di[2], idx[2], inp[2], tr_lo[2], tr_hi[2]; + + /* first 1B load */ + start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]); + start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[1], &idx[1], &di[1]); + + inp[0] = get_next_4bytes_avx512x8(flow, pdata[0], UINT8_MAX, &di[0]); + inp[1] = get_next_4bytes_avx512x8(flow, pdata[1], UINT8_MAX, &di[1]); + + first_trans8(flow, inp[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]); + first_trans8(flow, inp[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]); + + fm[0] = UINT8_MAX; + fm[1] = UINT8_MAX; + + /* match check */ + match_check_process_avx512x8x2(flow, fm, pdata, di, idx, inp, + tr_lo, tr_hi); + + while ((fm[0] | fm[1]) != 0) { + + /* load next 4B */ + + inp[0] = get_next_4bytes_avx512x8(flow, pdata[0], fm[0], + &di[0]); + inp[1] = get_next_4bytes_avx512x8(flow, pdata[1], fm[1], + &di[1]); + + /* main 4B loop */ + + inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]); + inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]); + inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]); + inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]); + inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]); + + /* check for matches */ + match_check_process_avx512x8x2(flow, fm, pdata, di, idx, inp, + tr_lo, tr_hi); + } +} + +/* + * resolve match index to actual result/priority offset. + */ +static inline ymm_t +resolve_match_idx_avx512x8(ymm_t mi) +{ + RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) != + 1 << (match_log + 2)); + return _mm256_slli_epi32(mi, match_log); +} + + +/* + * Resolve multiple matches for the same flow based on priority. 
+ */ +static inline ymm_t +resolve_pri_avx512x8(const int32_t res[], const int32_t pri[], + const uint32_t match[], __mmask8 msk, uint32_t nb_trie, + uint32_t nb_skip) +{ + uint32_t i; + const uint32_t *pm; + __mmask8 m; + ymm_t cp, cr, np, nr, mch; + + const ymm_t zero = _mm256_set1_epi32(0); + + mch = _mm256_maskz_loadu_epi32(msk, match); + mch = resolve_match_idx_avx512x8(mch); + + cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0])); + cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0])); + + for (i = 1, pm = match + nb_skip; i != nb_trie; + i++, pm += nb_skip) { + + mch = _mm256_maskz_loadu_epi32(msk, pm); + mch = resolve_match_idx_avx512x8(mch); + + nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, + sizeof(res[0])); + np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, + sizeof(pri[0])); + + m = _mm256_cmpgt_epi32_mask(cp, np); + cr = _mm256_mask_mov_epi32(nr, m, cr); + cp = _mm256_mask_mov_epi32(np, m, cp); + } + + return cr; +} + +/* + * Resolve num (<= 8) matches for single category + */ +static inline void +resolve_sc_avx512x8(uint32_t result[], const int32_t res[], const int32_t pri[], + const uint32_t match[], uint32_t nb_pkt, uint32_t nb_trie, + uint32_t nb_skip) +{ + __mmask8 msk; + ymm_t cr; + + msk = (1 << nb_pkt) - 1; + cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip); + _mm256_mask_storeu_epi32(result, msk, cr); +} + +/* + * Resolve matches for single category + */ +static inline void +resolve_sc_avx512x8x2(uint32_t result[], + const struct rte_acl_match_results pr[], const uint32_t match[], + uint32_t nb_pkt, uint32_t nb_trie) +{ + uint32_t i, j, k, n; + const uint32_t *pm; + const int32_t *res, *pri; + __mmask8 m[2]; + ymm_t cp[2], cr[2], np[2], nr[2], mch[2]; + + res = (const int32_t *)pr->results; + pri = pr->priority; + + for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) { + + j = k + CHAR_BIT; + + /* load match indexes for first trie */ + mch[0] = _mm256_loadu_si256((const ymm_t *)(match + k)); + mch[1] = _mm256_loadu_si256((const ymm_t *)(match + j)); + + mch[0] = resolve_match_idx_avx512x8(mch[0]); + mch[1] = resolve_match_idx_avx512x8(mch[1]); + + /* load matches and their priorities for first trie */ + + cr[0] = _mm256_i32gather_epi32(res, mch[0], sizeof(res[0])); + cr[1] = _mm256_i32gather_epi32(res, mch[1], sizeof(res[0])); + + cp[0] = _mm256_i32gather_epi32(pri, mch[0], sizeof(pri[0])); + cp[1] = _mm256_i32gather_epi32(pri, mch[1], sizeof(pri[0])); + + /* select match with highest priority */ + for (i = 1, pm = match + nb_pkt; i != nb_trie; + i++, pm += nb_pkt) { + + mch[0] = _mm256_loadu_si256((const ymm_t *)(pm + k)); + mch[1] = _mm256_loadu_si256((const ymm_t *)(pm + j)); + + mch[0] = resolve_match_idx_avx512x8(mch[0]); + mch[1] = resolve_match_idx_avx512x8(mch[1]); + + nr[0] = _mm256_i32gather_epi32(res, mch[0], + sizeof(res[0])); + nr[1] = _mm256_i32gather_epi32(res, mch[1], + sizeof(res[0])); + + np[0] = _mm256_i32gather_epi32(pri, mch[0], + sizeof(pri[0])); + np[1] = _mm256_i32gather_epi32(pri, mch[1], + sizeof(pri[0])); + + m[0] = _mm256_cmpgt_epi32_mask(cp[0], np[0]); + m[1] = _mm256_cmpgt_epi32_mask(cp[1], np[1]); + + cr[0] = _mm256_mask_mov_epi32(nr[0], m[0], cr[0]); + cr[1] = _mm256_mask_mov_epi32(nr[1], m[1], cr[1]); + + cp[0] = _mm256_mask_mov_epi32(np[0], m[0], cp[0]); + cp[1] = _mm256_mask_mov_epi32(np[1], m[1], cp[1]); + } + + _mm256_storeu_si256((ymm_t *)(result + k), cr[0]); + _mm256_storeu_si256((ymm_t *)(result + j), cr[1]); + } + + n = nb_pkt - k; + if (n != 0) 
{ + if (n > CHAR_BIT) { + resolve_sc_avx512x8(result + k, res, pri, match + k, + CHAR_BIT, nb_trie, nb_pkt); + k += CHAR_BIT; + n -= CHAR_BIT; + } + resolve_sc_avx512x8(result + k, res, pri, match + k, n, + nb_trie, nb_pkt); + } +} + +static inline int +search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data, + uint32_t *results, uint32_t total_packets, uint32_t categories) +{ + uint32_t i, *pm; + const struct rte_acl_match_results *pr; + struct acl_flow_avx512 flow; + uint32_t match[ctx->num_tries * total_packets]; + + for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) { + + /* setup for next trie */ + acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets); + + /* process the trie */ + search_trie_avx512x8x2(&flow); + } + + /* resolve matches */ + pr = (const struct rte_acl_match_results *) + (ctx->trans_table + ctx->match_index); + + if (categories == 1) + resolve_sc_avx512x8x2(results, pr, match, total_packets, + ctx->num_tries); + else if (categories <= RTE_ACL_MAX_CATEGORIES / 2) + resolve_mcle8_avx512x1(results, pr, match, total_packets, + categories, ctx->num_tries); + else + resolve_mcgt8_avx512x1(results, pr, match, total_packets, + categories, ctx->num_tries); + + return 0; +}

From patchwork Fri Aug 7 16:28:29 2020
X-Patchwork-Submitter: "Ananyev, Konstantin"
X-Patchwork-Id: 75317
X-Patchwork-Delegate: thomas@monjalon.net
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: jerinj@marvell.com, ruifeng.wang@arm.com, vladimir.medvedkin@intel.com, Konstantin Ananyev
Date: Fri, 7 Aug 2020 17:28:29 +0100
Message-Id: <20200807162829.11690-8-konstantin.ananyev@intel.com>
In-Reply-To: <20200807162829.11690-1-konstantin.ananyev@intel.com>
References: <20200807162829.11690-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH 20.11 7/7] acl: enhance AVX512 classify implementation

Add search_avx512x16x2() which uses mostly
512-bit width registers/instructions and is able to process up to 32 flows in parallel. Signed-off-by: Konstantin Ananyev --- These patch depends on: https://patches.dpdk.org/patch/70429/ to be applied first. lib/librte_acl/acl_run_avx512.c | 3 + lib/librte_acl/acl_run_avx512x16.h | 635 +++++++++++++++++++++++++++++ 2 files changed, 638 insertions(+) create mode 100644 lib/librte_acl/acl_run_avx512x16.h diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c index 8ee996679..332e359fb 100644 --- a/lib/librte_acl/acl_run_avx512.c +++ b/lib/librte_acl/acl_run_avx512.c @@ -121,11 +121,14 @@ resolve_mcgt8_avx512x1(uint32_t result[], } #include "acl_run_avx512x8.h" +#include "acl_run_avx512x16.h" int rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data, uint32_t *results, uint32_t num, uint32_t categories) { + if (num >= 2 * MAX_SEARCHES_AVX16) + return search_avx512x16x2(ctx, data, results, num, categories); if (num >= MAX_SEARCHES_AVX16) return search_avx512x8x2(ctx, data, results, num, categories); if (num >= MAX_SEARCHES_SSE8) diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h new file mode 100644 index 000000000..53216bda3 --- /dev/null +++ b/lib/librte_acl/acl_run_avx512x16.h @@ -0,0 +1,635 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#define MASK16_BIT (sizeof(__mmask16) * CHAR_BIT) + +#define NUM_AVX512X16X2 (2 * MASK16_BIT) +#define MSK_AVX512X16X2 (NUM_AVX512X16X2 - 1) + +static const __rte_x86_zmm_t zmm_match_mask = { + .u32 = { + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + RTE_ACL_NODE_MATCH, + }, +}; + +static const __rte_x86_zmm_t zmm_index_mask = { + .u32 = { + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + RTE_ACL_NODE_INDEX, + }, +}; + +static const __rte_x86_zmm_t zmm_trlo_idle = { + .u32 = { + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE, + }, +}; + +static const __rte_x86_zmm_t zmm_trhi_idle = { + .u32 = { + 0, 0, 0, 0, + 0, 0, 0, 0, + 0, 0, 0, 0, + 0, 0, 0, 0, + }, +}; + +static const __rte_x86_zmm_t zmm_shuffle_input = { + .u32 = { + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + 0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c, + }, +}; + +static const __rte_x86_zmm_t 
zmm_four_32 = { + .u32 = { + 4, 4, 4, 4, + 4, 4, 4, 4, + 4, 4, 4, 4, + 4, 4, 4, 4, + }, +}; + +static const __rte_x86_zmm_t zmm_idx_add = { + .u32 = { + 0, 1, 2, 3, + 4, 5, 6, 7, + 8, 9, 10, 11, + 12, 13, 14, 15, + }, +}; + +static const __rte_x86_zmm_t zmm_range_base = { + .u32 = { + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + 0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c, + }, +}; + +/* + * Calculate the address of the next transition for + * all types of nodes. Note that only DFA nodes and range + * nodes actually transition to another node. Match + * nodes not supposed to be encountered here. + * For quad range nodes: + * Calculate number of range boundaries that are less than the + * input value. Range boundaries for each node are in signed 8 bit, + * ordered from -128 to 127. + * This is effectively a popcnt of bytes that are greater than the + * input byte. + * Single nodes are processed in the same ways as quad range nodes. + */ +static __rte_always_inline __m512i +calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input, + __m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi) +{ + __mmask64 qm; + __mmask16 dfa_msk; + __m512i addr, in, node_type, r, t; + __m512i dfa_ofs, quad_ofs; + + t = _mm512_xor_si512(index_mask, index_mask); + in = _mm512_shuffle_epi8(next_input, shuffle_input); + + /* Calc node type and node addr */ + node_type = _mm512_andnot_si512(index_mask, tr_lo); + addr = _mm512_and_si512(index_mask, tr_lo); + + /* mask for DFA type(0) nodes */ + dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t); + + /* DFA calculations. */ + r = _mm512_srli_epi32(in, 30); + r = _mm512_add_epi8(r, range_base); + t = _mm512_srli_epi32(in, 24); + r = _mm512_shuffle_epi8(tr_hi, r); + + dfa_ofs = _mm512_sub_epi32(t, r); + + /* QUAD/SINGLE calculations. */ + qm = _mm512_cmpgt_epi8_mask(in, tr_hi); + t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX); + t = _mm512_lzcnt_epi32(t); + t = _mm512_srli_epi32(t, 3); + quad_ofs = _mm512_sub_epi32(four_32, t); + + /* blend DFA and QUAD/SINGLE. */ + t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs); + + /* calculate address for next transitions. */ + addr = _mm512_add_epi32(addr, t); + return addr; +} + +/* + * Process 8 transitions in parallel. + * tr_lo contains low 32 bits for 8 transition. + * tr_hi contains high 32 bits for 8 transition. + * next_input contains up to 4 input bytes for 8 flows. + */ +static __rte_always_inline __m512i +transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo, + __m512i *tr_hi) +{ + const int32_t *tr; + __m512i addr; + + tr = (const int32_t *)(uintptr_t)trans; + + /* Calculate the address (array index) for all 8 transitions. */ + addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z, + zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi); + + /* load lower 32 bits of 8 transactions at once. */ + *tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0])); + + next_input = _mm512_srli_epi32(next_input, CHAR_BIT); + + /* load high 32 bits of 8 transactions at once. 
*/ + *tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0])); + + return next_input; +} + +static __rte_always_inline void +first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input, + __mmask16 msk, __m512i *tr_lo, __m512i *tr_hi) +{ + const int32_t *tr; + __m512i addr, root; + + tr = (const int32_t *)(uintptr_t)flow->trans; + + addr = _mm512_set1_epi32(UINT8_MAX); + root = _mm512_set1_epi32(flow->root_index); + + addr = _mm512_and_si512(next_input, addr); + addr = _mm512_add_epi32(root, addr); + + /* load lower 32 bits of 8 transactions at once. */ + *tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr, + sizeof(flow->trans[0])); + + /* load high 32 bits of 8 transactions at once. */ + *tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1), + sizeof(flow->trans[0])); +} + +static inline __m512i +get_next_4bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2], + uint32_t msk, __m512i *di) +{ + const int32_t *div; + __m512i one, zero, t, p[2]; + ymm_t inp[2]; + + static const __rte_x86_zmm_t zmm_pminp = { + .u32 = { + 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, + 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, + }, + }; + + const __mmask16 pmidx_msk = 0x5555; + + static const __rte_x86_zmm_t zmm_pmidx[2] = { + [0] = { + .u32 = { + 0, 0, 1, 0, 2, 0, 3, 0, + 4, 0, 5, 0, 6, 0, 7, 0, + }, + }, + [1] = { + .u32 = { + 8, 0, 9, 0, 10, 0, 11, 0, + 12, 0, 13, 0, 14, 0, 15, 0, + }, + }, + }; + + div = (const int32_t *)flow->data_index; + + one = _mm512_set1_epi32(1); + zero = _mm512_xor_si512(one, one); + + t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0])); + + *di = _mm512_mask_add_epi32(*di, msk, *di, one); + + p[0] = _mm512_maskz_permutexvar_epi32(pmidx_msk, zmm_pmidx[0].z, t); + p[1] = _mm512_maskz_permutexvar_epi32(pmidx_msk, zmm_pmidx[1].z, t); + + p[0] = _mm512_add_epi64(p[0], pdata[0]); + p[1] = _mm512_add_epi64(p[1], pdata[1]); + + inp[0] = _mm512_mask_i64gather_epi32(_mm512_castsi512_si256(zero), + (msk & UINT8_MAX), p[0], NULL, sizeof(uint8_t)); + inp[1] = _mm512_mask_i64gather_epi32(_mm512_castsi512_si256(zero), + (msk >> CHAR_BIT), p[1], NULL, sizeof(uint8_t)); + + return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]), + zmm_pminp.z, _mm512_castsi256_si512(inp[1])); +} + +static inline void +start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk, + __m512i pdata[2], __m512i *idx, __m512i *di) +{ + uint32_t n, nm[2]; + __m512i ni, nd[2]; + + n = __builtin_popcount(msk & UINT8_MAX); + nm[0] = (1 << n) - 1; + nm[1] = (1 << (num - n)) - 1; + + nd[0] = _mm512_maskz_loadu_epi64(nm[0], + flow->idata + flow->num_packets); + nd[1] = _mm512_maskz_loadu_epi64(nm[1], + flow->idata + flow->num_packets + n); + + ni = _mm512_set1_epi32(flow->num_packets); + ni = _mm512_add_epi32(ni, zmm_idx_add.z); + + pdata[0] = _mm512_mask_expand_epi64(pdata[0], (msk & UINT8_MAX), nd[0]); + pdata[1] = _mm512_mask_expand_epi64(pdata[1], (msk >> CHAR_BIT), nd[1]); + + *idx = _mm512_mask_expand_epi32(*idx, msk, ni); + *di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di); + + flow->num_packets += num; +} + +static inline uint32_t +update_flow_mask16(const struct acl_flow_avx512 *flow, __mmask16 *fmsk, + __mmask16 *rmsk) +{ + uint32_t i, j, k, m, n; + + fmsk[0] ^= rmsk[0]; + m = rmsk[0]; + + k = __builtin_popcount(m); + n = flow->total_packets - flow->num_packets; + + if (n < k) { + /* reduce mask */ + for (i = k - n; i != 0; i--) { + j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m); + m ^= 1 << j; + } + } else + n = k; + + 
+static inline uint32_t
+update_flow_mask16(const struct acl_flow_avx512 *flow, __mmask16 *fmsk,
+	__mmask16 *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
+static inline uint32_t
+match_process_avx512x16(struct acl_flow_avx512 *flow, __mmask16 *fmsk,
+	__mmask16 *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx,
+	__m512i *tr_lo, __m512i *tr_hi)
+{
+	uint32_t n;
+	__m512i res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z);
+
+	/* mask matched transitions to nop */
+	tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z);
+	tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z);
+
+	/* save found match indexes */
+	_mm512_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask16(flow, fmsk, rmsk);
+	start_flow16(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+static inline void
+match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, __mmask16 fm[2],
+	__m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2],
+	__m512i tr_lo[2], __m512i tr_hi[2])
+{
+	uint32_t n[2];
+	__mmask16 rm[2];
+
+	/* check for matches */
+	rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z);
+	rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z);
+
+	while ((rm[0] | rm[1]) != 0) {
+
+		n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		if (n[0] != 0) {
+			inp[0] = get_next_4bytes_avx512x16(flow, &pdata[0],
+				rm[0], &di[0]);
+			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
+				&tr_hi[0]);
+			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
+				zmm_match_mask.z);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_4bytes_avx512x16(flow, &pdata[2],
+				rm[1], &di[1]);
+			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
+				&tr_hi[1]);
+			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
+				zmm_match_mask.z);
+		}
+	}
+}
+
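The mask bookkeeping above is dense: lanes that just produced a match are retired from the active mask and re-armed only while unprocessed packets remain. The scalar restatement below mirrors the logic of update_flow_mask16(); refill_mask_scalar is an illustrative name only, and 16-bit lane masks are assumed as in the patch.

/*
 * Scalar sketch of the lane-refill logic above: remove matched lanes
 * from the active set (fmsk), then re-arm only as many of them as
 * there are packets left to start.
 */
#include <stdint.h>

static uint32_t
refill_mask_scalar(uint16_t *fmsk, uint16_t *rmsk,
	uint32_t total_packets, uint32_t num_packets)
{
	uint32_t i, k, n, m;

	*fmsk ^= *rmsk;			/* retire lanes that matched */
	m = *rmsk;

	k = __builtin_popcount(m);	/* lanes asking for a new packet */
	n = total_packets - num_packets; /* packets not yet started */

	if (n < k) {
		/* not enough packets left: drop the highest set bits */
		for (i = k - n; i != 0; i--)
			m ^= 1u << (31 - __builtin_clz(m));
	} else
		n = k;

	*rmsk = m;
	*fmsk |= m;			/* re-arm lanes that got a packet */
	return n;			/* number of packets to start */
}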
+static inline void
+search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
+{
+	__mmask16 fm[2];
+	__m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
+
+	in[0] = get_next_4bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0]);
+	in[1] = get_next_4bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1]);
+
+	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT16_MAX;
+	fm[1] = UINT16_MAX;
+
+	/* match check */
+	match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = get_next_4bytes_avx512x16(flow, &pdata[0], fm[0],
+			&di[0]);
+		in[1] = get_next_4bytes_avx512x16(flow, &pdata[2], fm[1],
+			&di[1]);
+
+		/* main 4B loop */
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+static inline __m512i
+resolve_match_idx_avx512x16(__m512i mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm512_slli_epi32(mi, match_log);
+}
+
+static inline __m512i
+resolve_pri_avx512x16(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask16 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask16 m;
+	__m512i cp, cr, np, nr, mch;
+
+	const __m512i zero = _mm512_set1_epi32(0);
+
+	mch = _mm512_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x16(mch);
+
+	cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm512_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x16(mch);
+
+		nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm512_cmpgt_epi32_mask(cp, np);
+		cr = _mm512_mask_mov_epi32(nr, m, cr);
+		cp = _mm512_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 16) matches for single category
+ */
+static inline void
+resolve_sc_avx512x16(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	__mmask16 msk;
+	__m512i cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip);
+	_mm512_mask_storeu_epi32(result, msk, cr);
+}
+
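resolve_pri_avx512x16() and resolve_sc_avx512x16() express, with gathers and masked moves, the per-packet loop sketched in scalar form below: take the first trie's match, then let each later trie override it when its priority is equal or higher, exactly as the cmpgt/mask_mov pair does. The function name and the explicit match_log parameter are illustrative; match_log itself and the match[] layout (one row of nb_skip entries per trie) come from the surrounding ACL code.

/*
 * Scalar reference for single-category priority resolution above.
 * Illustrative only; not part of the patch.
 */
#include <stdint.h>

static void
resolve_single_cat_scalar(uint32_t result[], const int32_t res[],
	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
	uint32_t nb_trie, uint32_t nb_skip, uint32_t match_log)
{
	uint32_t i, t, mi;
	int32_t cr, cp;

	for (i = 0; i != nb_pkt; i++) {
		/* resolve_match_idx: scale match index to int32 units */
		mi = match[i] << match_log;
		cr = res[mi];
		cp = pri[mi];

		for (t = 1; t != nb_trie; t++) {
			mi = match[t * nb_skip + i] << match_log;
			if (pri[mi] >= cp) {
				cp = pri[mi];
				cr = res[mi];
			}
		}
		result[i] = (uint32_t)cr;
	}
}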
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x16x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t i, j, k, n;
+	const uint32_t *pm;
+	const int32_t *res, *pri;
+	__mmask16 m[2];
+	__m512i cp[2], cr[2], np[2], nr[2], mch[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) {
+
+		j = k + MASK16_BIT;
+
+		/* load match indexes for first trie */
+		mch[0] = _mm512_loadu_si512(match + k);
+		mch[1] = _mm512_loadu_si512(match + j);
+
+		mch[0] = resolve_match_idx_avx512x16(mch[0]);
+		mch[1] = resolve_match_idx_avx512x16(mch[1]);
+
+		/* load matches and their priorities for first trie */
+
+		cr[0] = _mm512_i32gather_epi32(mch[0], res, sizeof(res[0]));
+		cr[1] = _mm512_i32gather_epi32(mch[1], res, sizeof(res[0]));
+
+		cp[0] = _mm512_i32gather_epi32(mch[0], pri, sizeof(pri[0]));
+		cp[1] = _mm512_i32gather_epi32(mch[1], pri, sizeof(pri[0]));
+
+		/* select match with highest priority */
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mch[0] = _mm512_loadu_si512(pm + k);
+			mch[1] = _mm512_loadu_si512(pm + j);
+
+			mch[0] = resolve_match_idx_avx512x16(mch[0]);
+			mch[1] = resolve_match_idx_avx512x16(mch[1]);
+
+			nr[0] = _mm512_i32gather_epi32(mch[0], res,
+				sizeof(res[0]));
+			nr[1] = _mm512_i32gather_epi32(mch[1], res,
+				sizeof(res[0]));
+
+			np[0] = _mm512_i32gather_epi32(mch[0], pri,
+				sizeof(pri[0]));
+			np[1] = _mm512_i32gather_epi32(mch[1], pri,
+				sizeof(pri[0]));
+
+			m[0] = _mm512_cmpgt_epi32_mask(cp[0], np[0]);
+			m[1] = _mm512_cmpgt_epi32_mask(cp[1], np[1]);
+
+			cr[0] = _mm512_mask_mov_epi32(nr[0], m[0], cr[0]);
+			cr[1] = _mm512_mask_mov_epi32(nr[1], m[1], cr[1]);
+
+			cp[0] = _mm512_mask_mov_epi32(np[0], m[0], cp[0]);
+			cp[1] = _mm512_mask_mov_epi32(np[1], m[1], cp[1]);
+		}
+
+		_mm512_storeu_si512(result + k, cr[0]);
+		_mm512_storeu_si512(result + j, cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > MASK16_BIT) {
+			resolve_sc_avx512x16(result + k, res, pri, match + k,
+				MASK16_BIT, nb_trie, nb_pkt);
+			k += MASK16_BIT;
+			n -= MASK16_BIT;
+		}
+		resolve_sc_avx512x16(result + k, res, pri, match + k, n,
+			nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x16x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x16x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
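search_avx512x16x2() remains internal to the library; applications keep using the public classify API, which later patches in this series wire up to dispatch to this code when the CPU and toolchain support AVX-512. A minimal caller-side sketch against the existing public API (no new symbols assumed):

#include <rte_acl.h>

/*
 * Classify a burst of packets against an already-built ACL context,
 * one result per packet, single category.
 */
static int
classify_burst(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num)
{
	return rte_acl_classify(ctx, data, results, num, 1);
}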