From patchwork Mon May 24 18:59:49 2021
X-Patchwork-Submitter: Lance Richardson
X-Patchwork-Id: 93426
X-Patchwork-Delegate: ajit.khaparde@broadcom.com
From: Lance Richardson
To: Ajit Khaparde, Somnath Kotur, Jerin Jacob, Ruifeng Wang,
 Bruce Richardson, Konstantin Ananyev
Cc: dev@dpdk.org
Date: Mon, 24 May 2021 14:59:49 -0400
Message-Id: <20210524185952.673258-2-lance.richardson@broadcom.com>
In-Reply-To: <20210524185952.673258-1-lance.richardson@broadcom.com>
References: <20210524185952.673258-1-lance.richardson@broadcom.com>
Subject: [dpdk-dev] [PATCH 1/3] net/bnxt: refactor HW ptype mapping table

Make the definition of the table used to map hardware packet type
information to DPDK packet type more generic. Add macro definitions for
constants used in creating table indices, use these to eliminate raw
constants in code. Add build-time assertions to validate ptype mapping
constants.
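The sketch below is illustrative only (the EX_* names are hypothetical and not
part of the patch): it shows how the 7-bit ptype table index described above is
laid out. The patch itself derives the same bits directly from the Rx
completion flags through the new BNXT_PTYPE_TBL_* constants and
BNXT_CMPL_*_TO_IDX() macros, so a packet-type lookup becomes a single
bnxt_ptype_table[index] load.

#include <stdint.h>

#define EX_PTYPE_TBL_DIM  128          /* 2^7 distinct index values. */
#define EX_IDX_TUNNEL     (1u << 0)    /* Bit 0: IP tunnel encapsulated. */
#define EX_IDX_IPV6       (1u << 1)    /* Bit 1: IPv6 (clear means IPv4). */
#define EX_IDX_VLAN       (1u << 2)    /* Bit 2: VLAN tag present. */
#define EX_IDX_ITYPE_SFT  3            /* Bits 3-6: 4-bit HW item type. */

static inline uint32_t
ex_ptype_tbl_index(int tunnel, int ipv6, int vlan, uint32_t hw_itype)
{
	/* Result is always in the range 0..EX_PTYPE_TBL_DIM - 1. */
	return (tunnel ? EX_IDX_TUNNEL : 0) |
	       (ipv6 ? EX_IDX_IPV6 : 0) |
	       (vlan ? EX_IDX_VLAN : 0) |
	       ((hw_itype & 0xfu) << EX_IDX_ITYPE_SFT);
}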
Signed-off-by: Lance Richardson Reviewed-by: Ajit Kumar Khaparde --- drivers/net/bnxt/bnxt_rxr.c | 34 +++++++++++---------- drivers/net/bnxt/bnxt_rxr.h | 43 ++++++++++++++++++++++++++- drivers/net/bnxt/bnxt_rxtx_vec_neon.c | 19 ++++++++---- drivers/net/bnxt/bnxt_rxtx_vec_sse.c | 18 +++++++---- 4 files changed, 85 insertions(+), 29 deletions(-) diff --git a/drivers/net/bnxt/bnxt_rxr.c b/drivers/net/bnxt/bnxt_rxr.c index 2ef4115ef9..a6a8fb213b 100644 --- a/drivers/net/bnxt/bnxt_rxr.c +++ b/drivers/net/bnxt/bnxt_rxr.c @@ -396,14 +396,14 @@ bnxt_init_ptype_table(void) return; for (i = 0; i < BNXT_PTYPE_TBL_DIM; i++) { - if (i & (RX_PKT_CMPL_FLAGS2_META_FORMAT_VLAN >> 2)) + if (i & BNXT_PTYPE_TBL_VLAN_MSK) pt[i] = RTE_PTYPE_L2_ETHER_VLAN; else pt[i] = RTE_PTYPE_L2_ETHER; - ip6 = i & (RX_PKT_CMPL_FLAGS2_IP_TYPE >> 7); - tun = i & (RX_PKT_CMPL_FLAGS2_T_IP_CS_CALC >> 2); - type = (i & 0x78) << 9; + ip6 = !!(i & BNXT_PTYPE_TBL_IP_VER_MSK); + tun = !!(i & BNXT_PTYPE_TBL_TUN_MSK); + type = (i & BNXT_PTYPE_TBL_TYPE_MSK) >> BNXT_PTYPE_TBL_TYPE_SFT; if (!tun && !ip6) l3 = RTE_PTYPE_L3_IPV4_EXT_UNKNOWN; @@ -415,25 +415,25 @@ bnxt_init_ptype_table(void) l3 = RTE_PTYPE_INNER_L3_IPV6_EXT_UNKNOWN; switch (type) { - case RX_PKT_CMPL_FLAGS_ITYPE_ICMP: + case BNXT_PTYPE_TBL_TYPE_ICMP: if (tun) pt[i] |= l3 | RTE_PTYPE_INNER_L4_ICMP; else pt[i] |= l3 | RTE_PTYPE_L4_ICMP; break; - case RX_PKT_CMPL_FLAGS_ITYPE_TCP: + case BNXT_PTYPE_TBL_TYPE_TCP: if (tun) pt[i] |= l3 | RTE_PTYPE_INNER_L4_TCP; else pt[i] |= l3 | RTE_PTYPE_L4_TCP; break; - case RX_PKT_CMPL_FLAGS_ITYPE_UDP: + case BNXT_PTYPE_TBL_TYPE_UDP: if (tun) pt[i] |= l3 | RTE_PTYPE_INNER_L4_UDP; else pt[i] |= l3 | RTE_PTYPE_L4_UDP; break; - case RX_PKT_CMPL_FLAGS_ITYPE_IP: + case BNXT_PTYPE_TBL_TYPE_IP: pt[i] |= l3; break; } @@ -450,17 +450,19 @@ bnxt_parse_pkt_type(struct rx_pkt_cmpl *rxcmp, struct rx_pkt_cmpl_hi *rxcmp1) flags_type = rte_le_to_cpu_16(rxcmp->flags_type); flags2 = rte_le_to_cpu_32(rxcmp1->flags2); + /* Validate ptype table indexing at build time. */ + bnxt_check_ptype_constants(); + /* * Index format: - * bit 0: RX_PKT_CMPL_FLAGS2_T_IP_CS_CALC - * bit 1: RX_CMPL_FLAGS2_IP_TYPE - * bit 2: RX_PKT_CMPL_FLAGS2_META_FORMAT_VLAN - * bits 3-6: RX_PKT_CMPL_FLAGS_ITYPE + * bit 0: Set if IP tunnel encapsulated packet. + * bit 1: Set if IPv6 packet, clear if IPv4. + * bit 2: Set if VLAN tag present. + * bits 3-6: Four-bit hardware packet type field. */ - index = ((flags_type & RX_PKT_CMPL_FLAGS_ITYPE_MASK) >> 9) | - ((flags2 & (RX_PKT_CMPL_FLAGS2_META_FORMAT_VLAN | - RX_PKT_CMPL_FLAGS2_T_IP_CS_CALC)) >> 2) | - ((flags2 & RX_PKT_CMPL_FLAGS2_IP_TYPE) >> 7); + index = BNXT_CMPL_ITYPE_TO_IDX(flags_type) | + BNXT_CMPL_VLAN_TUN_TO_IDX(flags2) | + BNXT_CMPL_IP_VER_TO_IDX(flags2); return bnxt_ptype_table[index]; } diff --git a/drivers/net/bnxt/bnxt_rxr.h b/drivers/net/bnxt/bnxt_rxr.h index b43256e03e..79f1458698 100644 --- a/drivers/net/bnxt/bnxt_rxr.h +++ b/drivers/net/bnxt/bnxt_rxr.h @@ -131,7 +131,48 @@ bnxt_cfa_code_dynfield(struct rte_mbuf *mbuf) #define BNXT_CFA_META_EEM_TCAM_SHIFT 31 #define BNXT_CFA_META_EM_TEST(x) ((x) >> BNXT_CFA_META_EEM_TCAM_SHIFT) -#define BNXT_PTYPE_TBL_DIM 128 +/* Definitions for translation of hardware packet type to mbuf ptype. */ +#define BNXT_PTYPE_TBL_DIM 128 +#define BNXT_PTYPE_TBL_TUN_SFT 0 /* Set if tunneled packet. */ +#define BNXT_PTYPE_TBL_TUN_MSK BIT(BNXT_PTYPE_TBL_TUN_SFT) +#define BNXT_PTYPE_TBL_IP_VER_SFT 1 /* Set if IPv6, clear if IPv4. 
*/ +#define BNXT_PTYPE_TBL_IP_VER_MSK BIT(BNXT_PTYPE_TBL_IP_VER_SFT) +#define BNXT_PTYPE_TBL_VLAN_SFT 2 /* Set if VLAN encapsulated. */ +#define BNXT_PTYPE_TBL_VLAN_MSK BIT(BNXT_PTYPE_TBL_VLAN_SFT) +#define BNXT_PTYPE_TBL_TYPE_SFT 3 /* Hardware packet type field. */ +#define BNXT_PTYPE_TBL_TYPE_MSK 0x78 /* Hardware itype field mask. */ +#define BNXT_PTYPE_TBL_TYPE_IP 1 +#define BNXT_PTYPE_TBL_TYPE_TCP 2 +#define BNXT_PTYPE_TBL_TYPE_UDP 3 +#define BNXT_PTYPE_TBL_TYPE_ICMP 7 + +#define RX_PKT_CMPL_FLAGS2_IP_TYPE_SFT 8 +#define CMPL_FLAGS2_VLAN_TUN_MSK \ + (RX_PKT_CMPL_FLAGS2_META_FORMAT_VLAN | RX_PKT_CMPL_FLAGS2_T_IP_CS_CALC) + +#define BNXT_CMPL_ITYPE_TO_IDX(ft) \ + (((ft) & RX_PKT_CMPL_FLAGS_ITYPE_MASK) >> \ + (RX_PKT_CMPL_FLAGS_ITYPE_SFT - BNXT_PTYPE_TBL_TYPE_SFT)) + +#define BNXT_CMPL_VLAN_TUN_TO_IDX(f2) \ + (((f2) & CMPL_FLAGS2_VLAN_TUN_MSK) >> \ + (RX_PKT_CMPL_FLAGS2_META_FORMAT_SFT - BNXT_PTYPE_TBL_VLAN_SFT)) + +#define BNXT_CMPL_IP_VER_TO_IDX(f2) \ + (((f2) & RX_PKT_CMPL_FLAGS2_IP_TYPE) >> \ + (RX_PKT_CMPL_FLAGS2_IP_TYPE_SFT - BNXT_PTYPE_TBL_IP_VER_SFT)) + +static inline void +bnxt_check_ptype_constants(void) +{ + RTE_BUILD_BUG_ON(BNXT_CMPL_ITYPE_TO_IDX(RX_PKT_CMPL_FLAGS_ITYPE_MASK) != + BNXT_PTYPE_TBL_TYPE_MSK); + RTE_BUILD_BUG_ON(BNXT_CMPL_VLAN_TUN_TO_IDX(CMPL_FLAGS2_VLAN_TUN_MSK) != + (BNXT_PTYPE_TBL_VLAN_MSK | BNXT_PTYPE_TBL_TUN_MSK)); + RTE_BUILD_BUG_ON(BNXT_CMPL_IP_VER_TO_IDX(RX_PKT_CMPL_FLAGS2_IP_TYPE) != + BNXT_PTYPE_TBL_IP_VER_MSK); +} + extern uint32_t bnxt_ptype_table[BNXT_PTYPE_TBL_DIM]; /* Stingray2 specific code for RX completion parsing */ diff --git a/drivers/net/bnxt/bnxt_rxtx_vec_neon.c b/drivers/net/bnxt/bnxt_rxtx_vec_neon.c index bc2e96ec38..a6fbc0b0bf 100644 --- a/drivers/net/bnxt/bnxt_rxtx_vec_neon.c +++ b/drivers/net/bnxt/bnxt_rxtx_vec_neon.c @@ -71,8 +71,7 @@ descs_to_mbufs(uint32x4_t mm_rxcmp[4], uint32x4_t mm_rxcmp1[4], const uint32x4_t flags_type_mask = vdupq_n_u32(RX_PKT_CMPL_FLAGS_ITYPE_MASK); const uint32x4_t flags2_mask1 = - vdupq_n_u32(RX_PKT_CMPL_FLAGS2_META_FORMAT_VLAN | - RX_PKT_CMPL_FLAGS2_T_IP_CS_CALC); + vdupq_n_u32(CMPL_FLAGS2_VLAN_TUN_MSK); const uint32x4_t flags2_mask2 = vdupq_n_u32(RX_PKT_CMPL_FLAGS2_IP_TYPE); const uint32x4_t rss_mask = @@ -84,14 +83,18 @@ descs_to_mbufs(uint32x4_t mm_rxcmp[4], uint32x4_t mm_rxcmp1[4], uint64x2_t t0, t1; uint32_t ol_flags; + /* Validate ptype table indexing at build time. 
*/ + bnxt_check_ptype_constants(); + /* Compute packet type table indexes for four packets */ t0 = vreinterpretq_u64_u32(vzip1q_u32(mm_rxcmp[0], mm_rxcmp[1])); t1 = vreinterpretq_u64_u32(vzip1q_u32(mm_rxcmp[2], mm_rxcmp[3])); flags_type = vreinterpretq_u32_u64(vcombine_u64(vget_low_u64(t0), vget_low_u64(t1))); - ptype_idx = - vshrq_n_u32(vandq_u32(flags_type, flags_type_mask), 9); + ptype_idx = vshrq_n_u32(vandq_u32(flags_type, flags_type_mask), + RX_PKT_CMPL_FLAGS_ITYPE_SFT - + BNXT_PTYPE_TBL_TYPE_SFT); t0 = vreinterpretq_u64_u32(vzip1q_u32(mm_rxcmp1[0], mm_rxcmp1[1])); t1 = vreinterpretq_u64_u32(vzip1q_u32(mm_rxcmp1[2], mm_rxcmp1[3])); @@ -100,9 +103,13 @@ descs_to_mbufs(uint32x4_t mm_rxcmp[4], uint32x4_t mm_rxcmp1[4], vget_low_u64(t1))); ptype_idx = vorrq_u32(ptype_idx, - vshrq_n_u32(vandq_u32(flags2, flags2_mask1), 2)); + vshrq_n_u32(vandq_u32(flags2, flags2_mask1), + RX_PKT_CMPL_FLAGS2_META_FORMAT_SFT - + BNXT_PTYPE_TBL_VLAN_SFT)); ptype_idx = vorrq_u32(ptype_idx, - vshrq_n_u32(vandq_u32(flags2, flags2_mask2), 7)); + vshrq_n_u32(vandq_u32(flags2, flags2_mask2), + RX_PKT_CMPL_FLAGS2_IP_TYPE_SFT - + BNXT_PTYPE_TBL_IP_VER_SFT)); /* Extract RSS valid flags for four packets. */ rss_flags = vshrq_n_u32(vandq_u32(flags_type, rss_mask), 9); diff --git a/drivers/net/bnxt/bnxt_rxtx_vec_sse.c b/drivers/net/bnxt/bnxt_rxtx_vec_sse.c index 7ec04797b7..6dd18a0077 100644 --- a/drivers/net/bnxt/bnxt_rxtx_vec_sse.c +++ b/drivers/net/bnxt/bnxt_rxtx_vec_sse.c @@ -66,8 +66,7 @@ descs_to_mbufs(__m128i mm_rxcmp[4], __m128i mm_rxcmp1[4], const __m128i flags_type_mask = _mm_set1_epi32(RX_PKT_CMPL_FLAGS_ITYPE_MASK); const __m128i flags2_mask1 = - _mm_set1_epi32(RX_PKT_CMPL_FLAGS2_META_FORMAT_VLAN | - RX_PKT_CMPL_FLAGS2_T_IP_CS_CALC); + _mm_set1_epi32(CMPL_FLAGS2_VLAN_TUN_MSK); const __m128i flags2_mask2 = _mm_set1_epi32(RX_PKT_CMPL_FLAGS2_IP_TYPE); const __m128i rss_mask = @@ -76,21 +75,28 @@ descs_to_mbufs(__m128i mm_rxcmp[4], __m128i mm_rxcmp1[4], __m128i ptype_idx, is_tunnel; uint32_t ol_flags; + /* Validate ptype table indexing at build time. */ + bnxt_check_ptype_constants(); + /* Compute packet type table indexes for four packets */ t0 = _mm_unpacklo_epi32(mm_rxcmp[0], mm_rxcmp[1]); t1 = _mm_unpacklo_epi32(mm_rxcmp[2], mm_rxcmp[3]); flags_type = _mm_unpacklo_epi64(t0, t1); - ptype_idx = - _mm_srli_epi32(_mm_and_si128(flags_type, flags_type_mask), 9); + ptype_idx = _mm_srli_epi32(_mm_and_si128(flags_type, flags_type_mask), + RX_PKT_CMPL_FLAGS_ITYPE_SFT - BNXT_PTYPE_TBL_TYPE_SFT); t0 = _mm_unpacklo_epi32(mm_rxcmp1[0], mm_rxcmp1[1]); t1 = _mm_unpacklo_epi32(mm_rxcmp1[2], mm_rxcmp1[3]); flags2 = _mm_unpacklo_epi64(t0, t1); ptype_idx = _mm_or_si128(ptype_idx, - _mm_srli_epi32(_mm_and_si128(flags2, flags2_mask1), 2)); + _mm_srli_epi32(_mm_and_si128(flags2, flags2_mask1), + RX_PKT_CMPL_FLAGS2_META_FORMAT_SFT - + BNXT_PTYPE_TBL_VLAN_SFT)); ptype_idx = _mm_or_si128(ptype_idx, - _mm_srli_epi32(_mm_and_si128(flags2, flags2_mask2), 7)); + _mm_srli_epi32(_mm_and_si128(flags2, flags2_mask2), + RX_PKT_CMPL_FLAGS2_IP_TYPE_SFT - + BNXT_PTYPE_TBL_IP_VER_SFT)); /* Extract RSS valid flags for four packets. 
*/ rss_flags = _mm_srli_epi32(_mm_and_si128(flags_type, rss_mask), 9);

From patchwork Mon May 24 18:59:50 2021
X-Patchwork-Submitter: Lance Richardson
X-Patchwork-Id: 93427
X-Patchwork-Delegate: ajit.khaparde@broadcom.com
From: Lance Richardson
To: Jerin Jacob, Ruifeng Wang, Ajit Khaparde, Somnath Kotur,
 Bruce Richardson, Konstantin Ananyev
Cc: dev@dpdk.org, stable@dpdk.org
Date: Mon, 24 May 2021 14:59:50 -0400
Message-Id: <20210524185952.673258-3-lance.richardson@broadcom.com>
In-Reply-To: <20210524185952.673258-1-lance.richardson@broadcom.com>
References: <20210524185952.673258-1-lance.richardson@broadcom.com>
Subject: [dpdk-dev] [PATCH 2/3] net/bnxt: fix Rx burst size constraint

The burst receive function should return all packets currently present
in the receive ring up to the requested burst size, update vector mode
receive functions accordingly.
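As context for the fix, the wrapper pattern added by this patch is restated
below with generic names (the EX_* names are hypothetical; the real code wraps
recv_burst_vec_neon()/recv_burst_vec_sse() and uses RTE_BNXT_MAX_RX_BURST).
The fixed-size inner routine still processes a bounded number of packets per
call, and the public burst function loops until the request is satisfied or
the inner call comes up short, meaning no more completed packets are waiting.

#include <stdint.h>

#define EX_MAX_BURST 32	/* Stand-in for RTE_BNXT_MAX_RX_BURST. */

/*
 * Fixed-size inner burst routine, assumed to be provided elsewhere
 * (stand-in for the SSE/Neon implementation); it is never asked for
 * more than EX_MAX_BURST packets per call.
 */
uint16_t ex_recv_burst_fixed(void *rxq, void **pkts, uint16_t n);

uint16_t
ex_recv_pkts(void *rxq, void **pkts, uint16_t nb_pkts)
{
	uint16_t cnt = 0;

	while (nb_pkts > EX_MAX_BURST) {
		uint16_t burst;

		burst = ex_recv_burst_fixed(rxq, pkts + cnt, EX_MAX_BURST);
		cnt += burst;
		nb_pkts -= burst;

		/* A short inner burst means the ring is currently empty. */
		if (burst < EX_MAX_BURST)
			return cnt;
	}
	return cnt + ex_recv_burst_fixed(rxq, pkts + cnt, nb_pkts);
}

Keeping the inner routine bounded preserves the vector code's per-call
descriptor limit, while the loop satisfies the requirement stated above that
all currently available packets, up to the requested burst size, are returned.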
Fixes: 398358341419 ("net/bnxt: support NEON") Fixes: bc4a000f2f53 ("net/bnxt: implement SSE vector mode") Cc: stable@dpdk.org Signed-off-by: Lance Richardson Reviewed-by: Ajit Kumar Khaparde --- drivers/net/bnxt/bnxt_rxtx_vec_neon.c | 29 +++++++++++++++++++++------ drivers/net/bnxt/bnxt_rxtx_vec_sse.c | 29 +++++++++++++++++++++------ 2 files changed, 46 insertions(+), 12 deletions(-) diff --git a/drivers/net/bnxt/bnxt_rxtx_vec_neon.c b/drivers/net/bnxt/bnxt_rxtx_vec_neon.c index a6fbc0b0bf..a6e630ea5e 100644 --- a/drivers/net/bnxt/bnxt_rxtx_vec_neon.c +++ b/drivers/net/bnxt/bnxt_rxtx_vec_neon.c @@ -158,9 +158,8 @@ descs_to_mbufs(uint32x4_t mm_rxcmp[4], uint32x4_t mm_rxcmp1[4], vst1q_u32((uint32_t *)&mbuf[3]->rx_descriptor_fields1, tmp); } -uint16_t -bnxt_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, - uint16_t nb_pkts) +static uint16_t +recv_burst_vec_neon(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) { struct bnxt_rx_queue *rxq = rx_queue; struct bnxt_cp_ring_info *cpr = rxq->cp_ring; @@ -185,9 +184,6 @@ bnxt_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, if (rxq->rxrearm_nb >= rxq->rx_free_thresh) bnxt_rxq_rearm(rxq, rxr); - /* Return no more than RTE_BNXT_MAX_RX_BURST per call. */ - nb_pkts = RTE_MIN(nb_pkts, RTE_BNXT_MAX_RX_BURST); - cons = raw_cons & (cp_ring_size - 1); mbcons = (raw_cons / 2) & (rx_ring_size - 1); @@ -305,6 +301,27 @@ bnxt_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, return nb_rx_pkts; } +uint16_t +bnxt_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) +{ + uint16_t cnt = 0; + + while (nb_pkts > RTE_BNXT_MAX_RX_BURST) { + uint16_t burst; + + burst = recv_burst_vec_neon(rx_queue, rx_pkts + cnt, + RTE_BNXT_MAX_RX_BURST); + + cnt += burst; + nb_pkts -= burst; + + if (burst < RTE_BNXT_MAX_RX_BURST) + return cnt; + } + + return cnt + recv_burst_vec_neon(rx_queue, rx_pkts + cnt, nb_pkts); +} + static void bnxt_handle_tx_cp_vec(struct bnxt_tx_queue *txq) { diff --git a/drivers/net/bnxt/bnxt_rxtx_vec_sse.c b/drivers/net/bnxt/bnxt_rxtx_vec_sse.c index 6dd18a0077..fe074f82cf 100644 --- a/drivers/net/bnxt/bnxt_rxtx_vec_sse.c +++ b/drivers/net/bnxt/bnxt_rxtx_vec_sse.c @@ -149,9 +149,8 @@ descs_to_mbufs(__m128i mm_rxcmp[4], __m128i mm_rxcmp1[4], _mm_store_si128((void *)&mbuf[3]->rx_descriptor_fields1, t0); } -uint16_t -bnxt_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, - uint16_t nb_pkts) +static uint16_t +recv_burst_vec_sse(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) { struct bnxt_rx_queue *rxq = rx_queue; const __m128i mbuf_init = _mm_set_epi64x(0, rxq->mbuf_initializer); @@ -176,9 +175,6 @@ bnxt_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, if (rxq->rxrearm_nb >= rxq->rx_free_thresh) bnxt_rxq_rearm(rxq, rxr); - /* Return no more than RTE_BNXT_MAX_RX_BURST per call. 
*/ - nb_pkts = RTE_MIN(nb_pkts, RTE_BNXT_MAX_RX_BURST); - cons = raw_cons & (cp_ring_size - 1); mbcons = (raw_cons / 2) & (rx_ring_size - 1); @@ -286,6 +282,27 @@ bnxt_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, return nb_rx_pkts; } +uint16_t +bnxt_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) +{ + uint16_t cnt = 0; + + while (nb_pkts > RTE_BNXT_MAX_RX_BURST) { + uint16_t burst; + + burst = recv_burst_vec_sse(rx_queue, rx_pkts + cnt, + RTE_BNXT_MAX_RX_BURST); + + cnt += burst; + nb_pkts -= burst; + + if (burst < RTE_BNXT_MAX_RX_BURST) + return cnt; + } + + return cnt + recv_burst_vec_sse(rx_queue, rx_pkts + cnt, nb_pkts); +} + static void bnxt_handle_tx_cp_vec(struct bnxt_tx_queue *txq) {

From patchwork Mon May 24 18:59:51 2021
X-Patchwork-Submitter: Lance Richardson
X-Patchwork-Id: 93428
X-Patchwork-Delegate: ajit.khaparde@broadcom.com
From: Lance Richardson
To: Ajit Khaparde, Somnath Kotur, Bruce Richardson, Konstantin Ananyev,
 Jerin Jacob, Ruifeng Wang
Cc: dev@dpdk.org
Date: Mon, 24 May 2021 14:59:51 -0400
Message-Id: <20210524185952.673258-4-lance.richardson@broadcom.com>
In-Reply-To: <20210524185952.673258-1-lance.richardson@broadcom.com>
References: <20210524185952.673258-1-lance.richardson@broadcom.com> MIME-Version: 1.0 X-Content-Filtered-By: Mailman/MimeDel 2.1.29 Subject: [dpdk-dev] [PATCH 3/3] net/bnxt: add AVX2 vector PMD X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Implement AVX2 vector PMD. Signed-off-by: Lance Richardson Reviewed-by: Ajit Khaparde --- doc/guides/nics/bnxt.rst | 57 ++- drivers/net/bnxt/bnxt_ethdev.c | 119 +++-- drivers/net/bnxt/bnxt_rxr.c | 4 +- drivers/net/bnxt/bnxt_rxr.h | 11 +- drivers/net/bnxt/bnxt_rxtx_vec_avx2.c | 597 ++++++++++++++++++++++++++ drivers/net/bnxt/bnxt_rxtx_vec_neon.c | 25 +- drivers/net/bnxt/bnxt_rxtx_vec_sse.c | 31 +- drivers/net/bnxt/bnxt_txr.h | 7 + drivers/net/bnxt/meson.build | 17 + 9 files changed, 780 insertions(+), 88 deletions(-) create mode 100644 drivers/net/bnxt/bnxt_rxtx_vec_avx2.c diff --git a/doc/guides/nics/bnxt.rst b/doc/guides/nics/bnxt.rst index 0fb2032447..feb0c6a765 100644 --- a/doc/guides/nics/bnxt.rst +++ b/doc/guides/nics/bnxt.rst @@ -853,23 +853,36 @@ DPDK implements a light-weight library to allow PMDs to be bonded together and p Vector Processing ----------------- +The BNXT PMD provides vectorized burst transmit/receive function implementations +on x86-based platforms using SSE (Streaming SIMD Extensions) and AVX2 (Advanced +Vector Extensions 2) instructions, and on Arm-based platforms using Arm Neon +Advanced SIMD instructions. Vector processing support is currently implemented +only for Intel/AMD and Arm CPU architectures. + Vector processing provides significantly improved performance over scalar -processing (see Vector Processor, here). +processing. This improved performance is derived from a number of optimizations: + +* Using SIMD instructions to operate on multiple packets in parallel. +* Using SIMD instructions to do more work per instruction than is possible + with scalar instructions, for example by leveraging 128-bit and 256-bi + load/store instructions or by using SIMD shuffle and permute operations. +* Batching -The BNXT PMD supports the vector processing using SSE (Streaming SIMD -Extensions) instructions on x86 platforms. It also supports NEON intrinsics for -vector processing on ARM CPUs. The BNXT vPMD (vector mode PMD) is available for -Intel/AMD and ARM CPU architectures. +  * TX: transmit completions are processed in bulk. +  * RX: bulk allocation of mbufs is used when allocating rxq buffers. -This improved performance comes from several optimizations: +* Simplifications enabled by not supporting chained mbufs in vector mode. +* Simplifications enabled by not supporting some stateless offloads in vector + mode: -* Batching -  * TX: processing completions in bulk -  * RX: allocating mbufs in bulk -* Chained mbufs are *not* supported, i.e. 
a packet should fit a single mbuf -* Some stateless offloads are *not* supported with vector processing -  * TX: no offloads will be supported -  * RX: reduced RX offloads (listed below) will be supported:: +  * TX: only the following reduced set of transmit offloads is supported in + vector mode:: + +   DEV_TX_OFFLOAD_MBUF_FAST_FREE + +  * RX: only the following reduced set of receive offloads is supported in + vector mode (note that jumbo MTU is allowed only when the MTU setting + does not require `DEV_RX_OFFLOAD_SCATTER` to be enabled)::   DEV_RX_OFFLOAD_VLAN_STRIP   DEV_RX_OFFLOAD_KEEP_CRC @@ -878,23 +891,21 @@ This improved performance comes from several optimizations:   DEV_RX_OFFLOAD_UDP_CKSUM   DEV_RX_OFFLOAD_TCP_CKSUM   DEV_RX_OFFLOAD_OUTER_IPV4_CKSUM +   DEV_RX_OFFLOAD_OUTER_UDP_CKSUM   DEV_RX_OFFLOAD_RSS_HASH   DEV_RX_OFFLOAD_VLAN_FILTER -The BNXT Vector PMD is enabled in DPDK builds by default. - -However, a decision to enable vector mode will be made when the port transitions -from stopped to started. Any TX offloads or some RX offloads (other than listed -above) will disable the vector mode. -Offload configuration changes that impact vector mode must be made when the port -is stopped. +The BNXT Vector PMD is enabled in DPDK builds by default. The decision to enable +vector processing is made at run-time when the port is started; if no transmit +offloads outside the set supported for vector mode are enabled then vector mode +transmit will be enabled, and if no receive offloads outside the set supported +for vector mode are enabled then vector mode receive will be enabled. Offload +configuration changes that impact the decision to enable vector mode are allowed +only when the port is stopped. Note that TX (or RX) vector mode can be enabled independently from RX (or TX) vector mode. -Also vector mode is allowed when jumbo is enabled -as long as the MTU setting does not require scattered Rx. - Appendix -------- diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c index 0208795fd2..a7d056a34f 100644 --- a/drivers/net/bnxt/bnxt_ethdev.c +++ b/drivers/net/bnxt/bnxt_ethdev.c @@ -1178,32 +1178,57 @@ bnxt_receive_function(struct rte_eth_dev *eth_dev) return bnxt_recv_pkts; } -#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) -#ifndef RTE_LIBRTE_IEEE1588 +#if (defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)) && \ + !defined(RTE_LIBRTE_IEEE1588) + + /* Vector mode receive cannot be enabled if scattered rx is in use. */ + if (eth_dev->data->scattered_rx) + goto use_scalar_rx; + /* - * Vector mode receive can be enabled only if scatter rx is not - * in use and rx offloads are limited to VLAN stripping and - * CRC stripping. + * Vector mode receive cannot be enabled if Truflow is enabled or if + * asynchronous completions and receive completions can be placed in + * the same completion ring. 
*/ - if (!eth_dev->data->scattered_rx && - !(eth_dev->data->dev_conf.rxmode.offloads & - ~(DEV_RX_OFFLOAD_VLAN_STRIP | - DEV_RX_OFFLOAD_KEEP_CRC | - DEV_RX_OFFLOAD_JUMBO_FRAME | - DEV_RX_OFFLOAD_IPV4_CKSUM | - DEV_RX_OFFLOAD_UDP_CKSUM | - DEV_RX_OFFLOAD_TCP_CKSUM | - DEV_RX_OFFLOAD_OUTER_IPV4_CKSUM | - DEV_RX_OFFLOAD_OUTER_UDP_CKSUM | - DEV_RX_OFFLOAD_RSS_HASH | - DEV_RX_OFFLOAD_VLAN_FILTER)) && - !BNXT_TRUFLOW_EN(bp) && BNXT_NUM_ASYNC_CPR(bp) && - rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128) { - PMD_DRV_LOG(INFO, "Using vector mode receive for port %d\n", + if (BNXT_TRUFLOW_EN(bp) || !BNXT_NUM_ASYNC_CPR(bp)) + goto use_scalar_rx; + + /* + * Vector mode receive cannot be enabled if any receive offloads outside + * a limited subset have been enabled. + */ + if (eth_dev->data->dev_conf.rxmode.offloads & + ~(DEV_RX_OFFLOAD_VLAN_STRIP | + DEV_RX_OFFLOAD_KEEP_CRC | + DEV_RX_OFFLOAD_JUMBO_FRAME | + DEV_RX_OFFLOAD_IPV4_CKSUM | + DEV_RX_OFFLOAD_UDP_CKSUM | + DEV_RX_OFFLOAD_TCP_CKSUM | + DEV_RX_OFFLOAD_OUTER_IPV4_CKSUM | + DEV_RX_OFFLOAD_OUTER_UDP_CKSUM | + DEV_RX_OFFLOAD_RSS_HASH | + DEV_RX_OFFLOAD_VLAN_FILTER)) + goto use_scalar_rx; + +#if defined(RTE_ARCH_X86) && defined(CC_AVX2_SUPPORT) + if (rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_256 && + rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1) { + PMD_DRV_LOG(INFO, + "Using AVX2 vector mode receive for port %d\n", + eth_dev->data->port_id); + bp->flags |= BNXT_FLAG_RX_VECTOR_PKT_MODE; + return bnxt_recv_pkts_vec_avx2; + } + #endif + if (rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128) { + PMD_DRV_LOG(INFO, + "Using SSE vector mode receive for port %d\n", eth_dev->data->port_id); bp->flags |= BNXT_FLAG_RX_VECTOR_PKT_MODE; return bnxt_recv_pkts_vec; } + +use_scalar_rx: PMD_DRV_LOG(INFO, "Vector mode receive disabled for port %d\n", eth_dev->data->port_id); PMD_DRV_LOG(INFO, @@ -1211,7 +1236,6 @@ bnxt_receive_function(struct rte_eth_dev *eth_dev) eth_dev->data->port_id, eth_dev->data->scattered_rx, eth_dev->data->dev_conf.rxmode.offloads); -#endif #endif bp->flags &= ~BNXT_FLAG_RX_VECTOR_PKT_MODE; return bnxt_recv_pkts; @@ -1226,22 +1250,36 @@ bnxt_transmit_function(struct rte_eth_dev *eth_dev) if (BNXT_CHIP_SR2(bp)) return bnxt_xmit_pkts; -#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) -#ifndef RTE_LIBRTE_IEEE1588 +#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) && \ + !defined(RTE_LIBRTE_IEEE1588) uint64_t offloads = eth_dev->data->dev_conf.txmode.offloads; /* * Vector mode transmit can be enabled only if not using scatter rx * or tx offloads. 
*/ - if (!eth_dev->data->scattered_rx && - !(offloads & ~DEV_TX_OFFLOAD_MBUF_FAST_FREE) && - !BNXT_TRUFLOW_EN(bp) && - rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128) { - PMD_DRV_LOG(INFO, "Using vector mode transmit for port %d\n", + if (eth_dev->data->scattered_rx || + (offloads & ~DEV_TX_OFFLOAD_MBUF_FAST_FREE) || + BNXT_TRUFLOW_EN(bp)) + goto use_scalar_tx; + +#if defined(RTE_ARCH_X86) && defined(CC_AVX2_SUPPORT) + if (rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_256 && + rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1) { + PMD_DRV_LOG(INFO, + "Using AVX2 vector mode transmit for port %d\n", + eth_dev->data->port_id); + return bnxt_xmit_pkts_vec_avx2; + } +#endif + if (rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128) { + PMD_DRV_LOG(INFO, + "Using SSE vector mode transmit for port %d\n", eth_dev->data->port_id); return bnxt_xmit_pkts_vec; } + +use_scalar_tx: PMD_DRV_LOG(INFO, "Vector mode transmit disabled for port %d\n", eth_dev->data->port_id); PMD_DRV_LOG(INFO, @@ -1249,7 +1287,6 @@ bnxt_transmit_function(struct rte_eth_dev *eth_dev) eth_dev->data->port_id, eth_dev->data->scattered_rx, offloads); -#endif #endif return bnxt_xmit_pkts; } @@ -2859,11 +2896,15 @@ static const struct { eth_rx_burst_t pkt_burst; const char *info; } bnxt_rx_burst_info[] = { - {bnxt_recv_pkts, "Scalar"}, + {bnxt_recv_pkts, "Scalar"}, #if defined(RTE_ARCH_X86) - {bnxt_recv_pkts_vec, "Vector SSE"}, -#elif defined(RTE_ARCH_ARM64) - {bnxt_recv_pkts_vec, "Vector Neon"}, + {bnxt_recv_pkts_vec, "Vector SSE"}, +#endif +#if defined(RTE_ARCH_X86) && defined(CC_AVX2_SUPPORT) + {bnxt_recv_pkts_vec_avx2, "Vector AVX2"}, +#endif +#if defined(RTE_ARCH_ARM64) + {bnxt_recv_pkts_vec, "Vector Neon"}, #endif }; @@ -2889,11 +2930,15 @@ static const struct { eth_tx_burst_t pkt_burst; const char *info; } bnxt_tx_burst_info[] = { - {bnxt_xmit_pkts, "Scalar"}, + {bnxt_xmit_pkts, "Scalar"}, #if defined(RTE_ARCH_X86) - {bnxt_xmit_pkts_vec, "Vector SSE"}, -#elif defined(RTE_ARCH_ARM64) - {bnxt_xmit_pkts_vec, "Vector Neon"}, + {bnxt_xmit_pkts_vec, "Vector SSE"}, +#endif +#if defined(RTE_ARCH_X86) && defined(CC_AVX2_SUPPORT) + {bnxt_xmit_pkts_vec_avx2, "Vector AVX2"}, +#endif +#if defined(RTE_ARCH_ARM64) + {bnxt_xmit_pkts_vec, "Vector Neon"}, #endif }; diff --git a/drivers/net/bnxt/bnxt_rxr.c b/drivers/net/bnxt/bnxt_rxr.c index a6a8fb213b..4eef75f6be 100644 --- a/drivers/net/bnxt/bnxt_rxr.c +++ b/drivers/net/bnxt/bnxt_rxr.c @@ -1147,7 +1147,7 @@ int bnxt_init_rx_ring_struct(struct bnxt_rx_queue *rxq, unsigned int socket_id) /* Allocate extra rx ring entries for vector rx. */ ring->vmem_size = sizeof(struct rte_mbuf *) * - (ring->ring_size + RTE_BNXT_DESCS_PER_LOOP); + (ring->ring_size + BNXT_RX_EXTRA_MBUF_ENTRIES); ring->vmem = (void **)&rxr->rx_buf_ring; ring->fw_ring_id = INVALID_HW_RING_ID; @@ -1251,7 +1251,7 @@ int bnxt_init_one_rx_ring(struct bnxt_rx_queue *rxq) /* Initialize dummy mbuf pointers for vector mode rx. */ for (i = ring->ring_size; - i < ring->ring_size + RTE_BNXT_DESCS_PER_LOOP; i++) { + i < ring->ring_size + BNXT_RX_EXTRA_MBUF_ENTRIES; i++) { rxr->rx_buf_ring[i] = &rxq->fake_mbuf; } diff --git a/drivers/net/bnxt/bnxt_rxr.h b/drivers/net/bnxt/bnxt_rxr.h index 79f1458698..955bf3e99e 100644 --- a/drivers/net/bnxt/bnxt_rxr.h +++ b/drivers/net/bnxt/bnxt_rxr.h @@ -42,7 +42,12 @@ static inline uint16_t bnxt_tpa_start_agg_id(struct bnxt *bp, RX_PKT_CMPL_AGG_BUFS_SFT) /* Number of descriptors to process per inner loop in vector mode. 
*/ -#define RTE_BNXT_DESCS_PER_LOOP 4U +#define BNXT_RX_DESCS_PER_LOOP_VEC128 4U /* SSE, Neon */ +#define BNXT_RX_DESCS_PER_LOOP_VEC256 8U /* AVX2 */ + +/* Number of extra Rx mbuf ring entries to allocate for vector mode. */ +#define BNXT_RX_EXTRA_MBUF_ENTRIES \ + RTE_MAX(BNXT_RX_DESCS_PER_LOOP_VEC128, BNXT_RX_DESCS_PER_LOOP_VEC256) #define BNXT_OL_FLAGS_TBL_DIM 64 #define BNXT_OL_FLAGS_ERR_TBL_DIM 32 @@ -106,6 +111,10 @@ uint16_t bnxt_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, int bnxt_rxq_vec_setup(struct bnxt_rx_queue *rxq); #endif +#if defined(RTE_ARCH_X86) && defined(CC_AVX2_SUPPORT) +uint16_t bnxt_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts); +#endif void bnxt_set_mark_in_mbuf(struct bnxt *bp, struct rx_pkt_cmpl_hi *rxcmp1, struct rte_mbuf *mbuf); diff --git a/drivers/net/bnxt/bnxt_rxtx_vec_avx2.c b/drivers/net/bnxt/bnxt_rxtx_vec_avx2.c new file mode 100644 index 0000000000..a06dfec90e --- /dev/null +++ b/drivers/net/bnxt/bnxt_rxtx_vec_avx2.c @@ -0,0 +1,597 @@ +/* SPDX-License-Identifier: BSD-3-Clause */ +/* Copyright(c) 2019-2021 Broadcom All rights reserved. */ + +#include +#include + +#include +#include +#include +#include +#include + +#include "bnxt.h" +#include "bnxt_cpr.h" +#include "bnxt_ring.h" + +#include "bnxt_txq.h" +#include "bnxt_txr.h" +#include "bnxt_rxtx_vec_common.h" +#include + +static uint16_t +recv_burst_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) +{ + struct bnxt_rx_queue *rxq = rx_queue; + const __m256i mbuf_init = + _mm256_set_epi64x(0, 0, 0, rxq->mbuf_initializer); + struct bnxt_cp_ring_info *cpr = rxq->cp_ring; + struct bnxt_rx_ring_info *rxr = rxq->rx_ring; + uint16_t cp_ring_size = cpr->cp_ring_struct->ring_size; + uint16_t rx_ring_size = rxr->rx_ring_struct->ring_size; + struct cmpl_base *cp_desc_ring = cpr->cp_desc_ring; + uint64_t valid, desc_valid_mask = ~0ULL; + const __m256i info3_v_mask = _mm256_set1_epi32(CMPL_BASE_V); + uint32_t raw_cons = cpr->cp_raw_cons; + uint32_t cons, mbcons; + int nb_rx_pkts = 0; + int i; + const __m256i valid_target = + _mm256_set1_epi32(!!(raw_cons & cp_ring_size)); + const __m256i dsc_shuf_msk = + _mm256_set_epi8(0xff, 0xff, 0xff, 0xff, /* Zeroes. */ + 7, 6, /* metadata type */ + 9, 8, /* flags2 low 16 */ + 5, 4, /* vlan_tci */ + 1, 0, /* errors_v2 */ + 0xff, 0xff, 0xff, 0xff, /* Zeroes. */ + 0xff, 0xff, 0xff, 0xff, /* Zeroes. */ + 7, 6, /* metadata type */ + 9, 8, /* flags2 low 16 */ + 5, 4, /* vlan_tci */ + 1, 0, /* errors_v2 */ + 0xff, 0xff, 0xff, 0xff); /* Zeroes. */ + const __m256i shuf_msk = + _mm256_set_epi8(15, 14, 13, 12, /* rss */ + 7, 6, /* vlan_tci */ + 3, 2, /* data_len */ + 0xFF, 0xFF, 3, 2, /* pkt_len */ + 0xFF, 0xFF, 0xFF, 0xFF, /* pkt_type (zeroes) */ + 15, 14, 13, 12, /* rss */ + 7, 6, /* vlan_tci */ + 3, 2, /* data_len */ + 0xFF, 0xFF, 3, 2, /* pkt_len */ + 0xFF, 0xFF, 0xFF, 0xFF); /* pkt_type (zeroes) */ + const __m256i flags_type_mask = + _mm256_set1_epi32(RX_PKT_CMPL_FLAGS_ITYPE_MASK); + const __m256i flags2_mask1 = + _mm256_set1_epi32(CMPL_FLAGS2_VLAN_TUN_MSK); + const __m256i flags2_mask2 = + _mm256_set1_epi32(RX_PKT_CMPL_FLAGS2_IP_TYPE); + const __m256i rss_mask = + _mm256_set1_epi32(RX_PKT_CMPL_FLAGS_RSS_VALID); + __m256i t0, t1, flags_type, flags2, index, errors; + __m256i ptype_idx, ptypes, is_tunnel; + __m256i mbuf01, mbuf23, mbuf45, mbuf67; + __m256i rearm0, rearm1, rearm2, rearm3, rearm4, rearm5, rearm6, rearm7; + __m256i ol_flags, ol_flags_hi; + __m256i rss_flags; + + /* Validate ptype table indexing at build time. 
*/ + bnxt_check_ptype_constants(); + + /* If Rx Q was stopped return */ + if (unlikely(!rxq->rx_started)) + return 0; + + if (rxq->rxrearm_nb >= rxq->rx_free_thresh) + bnxt_rxq_rearm(rxq, rxr); + + nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, BNXT_RX_DESCS_PER_LOOP_VEC256); + + cons = raw_cons & (cp_ring_size - 1); + mbcons = (raw_cons / 2) & (rx_ring_size - 1); + + /* Prefetch first four descriptor pairs. */ + rte_prefetch0(&cp_desc_ring[cons + 0]); + rte_prefetch0(&cp_desc_ring[cons + 4]); + rte_prefetch0(&cp_desc_ring[cons + 8]); + rte_prefetch0(&cp_desc_ring[cons + 12]); + + /* Ensure that we do not go past the ends of the rings. */ + nb_pkts = RTE_MIN(nb_pkts, RTE_MIN(rx_ring_size - mbcons, + (cp_ring_size - cons) / 2)); + /* + * If we are at the end of the ring, ensure that descriptors after the + * last valid entry are not treated as valid. Otherwise, force the + * maximum number of packets to receive to be a multiple of the per- + * loop count. + */ + if (nb_pkts < BNXT_RX_DESCS_PER_LOOP_VEC256) { + desc_valid_mask >>= + CHAR_BIT * (BNXT_RX_DESCS_PER_LOOP_VEC256 - nb_pkts); + } else { + nb_pkts = + RTE_ALIGN_FLOOR(nb_pkts, BNXT_RX_DESCS_PER_LOOP_VEC256); + } + + /* Handle RX burst request */ + for (i = 0; i < nb_pkts; i += BNXT_RX_DESCS_PER_LOOP_VEC256, + cons += BNXT_RX_DESCS_PER_LOOP_VEC256 * 2, + mbcons += BNXT_RX_DESCS_PER_LOOP_VEC256) { + __m256i desc0, desc1, desc2, desc3, desc4, desc5, desc6, desc7; + __m256i rxcmp0_1, rxcmp2_3, rxcmp4_5, rxcmp6_7, info3_v; + __m256i errors_v2; + uint32_t num_valid; + + /* Copy eight mbuf pointers to output array. */ + t0 = _mm256_loadu_si256((void *)&rxr->rx_buf_ring[mbcons]); + _mm256_storeu_si256((void *)&rx_pkts[i], t0); +#ifdef RTE_ARCH_X86_64 + t0 = _mm256_loadu_si256((void *)&rxr->rx_buf_ring[mbcons + 4]); + _mm256_storeu_si256((void *)&rx_pkts[i + 4], t0); +#endif + + /* Prefetch eight descriptor pairs for next iteration. */ + if (i + BNXT_RX_DESCS_PER_LOOP_VEC256 < nb_pkts) { + rte_prefetch0(&cp_desc_ring[cons + 16]); + rte_prefetch0(&cp_desc_ring[cons + 20]); + rte_prefetch0(&cp_desc_ring[cons + 24]); + rte_prefetch0(&cp_desc_ring[cons + 28]); + } + + /* + * Load eight receive completion descriptors into 256-bit + * registers. Loads are issued in reverse order in order to + * ensure consistent state. + */ + desc7 = _mm256_load_si256((void *)&cp_desc_ring[cons + 14]); + rte_compiler_barrier(); + desc6 = _mm256_load_si256((void *)&cp_desc_ring[cons + 12]); + rte_compiler_barrier(); + desc5 = _mm256_load_si256((void *)&cp_desc_ring[cons + 10]); + rte_compiler_barrier(); + desc4 = _mm256_load_si256((void *)&cp_desc_ring[cons + 8]); + rte_compiler_barrier(); + desc3 = _mm256_load_si256((void *)&cp_desc_ring[cons + 6]); + rte_compiler_barrier(); + desc2 = _mm256_load_si256((void *)&cp_desc_ring[cons + 4]); + rte_compiler_barrier(); + desc1 = _mm256_load_si256((void *)&cp_desc_ring[cons + 2]); + rte_compiler_barrier(); + desc0 = _mm256_load_si256((void *)&cp_desc_ring[cons + 0]); + + /* + * Pack needed fields from each descriptor into a compressed + * 128-bit layout and pair two compressed descriptors into + * 256-bit registers. The 128-bit compressed layout is as + * follows: + * Bits 0-15: flags_type field from low completion record. + * Bits 16-31: len field from low completion record. + * Bits 32-47: flags2 (low 16 bits) from high completion. + * Bits 48-79: metadata from high completion record. + * Bits 80-95: errors_v2 from high completion record. + * Bits 96-127: rss hash from low completion record. 
+ */ + t0 = _mm256_permute2f128_si256(desc6, desc7, 0x20); + t1 = _mm256_permute2f128_si256(desc6, desc7, 0x31); + t1 = _mm256_shuffle_epi8(t1, dsc_shuf_msk); + rxcmp6_7 = _mm256_blend_epi32(t0, t1, 0x66); + + t0 = _mm256_permute2f128_si256(desc4, desc5, 0x20); + t1 = _mm256_permute2f128_si256(desc4, desc5, 0x31); + t1 = _mm256_shuffle_epi8(t1, dsc_shuf_msk); + rxcmp4_5 = _mm256_blend_epi32(t0, t1, 0x66); + + t0 = _mm256_permute2f128_si256(desc2, desc3, 0x20); + t1 = _mm256_permute2f128_si256(desc2, desc3, 0x31); + t1 = _mm256_shuffle_epi8(t1, dsc_shuf_msk); + rxcmp2_3 = _mm256_blend_epi32(t0, t1, 0x66); + + t0 = _mm256_permute2f128_si256(desc0, desc1, 0x20); + t1 = _mm256_permute2f128_si256(desc0, desc1, 0x31); + t1 = _mm256_shuffle_epi8(t1, dsc_shuf_msk); + rxcmp0_1 = _mm256_blend_epi32(t0, t1, 0x66); + + /* Compute packet type table indices for eight packets. */ + t0 = _mm256_unpacklo_epi32(rxcmp0_1, rxcmp2_3); + t1 = _mm256_unpacklo_epi32(rxcmp4_5, rxcmp6_7); + flags_type = _mm256_unpacklo_epi64(t0, t1); + ptype_idx = _mm256_and_si256(flags_type, flags_type_mask); + ptype_idx = _mm256_srli_epi32(ptype_idx, + RX_PKT_CMPL_FLAGS_ITYPE_SFT - + BNXT_PTYPE_TBL_TYPE_SFT); + + t0 = _mm256_unpacklo_epi32(rxcmp0_1, rxcmp2_3); + t1 = _mm256_unpacklo_epi32(rxcmp4_5, rxcmp6_7); + flags2 = _mm256_unpackhi_epi64(t0, t1); + + t0 = _mm256_srli_epi32(_mm256_and_si256(flags2, flags2_mask1), + RX_PKT_CMPL_FLAGS2_META_FORMAT_SFT - + BNXT_PTYPE_TBL_VLAN_SFT); + ptype_idx = _mm256_or_si256(ptype_idx, t0); + + t0 = _mm256_srli_epi32(_mm256_and_si256(flags2, flags2_mask2), + RX_PKT_CMPL_FLAGS2_IP_TYPE_SFT - + BNXT_PTYPE_TBL_IP_VER_SFT); + ptype_idx = _mm256_or_si256(ptype_idx, t0); + + /* + * Load ptypes for eight packets using gather. Gather operations + * have extremely high latency (~19 cycles), execution and use + * of result should be separated as much as possible. + */ + ptypes = _mm256_i32gather_epi32((int *)bnxt_ptype_table, + ptype_idx, sizeof(uint32_t)); + /* + * Compute ol_flags and checksum error table indices for eight + * packets. + */ + is_tunnel = _mm256_and_si256(flags2, _mm256_set1_epi32(4)); + is_tunnel = _mm256_slli_epi32(is_tunnel, 3); + flags2 = _mm256_and_si256(flags2, _mm256_set1_epi32(0x1F)); + + /* Extract errors_v2 fields for eight packets. */ + t0 = _mm256_unpackhi_epi32(rxcmp0_1, rxcmp2_3); + t1 = _mm256_unpackhi_epi32(rxcmp4_5, rxcmp6_7); + errors_v2 = _mm256_unpacklo_epi64(t0, t1); + + errors = _mm256_srli_epi32(errors_v2, 4); + errors = _mm256_and_si256(errors, _mm256_set1_epi32(0xF)); + errors = _mm256_and_si256(errors, flags2); + + index = _mm256_andnot_si256(errors, flags2); + errors = _mm256_or_si256(errors, + _mm256_srli_epi32(is_tunnel, 1)); + index = _mm256_or_si256(index, is_tunnel); + + /* + * Load ol_flags for eight packets using gather. Gather + * operations have extremely high latency (~19 cycles), + * execution and use of result should be separated as much + * as possible. + */ + ol_flags = _mm256_i32gather_epi32((int *)rxr->ol_flags_table, + index, sizeof(uint32_t)); + errors = _mm256_i32gather_epi32((int *)rxr->ol_flags_err_table, + errors, sizeof(uint32_t)); + + /* + * Pack the 128-bit array of valid descriptor flags into 64 + * bits and count the number of set bits in order to determine + * the number of valid descriptors. 
+ */ + const __m256i perm_msk = + _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + info3_v = _mm256_permutevar8x32_epi32(errors_v2, perm_msk); + info3_v = _mm256_and_si256(errors_v2, info3_v_mask); + info3_v = _mm256_xor_si256(info3_v, valid_target); + + info3_v = _mm256_packs_epi32(info3_v, _mm256_setzero_si256()); + valid = _mm_cvtsi128_si64(_mm256_extracti128_si256(info3_v, 1)); + valid = (valid << CHAR_BIT) | + _mm_cvtsi128_si64(_mm256_castsi256_si128(info3_v)); + num_valid = __builtin_popcountll(valid & desc_valid_mask); + + if (num_valid == 0) + break; + + /* Update mbuf rearm_data for eight packets. */ + mbuf01 = _mm256_shuffle_epi8(rxcmp0_1, shuf_msk); + mbuf23 = _mm256_shuffle_epi8(rxcmp2_3, shuf_msk); + mbuf45 = _mm256_shuffle_epi8(rxcmp4_5, shuf_msk); + mbuf67 = _mm256_shuffle_epi8(rxcmp6_7, shuf_msk); + + /* Blend in ptype field for two mbufs at a time. */ + mbuf01 = _mm256_blend_epi32(mbuf01, ptypes, 0x11); + mbuf23 = _mm256_blend_epi32(mbuf23, + _mm256_srli_si256(ptypes, 4), 0x11); + mbuf45 = _mm256_blend_epi32(mbuf45, + _mm256_srli_si256(ptypes, 8), 0x11); + mbuf67 = _mm256_blend_epi32(mbuf67, + _mm256_srli_si256(ptypes, 12), 0x11); + + /* Unpack rearm data, set fixed fields for first four mbufs. */ + rearm0 = _mm256_permute2f128_si256(mbuf_init, mbuf01, 0x20); + rearm1 = _mm256_blend_epi32(mbuf_init, mbuf01, 0xF0); + rearm2 = _mm256_permute2f128_si256(mbuf_init, mbuf23, 0x20); + rearm3 = _mm256_blend_epi32(mbuf_init, mbuf23, 0xF0); + + /* Compute final ol_flags values for eight packets. */ + rss_flags = _mm256_and_si256(flags_type, rss_mask); + rss_flags = _mm256_srli_epi32(rss_flags, 9); + ol_flags = _mm256_or_si256(ol_flags, errors); + ol_flags = _mm256_or_si256(ol_flags, rss_flags); + ol_flags_hi = _mm256_permute2f128_si256(ol_flags, + ol_flags, 0x11); + + /* Set ol_flags fields for first four packets. */ + rearm0 = _mm256_blend_epi32(rearm0, + _mm256_slli_si256(ol_flags, 8), + 0x04); + rearm1 = _mm256_blend_epi32(rearm1, + _mm256_slli_si256(ol_flags_hi, 8), + 0x04); + rearm2 = _mm256_blend_epi32(rearm2, + _mm256_slli_si256(ol_flags, 4), + 0x04); + rearm3 = _mm256_blend_epi32(rearm3, + _mm256_slli_si256(ol_flags_hi, 4), + 0x04); + + /* Store all mbuf fields for first four packets. */ + _mm256_storeu_si256((void *)&rx_pkts[i + 0]->rearm_data, + rearm0); + _mm256_storeu_si256((void *)&rx_pkts[i + 1]->rearm_data, + rearm1); + _mm256_storeu_si256((void *)&rx_pkts[i + 2]->rearm_data, + rearm2); + _mm256_storeu_si256((void *)&rx_pkts[i + 3]->rearm_data, + rearm3); + + /* Unpack rearm data, set fixed fields for final four mbufs. */ + rearm4 = _mm256_permute2f128_si256(mbuf_init, mbuf45, 0x20); + rearm5 = _mm256_blend_epi32(mbuf_init, mbuf45, 0xF0); + rearm6 = _mm256_permute2f128_si256(mbuf_init, mbuf67, 0x20); + rearm7 = _mm256_blend_epi32(mbuf_init, mbuf67, 0xF0); + + /* Set ol_flags fields for final four packets. */ + rearm4 = _mm256_blend_epi32(rearm4, ol_flags, 0x04); + rearm5 = _mm256_blend_epi32(rearm5, ol_flags_hi, 0x04); + rearm6 = _mm256_blend_epi32(rearm6, + _mm256_srli_si256(ol_flags, 4), + 0x04); + rearm7 = _mm256_blend_epi32(rearm7, + _mm256_srli_si256(ol_flags_hi, 4), + 0x04); + + /* Store all mbuf fields for final four packets. 
*/ + _mm256_storeu_si256((void *)&rx_pkts[i + 4]->rearm_data, + rearm4); + _mm256_storeu_si256((void *)&rx_pkts[i + 5]->rearm_data, + rearm5); + _mm256_storeu_si256((void *)&rx_pkts[i + 6]->rearm_data, + rearm6); + _mm256_storeu_si256((void *)&rx_pkts[i + 7]->rearm_data, + rearm7); + + nb_rx_pkts += num_valid; + if (num_valid < BNXT_RX_DESCS_PER_LOOP_VEC256) + break; + } + + if (nb_rx_pkts) { + rxr->rx_raw_prod = RING_ADV(rxr->rx_raw_prod, nb_rx_pkts); + + rxq->rxrearm_nb += nb_rx_pkts; + cpr->cp_raw_cons += 2 * nb_rx_pkts; + bnxt_db_cq(cpr); + } + + return nb_rx_pkts; +} + +uint16_t +bnxt_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts, + uint16_t nb_pkts) +{ + uint16_t cnt = 0; + + while (nb_pkts > RTE_BNXT_MAX_RX_BURST) { + uint16_t burst; + + burst = recv_burst_vec_avx2(rx_queue, rx_pkts + cnt, + RTE_BNXT_MAX_RX_BURST); + + cnt += burst; + nb_pkts -= burst; + + if (burst < RTE_BNXT_MAX_RX_BURST) + return cnt; + } + return cnt + recv_burst_vec_avx2(rx_queue, rx_pkts + cnt, nb_pkts); +} + +static void +bnxt_handle_tx_cp_vec(struct bnxt_tx_queue *txq) +{ + struct bnxt_cp_ring_info *cpr = txq->cp_ring; + uint32_t raw_cons = cpr->cp_raw_cons; + uint32_t cons; + uint32_t nb_tx_pkts = 0; + struct tx_cmpl *txcmp; + struct cmpl_base *cp_desc_ring = cpr->cp_desc_ring; + struct bnxt_ring *cp_ring_struct = cpr->cp_ring_struct; + uint32_t ring_mask = cp_ring_struct->ring_mask; + + do { + cons = RING_CMPL(ring_mask, raw_cons); + txcmp = (struct tx_cmpl *)&cp_desc_ring[cons]; + + if (!CMP_VALID(txcmp, raw_cons, cp_ring_struct)) + break; + + nb_tx_pkts += txcmp->opaque; + raw_cons = NEXT_RAW_CMP(raw_cons); + } while (nb_tx_pkts < ring_mask); + + if (nb_tx_pkts) { + if (txq->offloads & DEV_TX_OFFLOAD_MBUF_FAST_FREE) + bnxt_tx_cmp_vec_fast(txq, nb_tx_pkts); + else + bnxt_tx_cmp_vec(txq, nb_tx_pkts); + cpr->cp_raw_cons = raw_cons; + bnxt_db_cq(cpr); + } +} + +static inline void +bnxt_xmit_one(struct rte_mbuf *mbuf, struct tx_bd_long *txbd, + struct rte_mbuf **tx_buf) +{ + uint64_t dsc_hi, dsc_lo; + __m128i desc; + + *tx_buf = mbuf; + + dsc_hi = mbuf->buf_iova + mbuf->data_off; + dsc_lo = (mbuf->data_len << 16) | + bnxt_xmit_flags_len(mbuf->data_len, TX_BD_FLAGS_NOCMPL); + + desc = _mm_set_epi64x(dsc_hi, dsc_lo); + _mm_store_si128((void *)txbd, desc); +} + +static uint16_t +bnxt_xmit_fixed_burst_vec(struct bnxt_tx_queue *txq, struct rte_mbuf **pkts, + uint16_t nb_pkts) +{ + struct bnxt_tx_ring_info *txr = txq->tx_ring; + uint16_t tx_prod, tx_raw_prod = txr->tx_raw_prod; + struct tx_bd_long *txbd; + struct rte_mbuf **tx_buf; + uint16_t to_send; + + tx_prod = RING_IDX(txr->tx_ring_struct, tx_raw_prod); + txbd = &txr->tx_desc_ring[tx_prod]; + tx_buf = &txr->tx_buf_ring[tx_prod]; + + /* Prefetch next transmit buffer descriptors. */ + rte_prefetch0(txbd); + rte_prefetch0(txbd + 3); + + nb_pkts = RTE_MIN(nb_pkts, bnxt_tx_avail(txq)); + + if (unlikely(nb_pkts == 0)) + return 0; + + /* Handle TX burst request */ + to_send = nb_pkts; + + /* + * If current descriptor is not on a 32-byte boundary, send one packet + * to align for 32-byte stores. + */ + if (tx_prod & 1) { + bnxt_xmit_one(pkts[0], txbd++, tx_buf++); + to_send--; + pkts++; + } + + /* + * Send four packets per loop, with a single store for each pair + * of descriptors. + */ + while (to_send >= BNXT_TX_DESCS_PER_LOOP) { + uint64_t dsc0_hi, dsc0_lo, dsc1_hi, dsc1_lo; + uint64_t dsc2_hi, dsc2_lo, dsc3_hi, dsc3_lo; + __m256i dsc01, dsc23; + + /* Prefetch next transmit buffer descriptors. 
*/ + rte_prefetch0(txbd + 4); + rte_prefetch0(txbd + 7); + + /* Copy four mbuf pointers to tx buf ring. */ +#ifdef RTE_ARCH_X86_64 + __m256i tmp = _mm256_loadu_si256((void *)pkts); + _mm256_storeu_si256((void *)tx_buf, tmp); +#else + __m128i tmp = _mm_loadu_si128((void *)pkts); + _mm_storeu_si128((void *)tx_buf, tmp); +#endif + + dsc0_hi = tx_buf[0]->buf_iova + tx_buf[0]->data_off; + dsc0_lo = (tx_buf[0]->data_len << 16) | + bnxt_xmit_flags_len(tx_buf[0]->data_len, + TX_BD_FLAGS_NOCMPL); + + dsc1_hi = tx_buf[1]->buf_iova + tx_buf[1]->data_off; + dsc1_lo = (tx_buf[1]->data_len << 16) | + bnxt_xmit_flags_len(tx_buf[1]->data_len, + TX_BD_FLAGS_NOCMPL); + + dsc01 = _mm256_set_epi64x(dsc1_hi, dsc1_lo, dsc0_hi, dsc0_lo); + + dsc2_hi = tx_buf[2]->buf_iova + tx_buf[2]->data_off; + dsc2_lo = (tx_buf[2]->data_len << 16) | + bnxt_xmit_flags_len(tx_buf[2]->data_len, + TX_BD_FLAGS_NOCMPL); + + dsc3_hi = tx_buf[3]->buf_iova + tx_buf[3]->data_off; + dsc3_lo = (tx_buf[3]->data_len << 16) | + bnxt_xmit_flags_len(tx_buf[3]->data_len, + TX_BD_FLAGS_NOCMPL); + + dsc23 = _mm256_set_epi64x(dsc3_hi, dsc3_lo, dsc2_hi, dsc2_lo); + + _mm256_store_si256((void *)txbd, dsc01); + _mm256_store_si256((void *)(txbd + 2), dsc23); + + to_send -= BNXT_TX_DESCS_PER_LOOP; + pkts += BNXT_TX_DESCS_PER_LOOP; + txbd += BNXT_TX_DESCS_PER_LOOP; + tx_buf += BNXT_TX_DESCS_PER_LOOP; + } + + /* Send any remaining packets, writing each descriptor individually. */ + while (to_send) { + bnxt_xmit_one(pkts[0], txbd++, tx_buf++); + to_send--; + pkts++; + } + + /* Request a completion for the final packet of the burst. */ + txbd[-1].opaque = nb_pkts; + txbd[-1].flags_type &= ~TX_BD_LONG_FLAGS_NO_CMPL; + + tx_raw_prod += nb_pkts; + bnxt_db_write(&txr->tx_db, tx_raw_prod); + + txr->tx_raw_prod = tx_raw_prod; + + return nb_pkts; +} + +uint16_t +bnxt_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts) +{ + int nb_sent = 0; + struct bnxt_tx_queue *txq = tx_queue; + struct bnxt_tx_ring_info *txr = txq->tx_ring; + uint16_t ring_size = txr->tx_ring_struct->ring_size; + + /* Tx queue was stopped; wait for it to be restarted */ + if (unlikely(!txq->tx_started)) { + PMD_DRV_LOG(DEBUG, "Tx q stopped;return\n"); + return 0; + } + + /* Handle TX completions */ + if (bnxt_tx_bds_in_hw(txq) >= txq->tx_free_thresh) + bnxt_handle_tx_cp_vec(txq); + + while (nb_pkts) { + uint16_t ret, num; + + /* + * Ensure that no more than RTE_BNXT_MAX_TX_BURST packets + * are transmitted before the next completion. + */ + num = RTE_MIN(nb_pkts, RTE_BNXT_MAX_TX_BURST); + + /* + * Ensure that a ring wrap does not occur within a call to + * bnxt_xmit_fixed_burst_vec(). + */ + num = RTE_MIN(num, ring_size - + (txr->tx_raw_prod & (ring_size - 1))); + ret = bnxt_xmit_fixed_burst_vec(txq, &tx_pkts[nb_sent], num); + nb_sent += ret; + nb_pkts -= ret; + if (ret < num) + break; + } + + return nb_sent; +} diff --git a/drivers/net/bnxt/bnxt_rxtx_vec_neon.c b/drivers/net/bnxt/bnxt_rxtx_vec_neon.c index a6e630ea5e..b4e9202568 100644 --- a/drivers/net/bnxt/bnxt_rxtx_vec_neon.c +++ b/drivers/net/bnxt/bnxt_rxtx_vec_neon.c @@ -200,17 +200,20 @@ recv_burst_vec_neon(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) * maximum number of packets to receive to be a multiple of the per- * loop count. 
*/ - if (nb_pkts < RTE_BNXT_DESCS_PER_LOOP) - desc_valid_mask >>= 16 * (RTE_BNXT_DESCS_PER_LOOP - nb_pkts); - else - nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_BNXT_DESCS_PER_LOOP); + if (nb_pkts < BNXT_RX_DESCS_PER_LOOP_VEC128) { + desc_valid_mask >>= + 16 * (BNXT_RX_DESCS_PER_LOOP_VEC128 - nb_pkts); + } else { + nb_pkts = + RTE_ALIGN_FLOOR(nb_pkts, BNXT_RX_DESCS_PER_LOOP_VEC128); + } /* Handle RX burst request */ - for (i = 0; i < nb_pkts; i += RTE_BNXT_DESCS_PER_LOOP, - cons += RTE_BNXT_DESCS_PER_LOOP * 2, - mbcons += RTE_BNXT_DESCS_PER_LOOP) { - uint32x4_t rxcmp1[RTE_BNXT_DESCS_PER_LOOP]; - uint32x4_t rxcmp[RTE_BNXT_DESCS_PER_LOOP]; + for (i = 0; i < nb_pkts; i += BNXT_RX_DESCS_PER_LOOP_VEC128, + cons += BNXT_RX_DESCS_PER_LOOP_VEC128 * 2, + mbcons += BNXT_RX_DESCS_PER_LOOP_VEC128) { + uint32x4_t rxcmp1[BNXT_RX_DESCS_PER_LOOP_VEC128]; + uint32x4_t rxcmp[BNXT_RX_DESCS_PER_LOOP_VEC128]; uint32x4_t info3_v; uint64x2_t t0, t1; uint32_t num_valid; @@ -226,7 +229,7 @@ recv_burst_vec_neon(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) #endif /* Prefetch four descriptor pairs for next iteration. */ - if (i + RTE_BNXT_DESCS_PER_LOOP < nb_pkts) { + if (i + BNXT_RX_DESCS_PER_LOOP_VEC128 < nb_pkts) { rte_prefetch0(&cp_desc_ring[cons + 8]); rte_prefetch0(&cp_desc_ring[cons + 12]); } @@ -284,7 +287,7 @@ recv_burst_vec_neon(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) rxr); nb_rx_pkts += num_valid; - if (num_valid < RTE_BNXT_DESCS_PER_LOOP) + if (num_valid < BNXT_RX_DESCS_PER_LOOP_VEC128) break; } diff --git a/drivers/net/bnxt/bnxt_rxtx_vec_sse.c b/drivers/net/bnxt/bnxt_rxtx_vec_sse.c index fe074f82cf..c479697ac0 100644 --- a/drivers/net/bnxt/bnxt_rxtx_vec_sse.c +++ b/drivers/net/bnxt/bnxt_rxtx_vec_sse.c @@ -191,17 +191,20 @@ recv_burst_vec_sse(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) * maximum number of packets to receive to be a multiple of the per- * loop count. */ - if (nb_pkts < RTE_BNXT_DESCS_PER_LOOP) - desc_valid_mask >>= 16 * (RTE_BNXT_DESCS_PER_LOOP - nb_pkts); - else - nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_BNXT_DESCS_PER_LOOP); + if (nb_pkts < BNXT_RX_DESCS_PER_LOOP_VEC128) { + desc_valid_mask >>= + 16 * (BNXT_RX_DESCS_PER_LOOP_VEC128 - nb_pkts); + } else { + nb_pkts = + RTE_ALIGN_FLOOR(nb_pkts, BNXT_RX_DESCS_PER_LOOP_VEC128); + } /* Handle RX burst request */ - for (i = 0; i < nb_pkts; i += RTE_BNXT_DESCS_PER_LOOP, - cons += RTE_BNXT_DESCS_PER_LOOP * 2, - mbcons += RTE_BNXT_DESCS_PER_LOOP) { - __m128i rxcmp1[RTE_BNXT_DESCS_PER_LOOP]; - __m128i rxcmp[RTE_BNXT_DESCS_PER_LOOP]; + for (i = 0; i < nb_pkts; i += BNXT_RX_DESCS_PER_LOOP_VEC128, + cons += BNXT_RX_DESCS_PER_LOOP_VEC128 * 2, + mbcons += BNXT_RX_DESCS_PER_LOOP_VEC128) { + __m128i rxcmp1[BNXT_RX_DESCS_PER_LOOP_VEC128]; + __m128i rxcmp[BNXT_RX_DESCS_PER_LOOP_VEC128]; __m128i tmp0, tmp1, info3_v; uint32_t num_valid; @@ -216,7 +219,7 @@ recv_burst_vec_sse(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) #endif /* Prefetch four descriptor pairs for next iteration. 
*/ - if (i + RTE_BNXT_DESCS_PER_LOOP < nb_pkts) { + if (i + BNXT_RX_DESCS_PER_LOOP_VEC128 < nb_pkts) { rte_prefetch0(&cp_desc_ring[cons + 8]); rte_prefetch0(&cp_desc_ring[cons + 12]); } @@ -265,7 +268,7 @@ recv_burst_vec_sse(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) rxr); nb_rx_pkts += num_valid; - if (num_valid < RTE_BNXT_DESCS_PER_LOOP) + if (num_valid < BNXT_RX_DESCS_PER_LOOP_VEC128) break; } @@ -383,7 +386,7 @@ bnxt_xmit_fixed_burst_vec(struct bnxt_tx_queue *txq, struct rte_mbuf **tx_pkts, /* Handle TX burst request */ to_send = nb_pkts; - while (to_send >= RTE_BNXT_DESCS_PER_LOOP) { + while (to_send >= BNXT_TX_DESCS_PER_LOOP) { /* Prefetch next transmit buffer descriptors. */ rte_prefetch0(txbd + 4); rte_prefetch0(txbd + 7); @@ -393,8 +396,8 @@ bnxt_xmit_fixed_burst_vec(struct bnxt_tx_queue *txq, struct rte_mbuf **tx_pkts, bnxt_xmit_one(tx_pkts[2], txbd++, tx_buf++); bnxt_xmit_one(tx_pkts[3], txbd++, tx_buf++); - to_send -= RTE_BNXT_DESCS_PER_LOOP; - tx_pkts += RTE_BNXT_DESCS_PER_LOOP; + to_send -= BNXT_TX_DESCS_PER_LOOP; + tx_pkts += BNXT_TX_DESCS_PER_LOOP; } while (to_send) { diff --git a/drivers/net/bnxt/bnxt_txr.h b/drivers/net/bnxt/bnxt_txr.h index e4bd90f883..6bfdc6d01a 100644 --- a/drivers/net/bnxt/bnxt_txr.h +++ b/drivers/net/bnxt/bnxt_txr.h @@ -11,6 +11,9 @@ #define BNXT_MAX_TSO_SEGS 32 #define BNXT_MIN_PKT_SIZE 52 +/* Number of transmit descriptors processed per inner loop in vector mode. */ +#define BNXT_TX_DESCS_PER_LOOP 4U + struct bnxt_tx_ring_info { uint16_t tx_raw_prod; uint16_t tx_raw_cons; @@ -48,6 +51,10 @@ uint16_t bnxt_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t bnxt_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); #endif +#if defined(RTE_ARCH_X86) && defined(CC_AVX2_SUPPORT) +uint16_t bnxt_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts, + uint16_t nb_pkts); +#endif int bnxt_tx_queue_start(struct rte_eth_dev *dev, uint16_t tx_queue_id); int bnxt_tx_queue_stop(struct rte_eth_dev *dev, uint16_t tx_queue_id); diff --git a/drivers/net/bnxt/meson.build b/drivers/net/bnxt/meson.build index 117c753489..41c4796366 100644 --- a/drivers/net/bnxt/meson.build +++ b/drivers/net/bnxt/meson.build @@ -82,6 +82,23 @@ sources = files( if arch_subdir == 'x86' sources += files('bnxt_rxtx_vec_sse.c') + # compile AVX2 version if either: + # a. we have AVX supported in minimum instruction set baseline + # b. it's not minimum instruction set, but supported by compiler + if cc.get_define('__AVX2__', args: machine_args) != '' + cflags += ['-DCC_AVX2_SUPPORT'] + sources += files('bnxt_rxtx_vec_avx2.c') + elif cc.has_argument('-mavx2') + cflags += ['-DCC_AVX2_SUPPORT'] + bnxt_avx2_lib = static_library('bnxt_avx2_lib', + 'bnxt_rxtx_vec_avx2.c', + dependencies: [static_rte_ethdev, + static_rte_bus_pci, + static_rte_kvargs, static_rte_hash], + include_directories: includes, + c_args: [cflags, '-mavx2']) + objs += bnxt_avx2_lib.extract_objects('bnxt_rxtx_vec_avx2.c') + endif elif arch_subdir == 'arm' and host_machine.cpu_family().startswith('aarch64') sources += files('bnxt_rxtx_vec_neon.c') endif
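For reference, the run-time selection that pairs with the build logic above is
condensed below from bnxt_receive_function() in this patch (the helper name is
hypothetical, and the offload, Truflow, scattered-Rx, and IEEE 1588 checks are
omitted). The AVX2 symbol exists only in builds where meson defined
CC_AVX2_SUPPORT as shown in the meson.build change.

#include <rte_cpuflags.h>
#include <rte_ethdev.h>
#include <rte_vect.h>

#include "bnxt.h"
#include "bnxt_rxr.h"	/* bnxt_recv_pkts, bnxt_recv_pkts_vec[_avx2] */

static eth_rx_burst_t
ex_select_rx_burst(void)
{
#if defined(RTE_ARCH_X86) && defined(CC_AVX2_SUPPORT)
	/* Prefer the 256-bit path when both the CPU and EAL allow it. */
	if (rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_256 &&
	    rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2) == 1)
		return bnxt_recv_pkts_vec_avx2;
#endif
#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
	/* 128-bit SSE or Neon path. */
	if (rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128)
		return bnxt_recv_pkts_vec;
#endif
	return bnxt_recv_pkts;	/* Scalar fallback. */
}

Transmit selection follows the same pattern with bnxt_xmit_pkts_vec_avx2,
bnxt_xmit_pkts_vec, and bnxt_xmit_pkts. Building the AVX2 file as a separate
static library with -mavx2 (the elif branch in meson.build above) lets the
rest of the driver keep the baseline machine flags while still shipping the
AVX2 object whenever the compiler supports it.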