From: Feifei Wang
To: Ruifeng Wang, Beilei Xing
Cc: dev@dpdk.org, nd@arm.com, Feifei Wang
Date: Fri, 23 Jul 2021 11:10:49 +0800
Message-Id: <20210723031049.2201665-5-feifei.wang2@arm.com>
In-Reply-To: <20210723031049.2201665-1-feifei.wang2@arm.com>
Subject: [dpdk-dev] [PATCH v1 4/4] net/i40e: change code order to reduce L1 cache misses

On the N1 platform, the loads of the packet mbufs and of the descriptors
are hot spots that limit the performance of the
"desc_to_ptype_v" and "desc_to_olflags_v"
functions in the i40e Rx NEON path. This is because the packet mbufs and
descriptors have already been evicted from the L1D cache to the L2D
cache when these functions access them. To reduce L1D cache misses and
improve performance, change the code order so that "desc_to_ptype_v"
and "desc_to_olflags_v" run right where the packet mbufs and descriptors
have just been loaded.

Test results:
DPDK: 21.08-rc1
Compiler: gcc-9
N1SDP: the patch improves performance by 1.8%.
ThunderX2: no performance change.

Signed-off-by: Feifei Wang
Reviewed-by: Ruifeng Wang
---
 drivers/net/i40e/i40e_rxtx_vec_neon.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 8f3188e910..b2683fda60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -301,18 +301,6 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
 		}
 
-		/* C.1 4=>2 filter staterr info only */
-		sterr_tmp2 = vzipq_u16(vreinterpretq_u16_u64(descs[1]),
-				       vreinterpretq_u16_u64(descs[3]));
-		sterr_tmp1 = vzipq_u16(vreinterpretq_u16_u64(descs[0]),
-				       vreinterpretq_u16_u64(descs[2]));
-
-		/* C.2 get 4 pkts staterr value */
-		staterr = vzipq_u16(sterr_tmp1.val[1],
-				    sterr_tmp2.val[1]).val[0];
-
-		desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
-
 		/* pkts shift the pktlen field to be 16-bit aligned*/
 		uint32x4_t len3 = vshlq_u32(vreinterpretq_u32_u64(descs[3]),
 					    len_shl);
@@ -367,10 +355,22 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 
 		desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
 
+		desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
+
 		if (likely(pos + RTE_I40E_DESCS_PER_LOOP < nb_pkts)) {
 			rte_prefetch_non_temporal(rxdp + RTE_I40E_DESCS_PER_LOOP);
 		}
 
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = vzipq_u16(vreinterpretq_u16_u64(descs[1]),
+				       vreinterpretq_u16_u64(descs[3]));
+		sterr_tmp1 = vzipq_u16(vreinterpretq_u16_u64(descs[0]),
+				       vreinterpretq_u16_u64(descs[2]));
+
+		/* C.2 get 4 pkts staterr value */
+		staterr = vzipq_u16(sterr_tmp1.val[1],
+				    sterr_tmp2.val[1]).val[0];
+
 		/* C* extract and record EOP bit */
 		if (split_packet) {
 			uint8x16_t eop_shuf_mask = {