From patchwork Fri Nov 6 00:04:42 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alexander Kozyrev
X-Patchwork-Id: 83779
X-Patchwork-Delegate: rasland@nvidia.com
From: Alexander Kozyrev
To: dev@dpdk.org
Cc: rasland@nvidia.com, viacheslavo@nvidia.com, matan@nvidia.com
Date: Fri, 6 Nov 2020 00:04:42 +0000
Message-Id: <20201106000442.26059-1-akozyrev@nvidia.com>
X-Mailer: git-send-email 2.24.1
Subject: [dpdk-dev] [PATCH] net/mlx5: improve vMPRQ descriptors allocation locality
List-Id: DPDK patches and discussions

The replenish scheme used in the vectorized Rx burst carries a
performance penalty for both MPRQ and SPRQ. Mbuf elements are filled
at the end of the mbufs array and replenished at the beginning. That
increases cache misses and degrades performance. The more Rx
descriptors are used, the worse the effect.

Change the allocation scheme for the vectorized MPRQ Rx burst: allocate
new mbufs only when the consumed mbufs are almost depleted, always
keeping a gap of one burst between the allocated and consumed indices.
Keeping only a small number of mbufs allocated improves cache locality
and improves performance considerably.

Unfortunately, this approach cannot be applied to the SPRQ Rx burst
routine. In the MPRQ Rx burst we simply copy packets from external MPRQ
buffers or attach these buffers to mbufs. In the SPRQ Rx burst the NIC
fills the mbufs for us, so keeping only a small number of allocated
mbufs limits how many buffers the NIC can fill. This offsets the
advantage of better cache locality.

Fixes: f2fa5327ff ("net/mlx5: implement vectorized MPRQ burst")

Signed-off-by: Alexander Kozyrev
---
 drivers/net/mlx5/mlx5_rxtx_vec.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index 469ea8401d..8001ab6eb3 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -145,16 +145,16 @@ mlx5_rx_mprq_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
 	const uint32_t strd_n = 1 << rxq->strd_num_n;
 	const uint32_t elts_n = wqe_n * strd_n;
 	const uint32_t wqe_mask = elts_n - 1;
-	uint32_t n = elts_n - (rxq->elts_ci - rxq->rq_pi);
+	uint32_t n = rxq->elts_ci - rxq->rq_pi;
 	uint32_t elts_idx = rxq->elts_ci & wqe_mask;
 	struct rte_mbuf **elts = &(*rxq->elts)[elts_idx];
 
-	/* Not to cross queue end. */
-	if (n >= rxq->rq_repl_thresh) {
+	if (n <= rxq->rq_repl_thresh) {
 		MLX5_ASSERT(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH(elts_n));
 		MLX5_ASSERT(MLX5_VPMD_RXQ_RPLNSH_THRESH(elts_n) >
 			     MLX5_VPMD_DESCS_PER_LOOP);
-		n = RTE_MIN(n, elts_n - elts_idx);
+		/* Not to cross queue end. */
+		n = RTE_MIN(n + MLX5_VPMD_RX_MAX_BURST, elts_n - elts_idx);
 		if (rte_mempool_get_bulk(rxq->mp, (void *)elts, n) < 0) {
 			rxq->stats.rx_nombuf += n;
 			return;
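
For illustration, the replenish arithmetic on the "+" side of the hunk
can be modelled as a standalone helper. This is only a sketch under
stated assumptions, not the driver code: replenish_count() and its
parameter names are hypothetical, "thresh" stands in for
rxq->rq_repl_thresh, "burst" stands in for MLX5_VPMD_RX_MAX_BURST, and
elts_ci / rq_pi are assumed to be free-running 32-bit indices of
allocated and consumed elements.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of the replenish decision (hypothetical helper,
 * not part of the mlx5 PMD).
 *
 * elts_ci - free-running index of allocated (replenished) elements
 * rq_pi   - free-running index of consumed elements
 * elts_n  - ring size, assumed to be a power of two
 * thresh  - replenish threshold
 * burst   - maximum Rx burst size
 *
 * Returns the number of mbufs to request from the pool, or 0 when no
 * replenish is needed yet.
 */
static uint32_t
replenish_count(uint32_t elts_ci, uint32_t rq_pi, uint32_t elts_n,
		uint32_t thresh, uint32_t burst)
{
	uint32_t n, room;

	assert(elts_n != 0 && (elts_n & (elts_n - 1)) == 0);
	/* Allocated but not yet consumed; unsigned wrap-around is fine. */
	n = elts_ci - rq_pi;
	if (n > thresh)
		return 0; /* Enough mbufs are still outstanding. */
	/* Add one burst on top, but do not cross the end of the ring. */
	n += burst;
	room = elts_n - (elts_ci & (elts_n - 1));
	return n < room ? n : room;
}
```

With a 1024-entry ring, thresh = 64 and burst = 64, the helper returns
0 while more than 64 elements are outstanding, and otherwise requests
the outstanding count plus one burst, clamped to the space left before
the ring end.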