From patchwork Wed Oct 9 13:38:35 2019
X-Patchwork-Submitter: Marvin Liu
X-Patchwork-Id: 60733
From: Marvin Liu
To: maxime.coquelin@redhat.com, tiwei.bie@intel.com, zhihong.wang@intel.com,
 stephen@networkplumber.org, gavin.hu@arm.com
Cc: dev@dpdk.org, Marvin Liu
Date: Wed, 9 Oct 2019 21:38:35 +0800
Message-Id: <20191009133849.69002-1-yong.liu@intel.com>
In-Reply-To: <20190925171329.63734-1-yong.liu@intel.com>
References: <20190925171329.63734-1-yong.liu@intel.com>
Subject: [dpdk-dev] [PATCH v4 00/14] vhost packed ring performance optimization

The packed ring has a more compact ring format and can therefore
significantly reduce the number of cache misses, leading to better
performance. This has been proved in the virtio user driver: on a
normal E5 Xeon CPU, single core performance can rise by 12%.
http://mails.dpdk.org/archives/dev/2018-April/095470.html

However, vhost performance with the packed ring decreased. Analysis
showed that most of the extra cost came from calculating each
descriptor's flags, which depend on the ring wrap counter. Moreover,
both frontend and backend need to write the same descriptors, which
causes cache contention. In particular, during vhost enqueue, the
virtio packed ring refill function may write the same cache line that
the vhost enqueue function is writing. This extra cache cost reduces
the benefit of the lower cache miss count.

To optimize vhost packed ring performance, the vhost enqueue and
dequeue functions are split into fast and normal paths. The fast path
uses several methods:

  - Handle descriptors in one cache line as a batch.
  - Split loop functions into smaller pieces and unroll them.
  - Check up front whether I/O space can be copied directly into mbuf
    space, and vice versa.
  - Check up front whether descriptor mapping is successful.
  - Use separate vhost used ring update functions for enqueue and
    dequeue.
  - Buffer as many dequeue used descriptors as possible.
  - Update enqueue used descriptors one cache line at a time.
  - Disable software prefetch if hardware can do better.

With all these methods applied, single core vhost PvP performance with
64B packets on a Xeon 8180 can boost by 40%.
v4:
- Support meson build
- Remove memory region cache for no clear performance gain and ABI break
- Not assume ring size is power of two

v3:
- Check available index overflow
- Remove dequeue remained descs number check
- Remove changes in split ring datapath
- Call memory write barriers once when updating used flags
- Rename some functions and macros
- Code style optimization

v2:
- Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
- Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
- Optimize dequeue used ring update when in_order negotiated

Marvin Liu (14):
  vhost: add single packet enqueue function
  vhost: unify unroll pragma parameter
  vhost: add batch enqueue function for packed ring
  vhost: add single packet dequeue function
  vhost: add batch dequeue function
  vhost: flush vhost enqueue shadow ring by batch
  vhost: add flush function for batch enqueue
  vhost: buffer vhost dequeue shadow ring
  vhost: split enqueue and dequeue flush functions
  vhost: optimize enqueue function of packed ring
  vhost: add batch and single zero dequeue functions
  vhost: optimize dequeue function of packed ring
  vhost: check whether disable software pre-fetch
  vhost: optimize packed ring dequeue when in-order

 lib/librte_vhost/Makefile     |  24 +
 lib/librte_vhost/meson.build  |  11 +
 lib/librte_vhost/vhost.h      |  33 ++
 lib/librte_vhost/virtio_net.c | 993 +++++++++++++++++++++++++++-------
 4 files changed, 878 insertions(+), 183 deletions(-)