From patchwork Mon Oct 21 15:40:03 2019
From: Marvin Liu
To: maxime.coquelin@redhat.com, tiwei.bie@intel.com, zhihong.wang@intel.com,
 stephen@networkplumber.org, gavin.hu@arm.com
Cc: dev@dpdk.org, Marvin Liu
Date: Mon, 21 Oct 2019 23:40:03 +0800
Message-Id: <20191021154016.16274-1-yong.liu@intel.com>
In-Reply-To: <20191015160739.51940-1-yong.liu@intel.com>
References: <20191015160739.51940-1-yong.liu@intel.com>
Subject: [dpdk-dev] [PATCH v7 00/13] vhost packed ring performance optimization

The packed ring has a more compact ring format and thus significantly
reduces the number of cache misses, which can lead to better performance.
This has been proven in the virtio user driver: on a normal E5 Xeon CPU,
single-core performance rises by 12%.
http://mails.dpdk.org/archives/dev/2018-April/095470.html

However, vhost performance with the packed ring decreased. Analysis showed
that most of the extra cost came from calculating each descriptor's flags,
which depend on the ring wrap counter. Moreover, both frontend and backend
need to write the same descriptors, which causes cache contention.
Especially during vhost enqueue, the virtio packed ring refill function may
write to the same cache line that the vhost enqueue function is writing.
This extra cache cost erodes the benefit of the reduced cache misses.

To optimize vhost packed ring performance, the vhost enqueue and dequeue
functions are split into fast and normal paths. The fast path uses several
methods:

- Handle descriptors in one cache line as a batch.
- Split the loop function into more pieces and unroll them.
- Check in advance whether I/O space can be copied directly into mbuf
  space, and vice versa.
- Check in advance whether descriptor mapping succeeded.
- Distinguish the vhost used ring update function for enqueue and dequeue.
- Buffer dequeue used descriptors as long as possible.
- Update enqueue used descriptors one cache line at a time.

With all these methods applied, single-core vhost PvP performance with 64B
packets on a Xeon 8180 improves by 35%.
v7:
- Rebase code
- Rename unroll macro and definitions
- Calculate flags when doing single dequeue

v6:
- Fix dequeue zcopy result check

v5:
- Remove disable sw prefetch as performance impact is small
- Change unroll pragma macro format
- Rename shadow counter elements names
- Clean dequeue update check condition
- Add inline functions replace of duplicated code
- Unify code style

v4:
- Support meson build
- Remove memory region cache for no clear performance gain and ABI break
- Not assume ring size is power of two

v3:
- Check available index overflow
- Remove dequeue remained descs number check
- Remove changes in split ring datapath
- Call memory write barriers once when updating used flags
- Rename some functions and macros
- Code style optimization

v2:
- Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
- Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
- Optimize dequeue used ring update when in_order negotiated

Marvin Liu (13):
  vhost: add packed ring indexes increasing function
  vhost: add packed ring single enqueue
  vhost: try to unroll for each loop
  vhost: add packed ring batch enqueue
  vhost: add packed ring single dequeue
  vhost: add packed ring batch dequeue
  vhost: flush enqueue updates by cacheline
  vhost: flush batched enqueue descs directly
  vhost: buffer packed ring dequeue updates
  vhost: optimize packed ring enqueue
  vhost: add packed ring zcopy batch and single dequeue
  vhost: optimize packed ring dequeue
  vhost: optimize packed ring dequeue when in-order

 lib/librte_vhost/Makefile     |  18 +
 lib/librte_vhost/meson.build  |   7 +
 lib/librte_vhost/vhost.h      |  57 ++
 lib/librte_vhost/virtio_net.c | 945 +++++++++++++++++++++++++++-------
 4 files changed, 834 insertions(+), 193 deletions(-)