Message ID: 20191021220813.55236-1-yong.liu@intel.com (mailing list archive)

Headers:
Return-Path: <dev-bounces@dpdk.org>
From: Marvin Liu <yong.liu@intel.com>
To: maxime.coquelin@redhat.com, tiwei.bie@intel.com, zhihong.wang@intel.com, stephen@networkplumber.org, gavin.hu@arm.com
Cc: dev@dpdk.org, Marvin Liu <yong.liu@intel.com>
Date: Tue, 22 Oct 2019 06:08:00 +0800
Message-Id: <20191021220813.55236-1-yong.liu@intel.com>
In-Reply-To: <20191021154016.16274-1-yong.liu@intel.com>
References: <20191021154016.16274-1-yong.liu@intel.com>
Subject: [dpdk-dev] [PATCH v8 00/13] vhost packed ring performance optimization
List-Id: DPDK patches and discussions <dev.dpdk.org>

Series: vhost packed ring performance optimization
Message
Marvin Liu
Oct. 21, 2019, 10:08 p.m. UTC
The packed ring has a more compact ring format and can therefore significantly reduce the number of cache misses, leading to better performance. This has already been proven in the virtio user driver: on a typical E5 Xeon CPU, single-core performance rises by 12%.

http://mails.dpdk.org/archives/dev/2018-April/095470.html

However, vhost performance with the packed ring decreased. Analysis showed that most of the extra cost came from calculating each descriptor's flags, which depend on the ring wrap counter. Moreover, both frontend and backend need to write the same descriptors, which causes cache contention. In particular, while vhost is running its enqueue function, the virtio packed-ring refill function may write the same cache line. This extra cache cost reduces the benefit of the fewer cache misses.

To optimize vhost packed ring performance, the vhost enqueue and dequeue functions are split into a fast path and a normal path.

The fast path uses several methods:
Handle the descriptors in one cache line as a batch.
Split the loop into more pieces and unroll them.
Check up front whether I/O space can be copied directly into mbuf space, and vice versa.
Check up front whether descriptor mapping succeeded.
Use separate vhost used-ring update functions for enqueue and dequeue.
Buffer as many dequeue used descriptors as possible.
Update enqueue used descriptors one cache line at a time.

With all these methods applied, single-core vhost PvP performance with 64B packets on a Xeon 8180 improves by 35%.
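As a rough illustration of the wrap-counter-dependent flag handling and per-cache-line batch check described above (a minimal sketch following the virtio 1.1 packed-ring layout; the names, the batch size, and the helpers are illustrative assumptions, not the actual DPDK implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Packed-ring descriptor flag bits, per the virtio 1.1 spec. */
#define VRING_DESC_F_AVAIL (1 << 7)
#define VRING_DESC_F_USED  (1 << 15)
/* Assumed: four 16B packed descriptors fit in one 64B cache line. */
#define PACKED_BATCH_SIZE  4

/* Flags the driver writes to make a descriptor available: AVAIL mirrors
 * the driver's ring wrap counter, USED holds its inverse. */
static inline uint16_t
avail_flags(bool wrap_counter)
{
	return (wrap_counter ? VRING_DESC_F_AVAIL : 0) |
	       (wrap_counter ? 0 : VRING_DESC_F_USED);
}

/* A descriptor is available to the backend when AVAIL matches the
 * backend's wrap counter and USED does not. */
static inline bool
desc_is_avail(uint16_t flags, bool wrap_counter)
{
	return wrap_counter == !!(flags & VRING_DESC_F_AVAIL) &&
	       wrap_counter != !!(flags & VRING_DESC_F_USED);
}

/* Batch check: every descriptor in the cache line must be available
 * before the fast path handles the whole line at once. */
static bool
batch_is_avail(const uint16_t *flags, bool wrap_counter)
{
	int i;

	for (i = 0; i < PACKED_BATCH_SIZE; i++)
		if (!desc_is_avail(flags[i], wrap_counter))
			return false;
	return true;
}
```

The point of the batch check is that a partial batch falls back to the single-descriptor (normal) path, so the fast path never has to recompute per-descriptor state mid-line.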
v8:
- Allocate mbuf by virtio_dev_pktmbuf_alloc

v7:
- Rebase code
- Rename unroll macro and definitions
- Calculate flags when doing single dequeue

v6:
- Fix dequeue zcopy result check

v5:
- Remove disable sw prefetch as performance impact is small
- Change unroll pragma macro format
- Rename shadow counter elements names
- Clean dequeue update check condition
- Add inline functions replace of duplicated code
- Unify code style

v4:
- Support meson build
- Remove memory region cache for no clear performance gain and ABI break
- Not assume ring size is power of two

v3:
- Check available index overflow
- Remove dequeue remained descs number check
- Remove changes in split ring datapath
- Call memory write barriers once when updating used flags
- Rename some functions and macros
- Code style optimization

v2:
- Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
- Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
- Optimize dequeue used ring update when in_order negotiated

Marvin Liu (13):
  vhost: add packed ring indexes increasing function
  vhost: add packed ring single enqueue
  vhost: try to unroll for each loop
  vhost: add packed ring batch enqueue
  vhost: add packed ring single dequeue
  vhost: add packed ring batch dequeue
  vhost: flush enqueue updates by cacheline
  vhost: flush batched enqueue descs directly
  vhost: buffer packed ring dequeue updates
  vhost: optimize packed ring enqueue
  vhost: add packed ring zcopy batch and single dequeue
  vhost: optimize packed ring dequeue
  vhost: optimize packed ring dequeue when in-order

 lib/librte_vhost/Makefile     |  18 +
 lib/librte_vhost/meson.build  |   7 +
 lib/librte_vhost/vhost.h      |  57 ++
 lib/librte_vhost/virtio_net.c | 948 +++++++++++++++++++++++++++-------
 4 files changed, 837 insertions(+), 193 deletions(-)
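The "flush enqueue updates by cacheline" and "buffer packed ring dequeue updates" patches in the list above amount to staging used-ring writes in a shadow buffer and flushing a full cache line's worth at once. A hedged sketch of that idea (the names, sizes, and structure are illustrative assumptions, not the DPDK code):

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_DESCS 4   /* assumed: used elems per 64B cache line */
#define RING_SIZE        256

struct used_elem {
	uint16_t id;
	uint32_t len;
};

/* Shadow buffer: used-ring updates are staged here and written back one
 * full cache line at a time, so the backend dirties each shared cache
 * line once instead of once per descriptor. */
struct shadow_used {
	struct used_elem elems[RING_SIZE];
	int count;
};

static int flush_calls; /* instrumentation for this example only */

static void
flush_line(struct shadow_used *s)
{
	/* A real implementation would copy s->elems[] into the shared
	 * used ring and issue a single write barrier here. */
	flush_calls++;
	s->count = 0;
}

static void
shadow_used_add(struct shadow_used *s, uint16_t id, uint32_t len)
{
	s->elems[s->count].id = id;
	s->elems[s->count].len = len;
	s->count++;
	if (s->count == CACHE_LINE_DESCS)
		flush_line(s);
}
```

With this scheme, enqueueing eight descriptors triggers only two write-backs to the shared ring rather than eight, which is the cache-contention saving the cover letter claims.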
Comments
I get some checkpatch warnings, and build fails with clang.
Could you please fix these issues and send v9?

Thanks,
Maxime

### [PATCH] vhost: try to unroll for each loop

WARNING:CAMELCASE: Avoid CamelCase: <_Pragma>
#78: FILE: lib/librte_vhost/vhost.h:47:
+#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \

ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parenthesis
#78: FILE: lib/librte_vhost/vhost.h:47:
+#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll 4") \
+	for (iter = val; iter < size; iter++)

ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parenthesis
#83: FILE: lib/librte_vhost/vhost.h:52:
+#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)

ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in parenthesis
#88: FILE: lib/librte_vhost/vhost.h:57:
+#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)

total: 3 errors, 1 warnings, 67 lines checked

0/1 valid patch

/tmp/dpdk_build/lib/librte_vhost/virtio_net.c:2065:1:
error: unused function 'free_zmbuf' [-Werror,-Wunused-function]
free_zmbuf(struct vhost_virtqueue *vq)
^
1 error generated.
make[5]: *** [virtio_net.o] Error 1
make[4]: *** [librte_vhost] Error 2
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [lib] Error 2
make[2]: *** [all] Error 2
make[1]: *** [pre_install] Error 2
make: *** [install] Error 2

On 10/22/19 12:08 AM, Marvin Liu wrote:
> [snip: cover letter quoted in full]
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Thursday, October 24, 2019 2:50 PM
> To: Liu, Yong <yong.liu@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; stephen@networkplumber.org; gavin.hu@arm.com
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v8 00/13] vhost packed ring performance optimization
>
> I get some checkpatch warnings, and build fails with clang.
> Could you please fix these issues and send v9?

Hi Maxime,
Clang build fails will be fixed in v9. For checkpatch warning, it was due to pragma string inside.
Previous version can avoid such warning, while format is a little messy as below.
I prefer to keep code clean and more readable. How about your idea?

+#ifdef UNROLL_PRAGMA_PARAM
+#define VHOST_UNROLL_PRAGMA(param) _Pragma(param)
+#else
+#define VHOST_UNROLL_PRAGMA(param) do {} while (0);
+#endif

+	VHOST_UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
+	for (i = 0; i < PACKED_BATCH_SIZE; i++)

Regards,
Marvin

> [snip: checkpatch output and cover letter quoted in full]
On 10/24/19 9:18 AM, Liu, Yong wrote:
> Hi Maxime,
> Clang build fails will be fixed in v9. For checkpatch warning, it was due to pragma string inside.
> Previous version can avoid such warning, while format is a little messy as below.
> I prefer to keep code clean and more readable. How about your idea?
>
> +#ifdef UNROLL_PRAGMA_PARAM
> +#define VHOST_UNROLL_PRAGMA(param) _Pragma(param)
> +#else
> +#define VHOST_UNROLL_PRAGMA(param) do {} while (0);
> +#endif
>
> +	VHOST_UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
> +	for (i = 0; i < PACKED_BATCH_SIZE; i++)

That's less clean indeed. I agree to waive the checkpatch errors.
Just fix the Clang build for patch 8 and we're good.

Thanks,
Maxime

> [snip]
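For context on the waived checkpatch errors: the macro under discussion expands to a `_Pragma` followed by a for-loop header, and a loop header by nature cannot be wrapped in parentheses, so COMPLEX_MACRO is a false positive here. A hedged sketch of such a helper (the compiler detection below is a simplified assumption, not the exact DPDK build logic):

```c
#include <assert.h>

/* Try to ask the compiler to unroll the following loop by 4. Clang
 * understands "#pragma unroll N"; GCC 8+ understands "GCC unroll N";
 * other compilers just get a plain loop. */
#if defined(__clang__)
#define vhost_for_each_try_unroll(iter, val, size) \
	_Pragma("unroll 4") \
	for ((iter) = (val); (iter) < (size); (iter)++)
#elif defined(__GNUC__) && __GNUC__ >= 8
#define vhost_for_each_try_unroll(iter, val, size) \
	_Pragma("GCC unroll 4") \
	for ((iter) = (val); (iter) < (size); (iter)++)
#else
#define vhost_for_each_try_unroll(iter, val, size) \
	for ((iter) = (val); (iter) < (size); (iter)++)
#endif

/* Example use: sum a small batch; the compiler may fully unroll it. */
int
sum_batch(const int *v, int n)
{
	int i, sum = 0;

	vhost_for_each_try_unroll(i, 0, n)
		sum += v[i];
	return sum;
}
```

The unrolling is purely a hint, so behavior is identical on every compiler; only the generated code differs, which is why the plain-loop fallback is safe.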
Thanks, Maxime. Just sent out v9.

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Thursday, October 24, 2019 4:25 PM
> To: Liu, Yong <yong.liu@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; stephen@networkplumber.org; gavin.hu@arm.com
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v8 00/13] vhost packed ring performance optimization
>
> That's less clean indeed. I agree to waive the checkpatch errors.
> Just fix the Clang build for patch 8 and we're good.
>
> Thanks,
> Maxime
>
> [snip]