[v2,00/16] vhost packed ring performance optimization

Message ID 20190919163643.24130-1-yong.liu@intel.com
Message

Liu, Yong Sept. 19, 2019, 4:36 p.m. UTC
The packed ring has a more compact ring format and can therefore
significantly reduce the number of cache misses, which can lead to
better performance. This has been proven in the virtio user driver: on a
normal E5 Xeon CPU, single-core performance can rise by 12%.

http://mails.dpdk.org/archives/dev/2018-April/095470.html

However, vhost performance with the packed ring decreased. Analysis
showed that most of the extra cost came from calculating each
descriptor's flags, which depend on the ring wrap counter. Moreover, both
frontend and backend need to write the same descriptors, which causes
cache contention. In particular, while vhost is running its enqueue
function, the virtio packed ring refill function may write to the same
cache line. This extra cache cost reduces the benefit gained from fewer
cache misses.
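
For illustration, the per-descriptor flag calculation described above can
be sketched as follows. The bit positions follow the virtio 1.1 packed
ring layout; the struct and function names (packed_desc,
sketch_desc_is_avail, sketch_used_flags) are hypothetical and are not
taken from this series. The sketch also leaves out the memory barriers
and volatile accesses that real ring code needs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of a packed ring descriptor (virtio 1.1 layout). */
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

/* Illustrative 16-byte packed descriptor; the name is hypothetical. */
struct packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

/*
 * A descriptor is available to the device when its AVAIL bit matches
 * the current wrap counter and its USED bit does not.
 */
static inline bool
sketch_desc_is_avail(const struct packed_desc *desc, bool wrap_counter)
{
	uint16_t flags = desc->flags;
	bool avail = !!(flags & DESC_F_AVAIL);
	bool used = !!(flags & DESC_F_USED);

	return avail == wrap_counter && used != wrap_counter;
}

/*
 * Flags written back when marking a descriptor used: both bits are set
 * equal to the wrap counter, so the value flips on every ring wrap.
 */
static inline uint16_t
sketch_used_flags(bool wrap_counter)
{
	return wrap_counter ? (uint16_t)(DESC_F_AVAIL | DESC_F_USED) : 0;
}
```

Because the expected flag value changes each time the ring wraps, this
check has to be evaluated against the wrap counter for every descriptor,
which is the cost the analysis above refers to.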

To optimize vhost packed ring performance, the vhost enqueue and dequeue
functions are split into a fast path and a normal path.

Several methods are used in the fast path:
  Unroll the burst loop function into more pieces.
  Handle the descriptors in one cache line simultaneously.
  Check up front whether I/O space can be copied directly into mbuf
    space and vice versa.
  Check up front whether descriptor mapping is successful.
  Use separate vhost used ring update functions for enqueue and
    dequeue.
  Buffer as many dequeue used descriptors as possible.
  Update enqueue used descriptors by cache line.
  Cache the memory region structure for fast conversion.
  Disable software prefetch if hardware can do better.
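
A minimal sketch of how the choice between fast path and normal path
might look, assuming 16-byte descriptors and a 64-byte cache line so
that four descriptors share one line. All names here (packed_desc,
SKETCH_BATCH_SIZE, sketch_batch_is_avail, sketch_next_step) are
hypothetical and not taken from the patches; the copy and used ring
update steps are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

#define SKETCH_CACHE_LINE 64
#define SKETCH_BATCH_SIZE 4	/* descriptors per cache line (assumed) */

/* Illustrative 16-byte packed descriptor; the name is hypothetical. */
struct packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

_Static_assert(sizeof(struct packed_desc) * SKETCH_BATCH_SIZE ==
	       SKETCH_CACHE_LINE, "one batch must fill one cache line");

static inline bool
sketch_desc_is_avail(const struct packed_desc *desc, bool wrap_counter)
{
	bool avail = !!(desc->flags & DESC_F_AVAIL);
	bool used = !!(desc->flags & DESC_F_USED);

	return avail == wrap_counter && used != wrap_counter;
}

/* Fast path precondition: one whole cache line of available descriptors. */
static bool
sketch_batch_is_avail(const struct packed_desc *ring, uint16_t ring_size,
		      uint16_t idx, bool wrap_counter)
{
	uint16_t i;

	/* A batch must start on a cache line boundary and must not wrap. */
	if ((idx % SKETCH_BATCH_SIZE) != 0 ||
	    idx + SKETCH_BATCH_SIZE > ring_size)
		return false;

	for (i = 0; i < SKETCH_BATCH_SIZE; i++) {
		if (!sketch_desc_is_avail(&ring[idx + i], wrap_counter))
			return false;
	}
	return true;
}

/*
 * Number of descriptors to consume next: a full batch on the fast path,
 * one on the normal path, zero if nothing is available.
 */
static uint16_t
sketch_next_step(const struct packed_desc *ring, uint16_t ring_size,
		 uint16_t idx, bool wrap_counter)
{
	if (sketch_batch_is_avail(ring, ring_size, idx, wrap_counter))
		return SKETCH_BATCH_SIZE;
	if (sketch_desc_is_avail(&ring[idx], wrap_counter))
		return 1;
	return 0;
}
```

Doing all prerequisite checks before touching any packet data keeps the
fast path free of per-descriptor branches; anything that fails a check
simply falls back to the single-packet path.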

With all these methods applied, single-core vhost PvP performance with
64B packets on Xeon 8180 can improve by 40%.

v2:
- Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
- Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
- Optimize dequeue used ring update when in_order negotiated
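
As a rough sketch of the first v2 item, a per-compiler unroll pragma can
be selected with predefined compiler macros. The macro and function
names (SKETCH_UNROLL_4, sketch_sum_lens) are hypothetical, and the
series itself may detect the compiler differently, for example from the
build system. The GCC form assumes GCC 8 or newer, where
"#pragma GCC unroll" is available.

```c
#include <stdint.h>

/*
 * ICC and clang also define __GNUC__, so they are tested first.
 * Compilers without a known unroll pragma get an empty macro.
 */
#if defined(__INTEL_COMPILER)
#define SKETCH_UNROLL_4 _Pragma("unroll (4)")
#elif defined(__clang__)
#define SKETCH_UNROLL_4 _Pragma("unroll 4")
#elif defined(__GNUC__) && (__GNUC__ >= 8)
#define SKETCH_UNROLL_4 _Pragma("GCC unroll 4")
#else
#define SKETCH_UNROLL_4
#endif

/* Sum four descriptor lengths with a loop the compiler is asked to unroll. */
static uint32_t
sketch_sum_lens(const uint32_t lens[4])
{
	uint32_t total = 0;
	int i;

	SKETCH_UNROLL_4
	for (i = 0; i < 4; i++)
		total += lens[i];

	return total;
}
```

The pragma is only a hint, so the loop behaves the same whether or not
the compiler honours it.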

Marvin Liu (16):
  vhost: add single packet enqueue function
  vhost: unify unroll pragma parameter
  vhost: add burst enqueue function for packed ring
  vhost: add single packet dequeue function
  vhost: add burst dequeue function
  vhost: rename flush shadow used ring functions
  vhost: flush vhost enqueue shadow ring by burst
  vhost: add flush function for burst enqueue
  vhost: buffer vhost dequeue shadow ring
  vhost: split enqueue and dequeue flush functions
  vhost: optimize enqueue function of packed ring
  vhost: add burst and single zero dequeue functions
  vhost: optimize dequeue function of packed ring
  vhost: cache address translation result
  vhost: check whether disable software pre-fetch
  vhost: optimize packed ring dequeue when in-order

 lib/librte_vhost/Makefile     |   24 +
 lib/librte_vhost/rte_vhost.h  |   27 +
 lib/librte_vhost/vhost.h      |   33 +
 lib/librte_vhost/virtio_net.c | 1071 +++++++++++++++++++++++++++------
 4 files changed, 960 insertions(+), 195 deletions(-)

Comments

Gavin Hu (Arm Technology China) Sept. 23, 2019, 9:05 a.m. UTC | #1
Hi Marvin,

A general comment for the series, could you mark V1 Superseded?

/Gavin

Liu, Yong Sept. 23, 2019, 9:29 a.m. UTC | #2
Sure, have changed state of V1.
