[v6,00/11] implement packed virtqueues

Message ID 20180921103308.16357-1-jfreimann@redhat.com

Message

Jens Freimann Sept. 21, 2018, 10:32 a.m. UTC
This is a basic implementation of packed virtqueues as specified in the
Virtio 1.1 draft. A compiled version of the current draft is available
at https://github.com/oasis-tcs/virtio-docs.git (or as .pdf at
https://github.com/oasis-tcs/virtio-docs/blob/master/virtio-v1.1-packed-wd10.pdf).

A packed virtqueue differs from a split virtqueue in that it consists
of a single descriptor ring, which replaces the split layout's
descriptor table, available ring and used ring, including their indexes.
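
For readers who do not have the spec at hand, the two layouts can be sketched roughly as follows. This is only an illustration: the struct names are made up for this note and are not the names used in the patches, and all fields are little-endian on the wire, which the sketch does not model.

```c
#include <assert.h>
#include <stdint.h>

/* Split virtqueue (virtio 1.0): three separate memory areas. */
struct split_desc {			/* descriptor table entry */
	uint64_t addr;			/* guest-physical buffer address */
	uint32_t len;			/* buffer length in bytes */
	uint16_t flags;			/* NEXT / WRITE / INDIRECT */
	uint16_t next;			/* next descriptor in a chain */
};

struct split_avail {			/* written by the driver */
	uint16_t flags;
	uint16_t idx;			/* next slot the driver will fill */
	uint16_t ring[];		/* heads of available chains */
};

struct split_used_elem {
	uint32_t id;			/* head of the chain that was used */
	uint32_t len;			/* bytes written by the device */
};

struct split_used {			/* written by the device */
	uint16_t flags;
	uint16_t idx;			/* next slot the device will fill */
	struct split_used_elem ring[];
};

/*
 * Packed virtqueue (virtio 1.1 draft): one ring of descriptors that
 * both sides read and write. Availability and use are signalled in
 * the flags field, so no separate avail/used rings or indexes exist.
 */
struct packed_desc {
	uint64_t addr;			/* guest-physical buffer address */
	uint32_t len;			/* buffer length in bytes */
	uint16_t id;			/* buffer id, echoed back when used */
	uint16_t flags;			/* includes the AVAIL and USED bits */
};

/* Both descriptor formats occupy 16 bytes. */
static_assert(sizeof(struct split_desc) == 16, "split desc is 16 bytes");
static_assert(sizeof(struct packed_desc) == 16, "packed desc is 16 bytes");
```

Note that the packed descriptor trades the `next` field of the split descriptor for an `id` field; the rest of the layout is unchanged.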

Each descriptor is readable and writable and has a flags field. These flags
mark whether a descriptor is available or used.  To detect newly available
descriptors even after the ring has wrapped, device and driver each keep a
single-bit wrap counter that flips between 0 and 1 every time
the last descriptor in the ring is used/made available.
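
The flag and wrap counter rules can be sketched as below. This is a minimal illustration based on my reading of the 1.1 draft (AVAIL is bit 7 and USED is bit 15 of the flags, wrap counters start at 1); the helper names are hypothetical and are not the helpers added by this series, and memory barriers are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_F_AVAIL	(1u << 7)	/* bit positions as in the draft */
#define DESC_F_USED	(1u << 15)

/* Position in the ring plus the single-bit wrap counter. */
struct ring_pos {
	uint16_t idx;
	bool wrap;			/* initial value is 1 (true) */
};

/* Driver side: AVAIL matches the wrap counter, USED is its inverse. */
static inline uint16_t flags_make_avail(bool wrap)
{
	return wrap ? DESC_F_AVAIL : DESC_F_USED;
}

/* Device side: both AVAIL and USED match the wrap counter. */
static inline uint16_t flags_make_used(bool wrap)
{
	return wrap ? (DESC_F_AVAIL | DESC_F_USED) : 0;
}

/* Device view: has the driver made this descriptor available? */
static inline bool desc_is_avail(uint16_t flags, bool wrap)
{
	bool avail = (flags & DESC_F_AVAIL) != 0;
	bool used = (flags & DESC_F_USED) != 0;

	return avail == wrap && used != wrap;
}

/* Driver view: has the device marked this descriptor used? */
static inline bool desc_is_used(uint16_t flags, bool wrap)
{
	bool avail = (flags & DESC_F_AVAIL) != 0;
	bool used = (flags & DESC_F_USED) != 0;

	return avail == used && used == wrap;
}

/* Step to the next slot; flip the wrap counter after the last one. */
static inline void ring_advance(struct ring_pos *pos, uint16_t ring_size)
{
	assert(ring_size > 0);
	if (++pos->idx >= ring_size) {
		pos->idx = 0;
		pos->wrap = !pos->wrap;
	}
}
```

Because the meaning of the two bits inverts on every lap, a descriptor left over from the previous lap never looks available (or used) in the current one, which is what lets both sides poll the ring without a separate index.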

The goals of this layout are to 1. improve performance by avoiding cache
misses and 2. make the ring easier for devices to implement.

Regarding performance: with these patches I get 21.13 Mpps on my system
as compared to 18.8 Mpps with the virtio 1.0 code. Packet size was 64
bytes, 0.05% acceptable loss.  The test setup is as described in
http://dpdk.org/doc/guides/howto/pvp_reference_benchmark.html

Packet generator:
MoonGen
Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
Intel X710 NIC
RHEL 7.4

Device under test:
Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
Intel X710 NIC
RHEL 7.4

VM on DuT: RHEL7.4

I plan to do more performance tests with bigger frame sizes.

changes from v5->v6:
* fix VIRTQUEUE_DUMP macro
* rework mergeable rx buffer path, support out of order (not sure if I
  need a .next field to support chains) 
* move wmb in virtio_receive_pkts_packed() (Gavin)
* rename to virtio_init_split/_packed (Maxime)
* add support for ctrl virtqueues (Tiwei, thx Max for fixing)
* rework tx path to support update_packet_stats and
  virtqueue_xmit_offload, TODO: merge with split-ring code to
  avoid a lot of duplicate code
* remove unnecessary check that guarded the call to VIRTQUEUE_DUMP (Maxime)

changes from v4->v5:
* fix VIRTQUEUE_DUMP macro
* fix wrap counter logic in transmit and receive functions  

changes from v3->v4:
* added helpers to increment index and set available/used flags
* driver keeps track of number of descriptors used
* change logic in set_rxtx_funcs()
* add patch for ctrl virtqueue with support for packed virtqueues
* rename virtio-1.1.h to virtio-packed.h
* fix wrong sizeof() in "vhost: vring address setup for packed queues"
* fix coding style of function definition in "net/virtio: add packed
  virtqueue helpers"
* fix padding in vring_size()
* move patches to enable packed virtqueues end of series
* v4 has two open problems: I'm sending it out anyway for feedback/help:
 * when VIRTIO_NET_F_MRG_RXBUF is enabled only 128 packets are sent in
   the guest, i.e. until the ring is full for the first time. I suspect
   a bug in setting the avail/used flags

changes from v2->v3:
* implement event suppression
* add code to dump packed virtqueues
* don't use assert in vhost code
* rename virtio-user parameter to packed-vq
* support rxvq flush

changes from v1->v2:
* don't use VIRTQ_DESC_F_NEXT in used descriptors (Jason)
* no rte_panic() in guest triggerable code (Maxime)
* use unlikely when checking for vq (Maxime)
* rename everything from _1_1 to _packed  (Yuanhan)
* add two more patches to implement mergeable receive buffers


Jens Freimann (10):
  net/virtio: vring init for packed queues
  net/virtio: add packed virtqueue defines
  net/virtio: add packed virtqueue helpers
  net/virtio: flush packed receive virtqueues
  net/virtio: dump packed virtqueue data
  net/virtio: implement transmit path for packed queues
  net/virtio: implement receive path for packed queues
  net/virtio: add support for mergeable buffers with packed virtqueues
  net/virtio: add virtio send command packed queue support
  net/virtio: enable packed virtqueues by default

Yuanhan Liu (1):
  net/virtio-user: add option to use packed queues

 drivers/net/virtio/virtio_ethdev.c            | 135 ++++-
 drivers/net/virtio/virtio_ethdev.h            |   5 +
 drivers/net/virtio/virtio_pci.h               |   8 +
 drivers/net/virtio/virtio_ring.h              |  96 +++-
 drivers/net/virtio/virtio_rxtx.c              | 490 +++++++++++++++++-
 .../net/virtio/virtio_user/virtio_user_dev.c  |  10 +-
 .../net/virtio/virtio_user/virtio_user_dev.h  |   2 +-
 drivers/net/virtio/virtio_user_ethdev.c       |  14 +-
 drivers/net/virtio/virtqueue.c                |  21 +
 drivers/net/virtio/virtqueue.h                |  50 +-
 10 files changed, 796 insertions(+), 35 deletions(-)

Comments

Tiwei Bie Sept. 21, 2018, 12:32 p.m. UTC | #1
On Fri, Sep 21, 2018 at 12:32:57PM +0200, Jens Freimann wrote:
> This is a basic implementation of packed virtqueues as specified in the
> Virtio 1.1 draft. A compiled version of the current draft is available
> at https://github.com/oasis-tcs/virtio-docs.git (or as .pdf at
> https://github.com/oasis-tcs/virtio-docs/blob/master/virtio-v1.1-packed-wd10.pdf
> 
> A packed virtqueue is different from a split virtqueue in that it
> consists of only a single descriptor ring that replaces available and
> used ring, index and descriptor buffer.
> 
> Each descriptor is readable and writable and has a flags field. These flags
> will mark if a descriptor is available or used.  To detect new available descriptors
> even after the ring has wrapped, device and driver each have a
> single-bit wrap counter that is flipped from 0 to 1 and vice versa every time
> the last descriptor in the ring is used/made available.
> 
> The idea behind this is to 1. improve performance by avoiding cache misses
> and 2. be easier for devices to implement.
> 
> Regarding performance: with these patches I get 21.13 Mpps on my system
> as compared to 18.8 Mpps with the virtio 1.0 code. Packet size was 64

Did you enable multiqueue and use multiple cores on the
vhost side? If not, I guess the performance gain above
comes from the vhost side rather than the virtio side.

If you use more cores on the vhost side or the virtio side,
do you see any change in performance?

Did you do any performance tests with the kernel vhost-net
backend (with zero-copy enabled and disabled)? I think we
also need some performance data for these two cases. It
would also help us make sure that this works with the kernel
backends.

And for the "virtio-PMD + vhost-PMD" test cases, I think
we need the following performance data:

#1. The maximum 1 core performance of virtio PMD when using split ring.
#2. The maximum 1 core performance of virtio PMD when using packed ring.
#3. The maximum 1 core performance of vhost PMD when using split ring.
#4. The maximum 1 core performance of vhost PMD when using packed ring.

And then we can have a clear understanding of the
performance gain in DPDK with packed ring.

And FYI, the maximum 1 core performance of virtio PMD
can be obtained with the steps below:

1. Launch vhost-PMD with multiple queues, and use multiple
   CPU cores for forwarding.
2. Launch virtio-PMD with multiple queues and use 1 CPU
   core for forwarding.
3. Repeat the two steps above, adding more CPU cores
   for forwarding on the vhost-PMD side, until the
   performance stops increasing.
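
The procedure above amounts to a saturation search: keep the virtio side at one core and add vhost cores until throughput stops improving. A rough sketch of that loop follows. Everything in it is hypothetical: `measure_fn` stands for "run one benchmark with this many vhost forwarding cores and return Mpps", and `fake_measure()` returns synthetic numbers (not measurements) only so that the loop can be exercised without a testbed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook: one benchmark run with n vhost cores, in Mpps. */
typedef double (*measure_fn)(unsigned int vhost_cores, void *ctx);

struct saturation_result {
	unsigned int vhost_cores;	/* cores needed to saturate */
	double mpps;			/* throughput at that point */
};

/*
 * Add vhost cores one at a time and stop as soon as an extra core
 * improves throughput by less than min_gain (e.g. 0.01 for 1%).
 * The last value that still improved is taken as the maximum
 * 1 core performance of the virtio side.
 */
static struct saturation_result
find_virtio_1core_max(measure_fn measure, void *ctx,
		      unsigned int max_vhost_cores, double min_gain)
{
	struct saturation_result best = { 0, 0.0 };

	assert(measure != NULL);
	for (unsigned int n = 1; n <= max_vhost_cores; n++) {
		double mpps = measure(n, ctx);

		if (n > 1 && mpps < best.mpps * (1.0 + min_gain))
			break;		/* no meaningful increase anymore */
		best.vhost_cores = n;
		best.mpps = mpps;
	}
	return best;
}

/*
 * Synthetic stand-in, NOT real data: throughput grows by 5 Mpps per
 * vhost core until the single virtio core caps it at *ctx Mpps.
 */
static double fake_measure(unsigned int vhost_cores, void *ctx)
{
	double cap = *(double *)ctx;
	double mpps = 5.0 * vhost_cores;

	return mpps < cap ? mpps : cap;
}
```

Running the same search once with split rings and once with packed rings would give data points #1 and #2; swapping the roles of the two sides would give #3 and #4.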

Besides, I just took a quick look at the Tx implementation;
it still assumes that the descs will be written back in order
by the device. You can find more details in my comments on
that patch.
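
To illustrate what dropping the in-order assumption involves: each descriptor carries a buffer id, the device echoes that id when it marks the descriptor used, and the driver looks up its per-buffer state by id instead of by ring position. One common way to manage the ids is a free list, sketched below. This is an assumption about the general technique, not code from the patch under review; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ID_NONE UINT16_MAX		/* marks the end of the free list */

struct id_entry {
	void *cookie;			/* driver state, e.g. the packet buffer */
	uint16_t next;			/* next free id while on the free list */
};

struct id_table {
	struct id_entry *entries;
	uint16_t size;
	uint16_t free_head;		/* first free id, or ID_NONE */
};

/* Chain all ids into the free list: 0 -> 1 -> ... -> size - 1. */
static void id_table_init(struct id_table *t, struct id_entry *entries,
			  uint16_t size)
{
	assert(size > 0 && size < ID_NONE);
	t->entries = entries;
	t->size = size;
	t->free_head = 0;
	for (uint16_t i = 0; i < size; i++) {
		entries[i].cookie = NULL;
		entries[i].next = (uint16_t)(i + 1 < size ? i + 1 : ID_NONE);
	}
}

/* On enqueue: take a free id and remember the buffer under it. */
static uint16_t id_alloc(struct id_table *t, void *cookie)
{
	uint16_t id = t->free_head;

	if (id == ID_NONE)
		return ID_NONE;		/* no free id left */
	t->free_head = t->entries[id].next;
	t->entries[id].cookie = cookie;
	return id;			/* goes into the descriptor's id field */
}

/* On completion: the id comes from the used descriptor, in any order. */
static void *id_complete(struct id_table *t, uint16_t id)
{
	void *cookie;

	assert(id < t->size);
	cookie = t->entries[id].cookie;
	t->entries[id].cookie = NULL;
	t->entries[id].next = t->free_head;	/* push id back on the list */
	t->free_head = id;
	return cookie;
}
```

With this bookkeeping the driver can free the right buffer for each used descriptor regardless of the order in which the device writes them back.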

Thanks

Jens Freimann Sept. 21, 2018, 2:06 p.m. UTC | #2
On Fri, Sep 21, 2018 at 08:32:22PM +0800, Tiwei Bie wrote:
>On Fri, Sep 21, 2018 at 12:32:57PM +0200, Jens Freimann wrote:
>> This is a basic implementation of packed virtqueues as specified in the
>> Virtio 1.1 draft. A compiled version of the current draft is available
>> at https://github.com/oasis-tcs/virtio-docs.git (or as .pdf at
>> https://github.com/oasis-tcs/virtio-docs/blob/master/virtio-v1.1-packed-wd10.pdf
>>
>> A packed virtqueue is different from a split virtqueue in that it
>> consists of only a single descriptor ring that replaces available and
>> used ring, index and descriptor buffer.
>>
>> Each descriptor is readable and writable and has a flags field. These flags
>> will mark if a descriptor is available or used.  To detect new available descriptors
>> even after the ring has wrapped, device and driver each have a
>> single-bit wrap counter that is flipped from 0 to 1 and vice versa every time
>> the last descriptor in the ring is used/made available.
>>
>> The idea behind this is to 1. improve performance by avoiding cache misses
>> and 2. be easier for devices to implement.
>>
>> Regarding performance: with these patches I get 21.13 Mpps on my system
>> as compared to 18.8 Mpps with the virtio 1.0 code. Packet size was 64
>
>Did you enable multiple-queue and use multiple cores on
>vhost side? If not, I guess the above performance gain
>is the gain in vhost side instead of virtio side.

I tested several variations back then and they all looked very good.
But the code has changed a lot in the meantime and I need to do more
benchmarking in any case.

>
>If you use more cores on vhost side or virtio side, will
>you see any performance changes?
>
>Did you do any performance test with the kernel vhost-net
>backend (with zero-copy enabled and disabled)? I think we
>also need some performance data for these two cases. And
>it can help us to make sure that it works with the kernel
>backends.

I tested against vhost-kernel, but only to test functionality, not
to benchmark.
>
>And for the "virtio-PMD + vhost-PMD" test cases, I think
>we need below performance data:
>
>#1. The maximum 1 core performance of virtio PMD when using split ring.
>#2. The maximum 1 core performance of virtio PMD when using packed ring.
>#3. The maximum 1 core performance of vhost PMD when using split ring.
>#4. The maximum 1 core performance of vhost PMD when using packed ring.
>
>And then we can have a clear understanding of the
>performance gain in DPDK with packed ring.
>
>And FYI, the maximum 1 core performance of virtio PMD
>can be got in below steps:
>
>1. Launch vhost-PMD with multiple queues, and use multiple
>   CPU cores for forwarding.
>2. Launch virtio-PMD with multiple queues and use 1 CPU
>   core for forwarding.
>3. Repeat above two steps with adding more CPU cores
>   for forwarding in vhost-PMD side until we can't see
>   performance increase anymore.

Thanks for the suggestions, I'll come back with more
numbers.

>
>Besides, I just did a quick glance at the Tx implementation,
>it still assumes the descs will be written back in order
>by device. You can find more details from my comments on
>that patch.

Saw it and noted. I had hoped to be able to avoid the list but
I see no way around it now. 

Thanks for your review Tiwei!

regards,
Jens