[RFC,0/3] Add mdev (Mediated device) support in DPDK

Message ID 20190403071844.21126-1-tiwei.bie@intel.com

Message

Tiwei Bie April 3, 2019, 7:18 a.m. UTC
Hi everyone,

This is a draft implementation of mdev (Mediated device [1]) bus
support in DPDK. Mdev is a way to virtualize devices in the Linux
kernel. Depending on the device-api (mdev_type/device_api), there can
be different types of mdev devices (e.g. vfio-pci). In this RFC,
an mdev bus is introduced to scan the mdev devices in the system
and probe them based on their device-api.
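
As a rough illustration of the scan step, the device-api of one mdev
device can be read from sysfs as sketched below. This is only a sketch:
the function names are hypothetical and are not the helper added by the
patches. The sysfs root is a parameter (normally /sys/bus/mdev/devices)
so the code can be tried against any directory.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the one from the RFC): read the first line of
 * a sysfs attribute into buf, dropping the trailing newline.
 * Returns 0 on success, -1 on failure. */
int
sysfs_read_line(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path != NULL && buf != NULL && len > 0);

	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, (int)len, f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Build <root>/<uuid>/mdev_type/device_api and read it, e.g. "vfio-pci".
 * root is normally "/sys/bus/mdev/devices". */
int
mdev_read_device_api(const char *root, const char *uuid,
		     char *buf, size_t len)
{
	char path[512];
	int n;

	n = snprintf(path, sizeof(path), "%s/%s/mdev_type/device_api",
		     root, uuid);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return sysfs_read_line(path, buf, len);
}
```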

Take the mdev devices whose device-api is "vfio-pci" as an example:
in this RFC, these devices are probed by an mdev driver provided
by the PCI bus, which plugs them into the PCI bus. They are then
probed by the drivers registered on the PCI bus based on VendorID/
DeviceID/...

                     +----------+
                     | mdev bus |
                     +----+-----+
                          |
         +----------------+----+------+------+
         |                     |      |      |
   mdev_vfio_pci               ......
(device-api: vfio-pci)
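
The dispatch in the diagram above could look roughly like the sketch
below: each mdev driver declares the device-api it handles, and the bus
picks the driver whose string matches. All names here are hypothetical
and are not taken from the patches, and the probe function is only a
stand-in for the PCI bus hook that would plug the device into the PCI
bus.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical driver descriptor: one driver per device-api string. */
struct mdev_driver {
	const char *name;
	const char *device_api;
	int (*probe)(const char *uuid);
};

/* Stand-in for the PCI bus hook; it only reports success here. */
int
mdev_vfio_pci_probe(const char *uuid)
{
	(void)uuid;
	return 0;
}

/* Drivers registered on the (hypothetical) mdev bus. */
static const struct mdev_driver mdev_drivers[] = {
	{ "mdev_vfio_pci", "vfio-pci", mdev_vfio_pci_probe },
};

/* Return the driver handling device_api, or NULL if there is none. */
const struct mdev_driver *
mdev_find_driver(const char *device_api)
{
	size_t i;

	assert(device_api != NULL);
	for (i = 0; i < sizeof(mdev_drivers) / sizeof(mdev_drivers[0]); i++) {
		if (strcmp(mdev_drivers[i].device_api, device_api) == 0)
			return &mdev_drivers[i];
	}
	return NULL;
}

/* Probe one scanned device; -1 means no driver matched its device-api. */
int
mdev_probe_device(const char *uuid, const char *device_api)
{
	const struct mdev_driver *drv = mdev_find_driver(device_api);

	if (drv == NULL)
		return -1;
	return drv->probe(uuid);
}
```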

There are also other ways to add mdev device support in DPDK (e.g.
let PCI bus scan /sys/bus/mdev/devices directly). Comments would be
appreciated!

[1] https://github.com/torvalds/linux/blob/master/Documentation/vfio-mediated-device.txt

Thanks,
Tiwei

Tiwei Bie (3):
  eal: add a helper for reading string from sysfs
  bus/mdev: add mdev bus support
  bus/pci: add mdev support

 config/common_base                        |   5 +
 config/common_linux                       |   1 +
 drivers/bus/Makefile                      |   1 +
 drivers/bus/mdev/Makefile                 |  41 +++
 drivers/bus/mdev/linux/Makefile           |   6 +
 drivers/bus/mdev/linux/mdev.c             | 117 ++++++++
 drivers/bus/mdev/mdev.c                   | 310 ++++++++++++++++++++++
 drivers/bus/mdev/meson.build              |  15 ++
 drivers/bus/mdev/private.h                |  90 +++++++
 drivers/bus/mdev/rte_bus_mdev.h           | 141 ++++++++++
 drivers/bus/mdev/rte_bus_mdev_version.map |  12 +
 drivers/bus/meson.build                   |   2 +-
 drivers/bus/pci/Makefile                  |   3 +
 drivers/bus/pci/linux/Makefile            |   4 +
 drivers/bus/pci/linux/pci_vfio.c          |  35 ++-
 drivers/bus/pci/linux/pci_vfio_mdev.c     | 305 +++++++++++++++++++++
 drivers/bus/pci/meson.build               |   4 +-
 drivers/bus/pci/pci_common.c              |  17 +-
 drivers/bus/pci/private.h                 |   9 +
 drivers/bus/pci/rte_bus_pci.h             |  11 +-
 lib/librte_eal/common/eal_filesystem.h    |   7 +
 lib/librte_eal/freebsd/eal/eal.c          |  22 ++
 lib/librte_eal/linux/eal/eal.c            |  22 ++
 lib/librte_eal/rte_eal_version.map        |   1 +
 mk/rte.app.mk                             |   1 +
 25 files changed, 1163 insertions(+), 19 deletions(-)
 create mode 100644 drivers/bus/mdev/Makefile
 create mode 100644 drivers/bus/mdev/linux/Makefile
 create mode 100644 drivers/bus/mdev/linux/mdev.c
 create mode 100644 drivers/bus/mdev/mdev.c
 create mode 100644 drivers/bus/mdev/meson.build
 create mode 100644 drivers/bus/mdev/private.h
 create mode 100644 drivers/bus/mdev/rte_bus_mdev.h
 create mode 100644 drivers/bus/mdev/rte_bus_mdev_version.map
 create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c

Comments

Alejandro Lucero April 8, 2019, 8:44 a.m. UTC | #1
Hi Tiwei,

Thanks for the patchset. I was close to sending a patchset with the same
mdev support, but I'm glad to see yours first because it is interesting to
see another view of how to implement this.

After going through your patch I was a bit confused about how the mdev
device to mdev driver match was done. But then I realized the approach you
are following is different from my implementation, likely due to having
different purposes. If I understand the idea behind it, you want to have the
same PCI PMD drivers working with PCI devices created from mediated
devices. That is the reason there is just one mdev driver, the one for
the vfio-pci mediated device type.

My approach was different: I thought having specific PMD mdev support was
necessary, with the PMD required to register an mdev driver. I can see,
after reading your patch, that it is perfectly possible to have the same
PMDs for "pure" PCI devices and PCI devices made from mediated devices, and
if a PMD needs to do something different due to the mediated device's
intrinsics, then to support that explicitly per PMD. I have specific ioctl
calls between the PMD and the mediating driver, but this can also be done
with your approach.

I'm working on having a mediated PF, which is a different purpose than the
Intel scalable I/O idea, so I will merge this patchset with my code and see
if it works.

Thanks!


Tiwei Bie April 8, 2019, 9:36 a.m. UTC | #2
On Mon, Apr 08, 2019 at 09:44:07AM +0100, Alejandro Lucero wrote:
> Hi Tiwei,
> 
> Thanks for the patchset. I was close to sending a patchset with the same mdev
> support, but I'm glad to see yours first because it is interesting to see
> another view of how to implement this.
>
> After going through your patch I was a bit confused about how the mdev device
> to mdev driver match was done. But then I realized the approach you are
> following is different from my implementation, likely due to having different
> purposes. If I understand the idea behind it, you want to have the same PCI
> PMD drivers working with PCI devices created from mediated devices.

Exactly!

> That
> is the reason there is just one mdev driver, the one for the vfio-pci
> mediated device type.
>
> My approach was different: I thought having specific PMD mdev support was
> necessary, with the PMD required to register an mdev driver. I can see, after
> reading your patch, that it is perfectly possible to have the same PMDs for
> "pure" PCI devices and PCI devices made from mediated devices, and if a PMD
> needs to do something different due to the mediated device's intrinsics, then
> to support that explicitly per PMD. I have specific ioctl calls between the
> PMD and the mediating driver, but this can also be done with your approach.
>
> I'm working on having a mediated PF, which is a different purpose than the
> Intel scalable I/O idea, so I will merge this patchset with my code and see
> if it works.

Cool! Thanks!

Francois Ozog April 10, 2019, 10:02 a.m. UTC | #3
Hi all,

I presented an approach at FOSDEM
(https://archive.fosdem.org/2018/schedule/event/netmdev/) and I am
happy to see someone picking it up.

If we step back a little, the mdev concept is to give userland direct
control over the hardware data path of a device that is still
controlled by the kernel.
From a code base perspective, this can shrink PMD code size
significantly: only 10% of the PMD code is actual data path, the
rest being device control!
The concept is perfect for DPDK, SPDK and many other scenarios (AI
accelerators).
Should the work be triggered by the DPDK community, it should be
applicable to a broader set of communities: SPDK, VPP, ODP, AF_XDP....

We ran into many complexities in sharing a device between kernel and
userland, particularly when a single PCI device controls two ports.
So let's assume we try to solve a subset of the cases: coherent IO
memory and a dedicated PCI space (by whatever mechanism) per port.

What are the "things to solve"?

1) enumeration: enumerating and "capturing" an mdev device (this patchset, I assume)
2) bifurcation: designating the queues to capture in userland (possibly
all of them) with a hardware-driven rule (flow director or something more generic)
3) memory management: dealing with rings and buffer management on the rx
and tx paths

The bifurcation can be as simple as "all queues in userland", or quite
rich: TCP port 80 goes to userland while the rest (ICMP...) goes to the
kernel. If the kernel gets some of the traffic, there will be a routing
information sharing problem to solve. We ran a few experiments here.
The conclusion is that it's doable, but many corner cases make it a lot
of work. It would also be nice if the queue selection could be made very
generic (and not tied to flow director).
Let's say this is for further study for now.

Let's focus on memory management of VFIO exposed devices.
I haven't refreshed my knowledge of the VFIO framework so you may want
to correct a few points...
First of all, DPDK is made to switch packets, particularly between ports.
With VFIO, this means all devices are in the same IOVA space, which
is tricky to implement in the kernel.
There are a few strategies to do that, all requiring significant mdev
extensions and more probably a kernel infrastructure change. The good
news is that it can be done in such a way that only selected drivers
implement the change, without requiring all drivers to be touched.
Another big question is: does the kernel allocate the memory and
userland get a mapping to it, or does userland allocate the memory
and the kernel just maintain the IOVA mapping?
I would favor kernel allocation, with userland getting a mapping to it (in
the unified IOVA space). One reason is that the memory allocation strategy
can be very different from hardware to hardware:
- driver allocates packet buffers and populates a single ring of packets per queue
- driver allocates packet buffers of different sizes and populates
multiple rings per queue (for instance rings of 128, 256, 1024, 2048
byte arrays per queue)
- driver allocates an unstructured memory area (say 32MB) and gives it
to hardware (no prepopulation of rings).
So the userland framework (DPDK, SPDK, ODP, VPP, AF_XDP,
proprietary...) can just query the kernel driver, which knows what
has to be done for the device, for its queues and rings. The userland
framework just has to create the relevant objects (queues, rings,
packet buffers) from the information the kernel provides.

Exposing VFIO devices to DPDK and other frameworks is a major topic,
and I suggest that, at the same time as enumeration is done, a broader
discussion on the data path itself happens.
The data path discussion is about memory management (above) and packet
descriptors. Exposing hardware dependent structures in userland is
not the most widely accepted wisdom.
So I would rather assume that hardware natively produces hardware, vendor
and OS independent descriptors. Candidates can be: DPDK mbuf, VPP vlib_buf
or virtio 1.1. I would favor a packet descriptor that supports a
combination of inline offloads (VxLAN + IPSec + TSO...): if virtio
1.1 could be extended with some DPDK mbuf fields that would be perfect
;-) That looks like science fiction, but I know that on some smartNICs
and other hardware, the hardware-produced packet descriptor format can be
flexible....

Cheers

FF






--
François-Frédéric Ozog | Director Linaro Edge & Fog Computing Group
T: +33.67221.6485
francois.ozog@linaro.org | Skype: ffozog