[dpdk-dev,v4,11/12] virtio: Add QTest support for virtio-net PMD

Message ID 1457512409-24403-12-git-send-email-mukawa@igel.co.jp (mailing list archive)
State Superseded, archived
Delegated to: Yuanhan Liu

Commit Message

Tetsuya Mukawa March 9, 2016, 8:33 a.m. UTC
  The patch adds a new virtio-net PMD configuration that allows the PMD to
work on the host as if it were running in a VM.
Here is the new configuration option for the virtio-net PMD.
 - CONFIG_RTE_VIRTIO_VDEV_QTEST
To use this mode, EAL needs to map all hugepages as one file. Also, the file
should be mapped between (1 << 31) and (1 << 44), and the start address
should be aligned to the EAL memory size.

To allocate memory as described above, use the options below.
 --single-file
 --range-virtaddr=0x80000000-0x100000000000
 --align-memsize
If a free region cannot be found, EAL will return an error.

To prepare a virtio-net device on the host, the user needs to invoke a QEMU
process in the special QTest mode. This mode is mainly used for testing QEMU
devices from an outside process. In this mode, no guest code runs.
Here is the QEMU command line.

 $ qemu-system-x86_64 \
     -machine pc-i440fx-1.4,accel=qtest \
     -display none -qtest-log /dev/null \
     -qtest unix:/tmp/socket,server \
     -netdev type=tap,script=/etc/qemu-ifup,id=net0,queues=1 \
     -device virtio-net-pci,netdev=net0,mq=on,disable-modern=false,addr=3 \
     -chardev socket,id=chr1,path=/tmp/ivshmem,server \
     -device ivshmem,size=1G,chardev=chr1,vectors=1,addr=4

 * Use QEMU-2.5.1 or above.
 * One QEMU process is needed per port.
 * Only virtio-1.0 devices are supported.
 * The vhost backends like vhost-net and vhost-user can be specified.
 * In most cases, just using the above command is enough, but you can also
   specify other QEMU virtio-net options such as the MAC address.
 * Only the "pc-i440fx-1.4" machine has been checked, but it may work with
   other machines.
 * Do not add "--enable-kvm" to the QEMU command line.

After invoking QEMU, the PMD can connect to the QEMU process using unix
domain sockets. Over these sockets, the virtio-net, ivshmem and piix3
devices in QEMU are probed by the PMD.
Here is an example command line.

 $ testpmd -c f -n 1 -m 1024 --no-pci --single-file \
      --range-virtaddr=0x80000000-0x100000000000 --align-memsize \
      --vdev="eth_qtest_virtio0,qtest=/tmp/socket,ivshmem=/tmp/ivshmem"\
      -- --disable-hw-vlan --txqflags=0xf00 -i

Please specify the same unix domain sockets and memory size in both the QEMU
and DPDK command lines as above.
The shared memory size should be a power of 2, because ivshmem only
accepts such sizes.

Signed-off-by: Tetsuya Mukawa <mukawa@igel.co.jp>
---
 drivers/net/virtio/qtest.h         |  55 +++++
 drivers/net/virtio/virtio_ethdev.c | 457 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 501 insertions(+), 11 deletions(-)
  

Comments

Tetsuya Mukawa June 2, 2016, 3:29 a.m. UTC | #1
The patches work on top of the patch series below.
 - [PATCH v5 0/8] virtio support for container

It seems his implementation will change a bit.
So, this patch series is also going to be changed to follow his implementation.


[Changes]
v5 changes:
 - Rebase on latest dpdk-next-virtio.
 - Follow Jianfeng's implementation to support a virtual virtio-net device.
 - Split the patch series as follows.
   - This patch series.
     Only supports basic functions.
     The functions to handle the LSC interrupt and '--range-virtaddr' were
     removed from this patch series.
     This patch needs EAL memory mapped between (1<<31) and (1<<44).
     To allocate such memory, it is assumed the user will use '--base-virtaddr'.
     If appropriate memory cannot be allocated, this PMD will exit with an error.
     The user can then try other values.
   - Supplement patches to support the link status interrupt.
   - Supplement patches to support '--range-virtaddr'.
     This EAL option helps to allocate memory mapped between (1<<31) and
     (1<<44).

v4 changes:
 - Rebase on latest master.
 - Split patches.
 - To abstract the qtest code further, change the interface between the current
   virtio code and the qtest code.
 - Rename qtest.c to qtest_utils.c.
 - Change the implementation as below.
   - Set PCI device information outside of the qtest abstraction, then pass it
     to qtest to initialize devices.
 - Remove redundant condition checking from qtest_raw_send/recv().
 - Fix return value of qtest_raw_send().

v3 changes:
 - Rebase on latest master.
 - remove "-qtest-virtio" option, then add "--range-virtaddr" and
   "--align-memsize" options.
 - Fix typos in qtest.c

v2 changes:
 - Rebase on the above patch series.
 - Rebase on master.
 - Add "--qtest-virtio" EAL option.
 - Fixes in qtest.c:
  - Fix error handling for the case where the qtest connection is closed.
  - Use eventfd for interrupt messaging.
  - Use the Linux header for PCI register definitions.
  - Fix qtest_raw_send/recv to handle errors correctly.
  - Fix the bit mask of PCI_CONFIG_ADDR.
  - Describe the memory and ioport usage of the qtest guest in qtest.c.
  - Remove the loop for finding PCI devices.


[Abstraction]

Normally, the virtio-net PMD only works in a VM, because there is no virtio-net device on the host.
These patches extend the virtio-net PMD so that it can work on the host as a virtual PMD.
But we didn't implement a virtio-net device as a part of the virtio-net PMD.
To prepare a virtio-net device for the PMD, start a QEMU process in the special QTest mode, then connect to it from the virtio-net PMD through a unix domain socket.

The PMD can connect to any backend a QEMU virtio-net device can.
For example, the PMD can connect to the vhost-net kernel module or a vhost-user backend application.
As with the virtio-net PMD in a QEMU guest, the memory of the application that uses the virtio-net PMD will be shared with the vhost backend application.
But the vhost backend application's memory will not be shared.

The main target of this PMD is containers such as docker, rkt and lxc.
We can isolate the related processes (the virtio-net PMD process, QEMU and the vhost-user backend process) in containers.
But, to communicate through the unix domain sockets, a shared directory will be needed.


[How to use]

 Please use QEMU-2.5.1 or above.
 (So far, QEMU-2.5.1 hasn't been released yet, so please check out master from the QEMU repository.)

 - Compile
 Set "CONFIG_RTE_VIRTIO_VDEV_QTEST=y" in config/common_linuxapp.
 Then compile it.

 - Start QEMU as below.
 $ qemu-system-x86_64 \
              -machine pc-i440fx-1.4,accel=qtest \
              -display none -qtest-log /dev/null \
              -qtest unix:/tmp/socket,server \
              -netdev type=tap,script=/etc/qemu-ifup,id=net0,queues=1 \
              -device virtio-net-pci,netdev=net0,mq=on,disable-modern=false,addr=3 \
              -chardev socket,id=chr1,path=/tmp/ivshmem,server \
              -device ivshmem,size=1G,chardev=chr1,vectors=1,addr=4

 - Start a DPDK application as below.
 $ testpmd -c f -n 1 -m 1024 --no-pci --base-virtaddr=0x400000000 \
             --vdev="eth_virtio_qtest0,qtest=/tmp/socket,ivshmem=/tmp/ivshmem"\
             -- --disable-hw-vlan --txqflags=0xf00 -i

(*1) Please specify the same memory size in the QEMU and DPDK command lines.
(*2) Use QEMU-2.5.1 or above.
(*3) One QEMU process is needed per port.
(*4) Only virtio-1.0 devices are supported.
(*5) The vhost backends like vhost-net and vhost-user can be specified.
(*6) In most cases, just using the above command is enough, but you can also
     specify other QEMU virtio-net options.
(*7) Only the "pc-i440fx-1.4" machine has been checked, but it may work with
     other machines. It depends on whether the machine has a piix3 south
     bridge. If the machine doesn't have one, the virtio-net PMD cannot
     receive link status change interrupts.
(*8) Do not add "--enable-kvm" to the QEMU command line.


[Detailed Description]

 - virtio-net device implementation
The PMD uses the QEMU virtio-net device. To do that, the QEMU QTest functionality is used.
QTest is a test framework for QEMU devices. It allows us to implement a device driver outside of QEMU.
With QTest, we can implement a DPDK application and the virtio-net PMD as a standalone process on the host.
When QEMU is invoked in QTest mode, no guest code runs.
To learn more about QTest, see below.
http://wiki.qemu.org/Features/QTest
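
For illustration only (the exact wire format is defined by QEMU's qtest code,
not by this patch series), the protocol spoken over the QTest socket is a
simple line-oriented text protocol. A hypothetical exchange for one I/O port
write and one read looks roughly like this:

 PMD  -> QEMU: outl 0xcf8 0x80001800
 QEMU -> PMD : OK
 PMD  -> QEMU: inl 0xcfc
 QEMU -> PMD : OK 0x10001af4

Together these two accesses read the PCI vendor/device ID of the device in
slot 3 of bus 0 (the virtio-net device in the example command line above);
0x10001af4 corresponds to device ID 0x1000 and vendor ID 0x1af4.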

 - probing devices
QTest provides a unix domain socket. Through this socket, the driver process can access the I/O ports and memory of the QEMU virtual machine.
The PMD sends I/O port accesses to probe the PCI devices.
If the virtio-net and ivshmem devices are found, the PMD initializes them.
Also, the I/O port accesses of the virtio-net PMD are sent through the socket, so the virtio-net PMD can initialize the virtio-net device on QEMU correctly.
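
Below is a rough, illustrative sketch (not the qtest_utils API of this patch
series) of what such a probe looks like from the driver side: a 32-bit PCI
configuration read done through the legacy 0xcf8/0xcfc ports over the QTest
socket. Error handling and line-buffered reads are simplified.

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>
 #include <unistd.h>

 /* Read a PCI config dword of bus 0, device 'dev', function 0 through an
  * already connected QTest unix domain socket. A real implementation would
  * read until '\n' and check every return value. */
 static uint32_t
 qtest_pci_conf_read32(int qtest_fd, int dev, int reg)
 {
         char buf[64];
         ssize_t len;
         unsigned int val = 0;

         /* select the config register via the address port 0xcf8 */
         snprintf(buf, sizeof(buf), "outl 0xcf8 0x%x\n",
                  0x80000000u | (dev << 11) | (reg & 0xfc));
         write(qtest_fd, buf, strlen(buf));
         read(qtest_fd, buf, sizeof(buf));       /* reply is "OK" */

         /* read the selected register back through the data port 0xcfc */
         write(qtest_fd, "inl 0xcfc\n", strlen("inl 0xcfc\n"));
         len = read(qtest_fd, buf, sizeof(buf) - 1);
         if (len > 0) {
                 buf[len] = '\0';
                 sscanf(buf, "OK 0x%x", &val);   /* reply is "OK 0x<value>" */
         }
         return (uint32_t)val;
 }

Probing then boils down to checking that slot 3 returns the virtio-net
vendor/device ID and slot 4 returns the ivshmem ID before the BARs are
programmed.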

 - ivshmem device to share memory
To share the memory that the virtio-net PMD process uses, the ivshmem device is used.
Because the ivshmem device can only handle one file descriptor, the shared memory should consist of one file.
To allocate such memory, EAL has a new option called "--single-file".
Also, the hugepages should be mapped between "1 << 31" and "1 << 44".
To map them in that range, use the '--base-virtaddr' option.
While initializing the ivshmem device, we can set its BAR (Base Address Register).
It determines at which guest address the QEMU vcpus can access this shared memory.
We specify the host virtual address of the shared memory as this address.
This is very useful because we don't need to patch QEMU to calculate an address offset.
(For example, if the virtio-net PMD process allocates memory from the shared memory and writes its virtual address into a virtio-net register, the QEMU virtio-net device can interpret it without calculating an address offset.)
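
A minimal sketch of this trick (the helper below is assumed to exist as the
write counterpart of the config-read sketch earlier; it is not the API of this
patch series): the shared memory BAR of the ivshmem device, assumed here to be
the 64-bit BAR2 at config offsets 0x18/0x1c, is programmed with the host
virtual address at which the PMD already mapped the shared memory.

 #include <stdint.h>

 /* assumed helper, analogous to qtest_pci_conf_read32() sketched above */
 void qtest_pci_conf_write32(int qtest_fd, int dev, int reg, uint32_t val);

 /* Map the ivshmem shared memory at the PMD's own virtual address, so that
  * guest physical address == host virtual address for that region. */
 static void
 ivshmem_map_at_vaddr(int qtest_fd, int ivshmem_dev, void *shm_vaddr)
 {
         uint64_t bar = (uint64_t)(uintptr_t)shm_vaddr;

         qtest_pci_conf_write32(qtest_fd, ivshmem_dev, 0x18, (uint32_t)bar);
         qtest_pci_conf_write32(qtest_fd, ivshmem_dev, 0x1c,
                                (uint32_t)(bar >> 32));
 }

After this, an address that the PMD writes into a vring is valid as-is from
the device's point of view, which is exactly why no offset calculation is
needed.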



Tetsuya Mukawa (6):
  virtio, qtest: Add QTest utility basic functions
  virtio, qtest: Add pci device initialization function to qtest utils
  virtio, qtest: Add functionality to share memory between QTest guest
  virtio, qtest: Add misc functions to handle pci information
  virtio: Add QTest support to vtpci abstraction
  virtio: Add QTest support for virtio-net PMD

 config/common_linuxapp                             |    2 +
 drivers/net/virtio/Makefile                        |    6 +
 drivers/net/virtio/virtio_ethdev.c                 |    3 +-
 drivers/net/virtio/virtio_ethdev.h                 |    1 +
 drivers/net/virtio/virtio_qtest/qtest.h            |   95 ++
 drivers/net/virtio/virtio_qtest/qtest_utils.c      | 1087 ++++++++++++++++++++
 drivers/net/virtio/virtio_qtest/qtest_utils.h      |  289 ++++++
 drivers/net/virtio/virtio_qtest/virtio_qtest_dev.c |  393 +++++++
 drivers/net/virtio/virtio_qtest/virtio_qtest_dev.h |   42 +
 drivers/net/virtio/virtio_qtest/virtio_qtest_pci.c |  407 ++++++++
 drivers/net/virtio/virtio_qtest/virtio_qtest_pci.h |   39 +
 drivers/net/virtio/virtqueue.h                     |    6 +-
 12 files changed, 2365 insertions(+), 5 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_qtest/qtest.h
 create mode 100644 drivers/net/virtio/virtio_qtest/qtest_utils.c
 create mode 100644 drivers/net/virtio/virtio_qtest/qtest_utils.h
 create mode 100644 drivers/net/virtio/virtio_qtest/virtio_qtest_dev.c
 create mode 100644 drivers/net/virtio/virtio_qtest/virtio_qtest_dev.h
 create mode 100644 drivers/net/virtio/virtio_qtest/virtio_qtest_pci.c
 create mode 100644 drivers/net/virtio/virtio_qtest/virtio_qtest_pci.h
  
Tetsuya Mukawa June 2, 2016, 3:30 a.m. UTC | #2
These patches add LSC interrupt handling for virtio-qtest.
They should be applied on top of the patches below.
 - [PATCH v5 0/6] Virtio-net PMD: QEMU QTest extension for container

To support LSC interrupts, the vtpci abstraction was expanded to handle interrupts from
PCI devices.
Actually, this PMD handles a virtual virtio-net device, so handling interrupts
is a bit different from actual PCI devices. In this case, all interrupts come
from the unix domain socket connected to QEMU.


Tetsuya Mukawa (2):
  virtio: Handle interrupt things under vtpci abstraction
  virtio, qtest: Add functionality to handle interrupt

 drivers/net/virtio/virtio_ethdev.c                 |  17 +-
 drivers/net/virtio/virtio_pci.c                    |  86 +++++---
 drivers/net/virtio/virtio_pci.h                    |   7 +
 drivers/net/virtio/virtio_qtest/qtest.h            |   3 +-
 drivers/net/virtio/virtio_qtest/qtest_utils.c      | 225 ++++++++++++++++++++-
 drivers/net/virtio/virtio_qtest/qtest_utils.h      |  68 ++++++-
 drivers/net/virtio/virtio_qtest/virtio_qtest_dev.c |  23 ++-
 drivers/net/virtio/virtio_qtest/virtio_qtest_pci.c |  64 ++++--
 8 files changed, 432 insertions(+), 61 deletions(-)
  
Yuanhan Liu June 2, 2016, 7:31 a.m. UTC | #3
On Thu, Jun 02, 2016 at 12:29:39PM +0900, Tetsuya Mukawa wrote:
> The patches will work on below patch series.
>  - [PATCH v5 0/8] virtio support for container
> 
> It seems his implementation will be changed a bit.
> So, this patch series are also going to be changed to follow his implementation.

Hi Tetsuya,

TBH, I was considering rejecting your v4: the code was quite messy. But
this v5 changed my mind a bit: it's much cleaner.

But still, I'd ask do we really need 2 virtio for container solutions?

That leads to the same question that I'm sure you have already
answered before: in which way does your solution outweigh Jianfeng's?

The reasons I want to ask again are: 1) I wasn't actively participating in
the discussion during the last release, besides some common comments on virtio;
2) maybe it's time to make a decision: should we take one solution
only, and if so, which one, or should we take both?

Thomas is Cc'ed; hope he can help with the decision making.

	--yliu
  
Tetsuya Mukawa June 2, 2016, 9:30 a.m. UTC | #4
Hi Yuanhan,

On 2016/06/02 16:31, Yuanhan Liu wrote:
> But still, I'd ask do we really need 2 virtio for container solutions?

I appreciate your comments.
Let me have time to discuss it with our team.

Thanks,
Tetsuya
  
Yuanhan Liu June 3, 2016, 4:17 a.m. UTC | #5
On Thu, Jun 02, 2016 at 06:30:18PM +0900, Tetsuya Mukawa wrote:
> Hi Yuanhan,
> 
> On 2016/06/02 16:31, Yuanhan Liu wrote:
> > But still, I'd ask do we really need 2 virtio for container solutions?
> 
> I appreciate your comments.

No, I appreciate your effort for contributing to DPDK! vhost-pmd stuff
is just brilliant!

> Let me have time to discuss it with our team.

I'm wondering whether we could have one solution only. IMO, the drawback of
having two (quite different) solutions might outweigh the benefit
they bring. Say, it might just confuse users.

OTOH, I'm wondering whether you could adapt to Jianfeng's solution? If not,
what are the missing parts, and could we fix them? I'm thinking having
one unified solution will keep our energy/focus on one thing, making
it better and better! Having two just splits the energy; it also
introduces an extra maintenance burden.

	--yliu
  
Thomas Monjalon June 3, 2016, 1:51 p.m. UTC | #6
2016-06-03 12:17, Yuanhan Liu:
> On Thu, Jun 02, 2016 at 06:30:18PM +0900, Tetsuya Mukawa wrote:
> > Hi Yuanhan,
> > 
> > On 2016/06/02 16:31, Yuanhan Liu wrote:
> > > But still, I'd ask do we really need 2 virtio for container solutions?
> > 
> > I appreciate your comments.
> 
> No, I appreciate your effort for contributing to DPDK! vhost-pmd stuff
> is just brilliant!
> 
> > Let me have time to discuss it with our team.
> 
> I'm wondering could we have one solution only. IMO, the drawback of
> having two (quite different) solutions might outweighs the benefit
> it takes. Say, it might just confuse user.

+1

> OTOH, I'm wondering could you adapt to Jianfeng's solution? If not,
> what's the missing parts, and could we fix it? I'm thinking having
> one unified solution will keep ours energy/focus on one thing, making
> it better and better! Having two just splits the energy; it also
> introduces extra burden for maintaining.

+1
  
Tetsuya Mukawa June 6, 2016, 5:10 a.m. UTC | #7
Hi Yuanhan,

Sorry for the late reply.

On 2016/06/03 13:17, Yuanhan Liu wrote:
> On Thu, Jun 02, 2016 at 06:30:18PM +0900, Tetsuya Mukawa wrote:
>> Hi Yuanhan,
>>
>> On 2016/06/02 16:31, Yuanhan Liu wrote:
>>> But still, I'd ask do we really need 2 virtio for container solutions?
>>
>> I appreciate your comments.
> 
> No, I appreciate your effort for contributing to DPDK! vhost-pmd stuff
> is just brilliant!
> 
>> Let me have time to discuss it with our team.
> 
> I'm wondering could we have one solution only. IMO, the drawback of
> having two (quite different) solutions might outweighs the benefit
> it takes. Say, it might just confuse user.

I agree with this.
If we have 2 solutions, it would confuse the DPDK users.

> 
> OTOH, I'm wondering could you adapt to Jianfeng's solution? If not,
> what's the missing parts, and could we fix it? I'm thinking having
> one unified solution will keep ours energy/focus on one thing, making
> it better and better! Having two just splits the energy; it also
> introduces extra burden for maintaining.

Of course, I could basically adopt Jianfeng's solution.
Actually, his solution is quite similar to what I tried to implement at first.

I guess here are the pros/cons of the 2 solutions.

[Jianfeng's solution]
- Pros
Doesn't need to invoke a QEMU process.
- Cons
If the virtio-net specification is changed, we need to implement it by
ourselves. Also, LSC interrupt and control queue functions are not
supported yet.
I agree both functions may not be so important, and if we need them
we can implement them, but we need to spend energy implementing them.

[My solution]
- Pros
The basic principle of my implementation is not to reinvent the wheel.
We can use the virtio-net device of the QEMU implementation, which means we
don't need to maintain a virtio-net device by ourselves, and we can use all of
the functions supported by the QEMU virtio-net device.
- Cons
Needs to invoke a QEMU process.


Anyway, we can choose one of the options below.
1. Take advantage of invoking fewer processes.
2. Take advantage of the maintainability of the virtio-net device.

Honestly, I'm OK if my solution is not merged.
The decision should simply be made so as to make DPDK better.

What do you think?
Which is better for DPDK?

Thanks,
Tetsuya

> 
> 	--yliu
>
  
Yuanhan Liu June 6, 2016, 7:21 a.m. UTC | #8
On Mon, Jun 06, 2016 at 02:10:46PM +0900, Tetsuya Mukawa wrote:
> Hi Yuanhan,
> 
> Sorry for late replying.

Never mind.

> 
> On 2016/06/03 13:17, Yuanhan Liu wrote:
> > On Thu, Jun 02, 2016 at 06:30:18PM +0900, Tetsuya Mukawa wrote:
> >> Hi Yuanhan,
> >>
> >> On 2016/06/02 16:31, Yuanhan Liu wrote:
> >>> But still, I'd ask do we really need 2 virtio for container solutions?
> >>
> >> I appreciate your comments.
> > 
> > No, I appreciate your effort for contributing to DPDK! vhost-pmd stuff
> > is just brilliant!
> > 
> >> Let me have time to discuss it with our team.
> > 
> > I'm wondering could we have one solution only. IMO, the drawback of
> > having two (quite different) solutions might outweighs the benefit
> > it takes. Say, it might just confuse user.
> 
> I agree with this.
> If we have 2 solutions, it would confuse the DPDK users.
> 
> > 
> > OTOH, I'm wondering could you adapt to Jianfeng's solution? If not,
> > what's the missing parts, and could we fix it? I'm thinking having
> > one unified solution will keep ours energy/focus on one thing, making
> > it better and better! Having two just splits the energy; it also
> > introduces extra burden for maintaining.
> 
> Of course, I adopt Jiangeng's solution basically.
> Actually, his solution is almost similar I tried to implement at first.
> 
> I guess here is pros/cons of 2 solutions.
> 
> [Jianfeng's solution]
> - Pros
> Don't need to invoke QEMU process.
> - Cons
> If virtio-net specification is changed, we need to implement it by
> ourselves. Also, LSC interrupt and control queue functions are not
> supported yet.

Jianfeng has made and sent out the patch to enable ctrl queue and
multiple queue support.

For the LSC part, I have no clear idea yet so far. But I'm assuming it will
not take too much effort, either.

> I agree both functions may not be so important, and if we need it
> we can implement them, but we need to pay energy to implement them.
> 
> [My solution]
> - Pros
> Basic principle of my implementation is not to reinvent the wheel.

Yes, that's a good point. However, it's not as hard as we would have
thought at first: the tough part, dequeuing/enqueuing packets
from/to the vring, is actually offloaded to DPDK vhost-user. That means we
only need to re-implement the control path of the virtio-net device, plus the
vhost-user frontend. If you have a detailed look at your patchset as
well as Jianfeng's, you might find that the two patchsets are actually about
the same code size.

> We can use a virtio-net device of QEMU implementation, it means we don't
> need to maintain virtio-net device by ourselves, and we can use all of
> functions supported by QEMU virtio-net device.
> - Cons
> Need to invoke QEMU process.

Another thing is that it makes the usage a bit harder: look at the
long QEMU cli options of your example usage. It also has some traps,
say, "--enable-kvm" is not allowed, which is a default option used
with QEMU.

And judging that it actually doesn't take too much effort to implement
a virtio device emulation, I'd prefer it slightly. I guess something
lightweight and easier to use is more important here.

Actually, I have foreseen another benefit of adding virtio-user device
emulation: we now might be able to add a rte_vhost_dequeue/enqueue_burst()
unit test case. We simply couldn't do it before, since we depended on QEMU
for testing, which is not acceptable for a unit test case. Making it
a unit test case would help us spot any bad changes that would
introduce bugs, easily and automatically.

	--yliu

> Anyway, we can choose one of belows.
> 1. Take advantage of invoking less processes.
> 2. Take advantage of maintainability of virtio-net device.
> 
> Honestly, I'm OK if my solution is not merged.
> Thus, it should be decided to let DPDK better.
> 
> What do you think?
> Which is better for DPDK?
> 
> Thanks,
> Tetsuya
> 
> > 
> > 	--yliu
> >
  
Jianfeng Tan June 6, 2016, 8:03 a.m. UTC | #9
Hi,


On 6/6/2016 1:10 PM, Tetsuya Mukawa wrote:
> Hi Yuanhan,
>
> Sorry for late replying.
>
> On 2016/06/03 13:17, Yuanhan Liu wrote:
>> On Thu, Jun 02, 2016 at 06:30:18PM +0900, Tetsuya Mukawa wrote:
>>> Hi Yuanhan,
>>>
>>> On 2016/06/02 16:31, Yuanhan Liu wrote:
>>>> But still, I'd ask do we really need 2 virtio for container solutions?
>>> I appreciate your comments.
>> No, I appreciate your effort for contributing to DPDK! vhost-pmd stuff
>> is just brilliant!
>>
>>> Let me have time to discuss it with our team.
>> I'm wondering could we have one solution only. IMO, the drawback of
>> having two (quite different) solutions might outweighs the benefit
>> it takes. Say, it might just confuse user.
> I agree with this.
> If we have 2 solutions, it would confuse the DPDK users.
>
>> OTOH, I'm wondering could you adapt to Jianfeng's solution? If not,
>> what's the missing parts, and could we fix it? I'm thinking having
>> one unified solution will keep ours energy/focus on one thing, making
>> it better and better! Having two just splits the energy; it also
>> introduces extra burden for maintaining.
> Of course, I adopt Jiangeng's solution basically.
> Actually, his solution is almost similar I tried to implement at first.
>
> I guess here is pros/cons of 2 solutions.
>
> [Jianfeng's solution]
> - Pros
> Don't need to invoke QEMU process.
> - Cons
> If virtio-net specification is changed, we need to implement it by
> ourselves.

It will barely introduce any change when the virtio-net specification is
changed, as far as I can see. The only part we care about is how the desc,
avail and used rings are laid out in memory, which is a very small part.
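
(For reference, the layout in question is roughly the split-ring format below,
as defined by the virtio specification; this is a condensed sketch, not a
verbatim copy of any DPDK or QEMU header.)

 #include <stdint.h>

 struct vring_desc {             /* descriptor table entry */
         uint64_t addr;          /* guest physical address of the buffer */
         uint32_t len;
         uint16_t flags;
         uint16_t next;
 };

 struct vring_avail {            /* driver -> device ring */
         uint16_t flags;
         uint16_t idx;
         uint16_t ring[];        /* queue-size entries */
 };

 struct vring_used_elem {
         uint32_t id;            /* head of the completed descriptor chain */
         uint32_t len;
 };

 struct vring_used {             /* device -> driver ring */
         uint16_t flags;
         uint16_t idx;
         struct vring_used_elem ring[];
 };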

It's true that my solution now seriously depends on the vhost-user protocol,
which is defined in QEMU. I cannot see a big problem there so far.

>   Also, LSC interrupt and control queue functions are not
> supported yet.
> I agree both functions may not be so important, and if we need it
> we can implement them, but we need to pay energy to implement them.

LSC is really less important than the rxq interrupt (IMO). We don't know how
long it will take for the rxq interrupt of virtio to become available in QEMU,
but we can accelerate it if we avoid using QEMU.

Actually, if the vhost backend is vhost-user (the main use case),
current QEMU has limited control queue support, because it needs
support from the vhost-user backend.

Let me add one more con of my solution:
- Need to write more logic to support other virtio devices (say
virtio-scsi); maybe it's easier for Tetsuya's solution to do that?

>
> [My solution]
> - Pros
> Basic principle of my implementation is not to reinvent the wheel.
> We can use a virtio-net device of QEMU implementation, it means we don't
> need to maintain virtio-net device by ourselves, and we can use all of
> functions supported by QEMU virtio-net device.
> - Cons
> Need to invoke QEMU process.

Two more possible cons:
a) This solution also needs to maintain qtest utility, right?
b) There's still address arrange restriction, right? Although we can use 
"--base-virtaddr=0x400000000" to relieve this question, but how about if 
there are 2 or more devices? (By the way, is there still address arrange 
requirement for 32 bit system)
c) Actually, IMO this solution is sensitive to any virtio spec change 
(io port, pci configuration space).

>
>
> Anyway, we can choose one of belows.
> 1. Take advantage of invoking less processes.
> 2. Take advantage of maintainability of virtio-net device.
>
> Honestly, I'm OK if my solution is not merged.
> Thus, it should be decided to let DPDK better.

Yes, agreed.

Thanks,
Jianfeng

>
> What do you think?
> Which is better for DPDK?
>
> Thanks,
> Tetsuya
>
>> 	--yliu
>>
  
Tetsuya Mukawa June 6, 2016, 8:33 a.m. UTC | #10
On 2016/06/06 16:21, Yuanhan Liu wrote:
> On Mon, Jun 06, 2016 at 02:10:46PM +0900, Tetsuya Mukawa wrote:
>> Hi Yuanhan,
>>
>> Sorry for late replying.
> 
> Never mind.
> 
>>
>> On 2016/06/03 13:17, Yuanhan Liu wrote:
>>> On Thu, Jun 02, 2016 at 06:30:18PM +0900, Tetsuya Mukawa wrote:
>>>> Hi Yuanhan,
>>>>
>>>> On 2016/06/02 16:31, Yuanhan Liu wrote:
>>>>> But still, I'd ask do we really need 2 virtio for container solutions?
>>>>
>>>> I appreciate your comments.
>>>
>>> No, I appreciate your effort for contributing to DPDK! vhost-pmd stuff
>>> is just brilliant!
>>>
>>>> Let me have time to discuss it with our team.
>>>
>>> I'm wondering could we have one solution only. IMO, the drawback of
>>> having two (quite different) solutions might outweighs the benefit
>>> it takes. Say, it might just confuse user.
>>
>> I agree with this.
>> If we have 2 solutions, it would confuse the DPDK users.
>>
>>>
>>> OTOH, I'm wondering could you adapt to Jianfeng's solution? If not,
>>> what's the missing parts, and could we fix it? I'm thinking having
>>> one unified solution will keep ours energy/focus on one thing, making
>>> it better and better! Having two just splits the energy; it also
>>> introduces extra burden for maintaining.
>>
>> Of course, I adopt Jiangeng's solution basically.
>> Actually, his solution is almost similar I tried to implement at first.
>>
>> I guess here is pros/cons of 2 solutions.
>>
>> [Jianfeng's solution]
>> - Pros
>> Don't need to invoke QEMU process.
>> - Cons
>> If virtio-net specification is changed, we need to implement it by
>> ourselves. Also, LSC interrupt and control queue functions are not
>> supported yet.
> 
> Jianfeng have made and sent out the patch to enable ctrl queue and
> multiple queue support.

Sorry, I hadn't noticed that the ctrl queue had already been enabled.

> 
> For the LSC part, no much idea yet so far. But I'm assuming it will
> not take too much effort, either.
> 
>> I agree both functions may not be so important, and if we need it
>> we can implement them, but we need to pay energy to implement them.
>>
>> [My solution]
>> - Pros
>> Basic principle of my implementation is not to reinvent the wheel.
> 
> Yes, that's a good point. However, it's not that hard as we would have
> thought in the first time: the tough part that dequeue/enqueue packets
> from/to vring is actually offloaded to DPDK vhost-user. That means we
> only need re-implement the control path of virtio-net device, plus the
> vhost-user frontend. If you have a detailed look of your patchset as
> well Jianfeng's, you might find that the two patchset are actually with
> same code size. 

Yes, I know.
So far, the amount of code is almost the same, but in the future we may need
to implement more if the virtio-net specification is revised.

> 
>> We can use a virtio-net device of QEMU implementation, it means we don't
>> need to maintain virtio-net device by ourselves, and we can use all of
>> functions supported by QEMU virtio-net device.
>> - Cons
>> Need to invoke QEMU process.
> 
> Another thing is that it makes the usage a bit harder: look at the
> long qemu cli options of your example usage. It also has some traps,
> say, "--enable-kvm" is not allowed, which is a default option used
> with QEMU.

Probably a kind of shell script will help the users.

> 
> And judging that we actually don't take too much effort to implement
> a virtio device emulation, I'd prefer it slightly. I guess something
> light weight and easier for use is more important here.

This is a very important point.
If so, we won't need much effort when the virtio spec is changed.

> 
> Actually, I have foreseen another benefit of adding virtio-user device
> emulation: we now might be able to add a rte_vhost_dequeue/enqueue_burst()
> unit test case. We simply can't do it before, since we depend on QEMU
> for testing, which is not acceptable for a unit test case. Making it
> be a unit test case would help us spotting any bad changes that would
> introduce bugs easily and automatically.

As you mentioned above, the QEMU process is not related to
dequeuing/enqueuing.
So I guess we can have a test for rte_vhost_dequeue/enqueue_burst()
regardless of the choice.

>> Anyway, we can choose one of belows.
>> 1. Take advantage of invoking less processes.
>> 2. Take advantage of maintainability of virtio-net device.

If the container usage that DPDK assumes is to invoke hundreds of containers
on one host, we should take Jianfeng's solution.

Also, if implementing new features and maintaining Jianfeng's
virtio-net device are not so hard, we should take his solution.

I guess this is the point we need to consider.
What do you think?

Thanks,
Tetsuya

>>
>> Honestly, I'm OK if my solution is not merged.
>> Thus, it should be decided to let DPDK better.
>>
>> What do you think?
>> Which is better for DPDK?
>>
>> Thanks,
>> Tetsuya
>>
>>>
>>> 	--yliu
>>>
  
Yuanhan Liu June 6, 2016, 8:49 a.m. UTC | #11
On Mon, Jun 06, 2016 at 05:33:31PM +0900, Tetsuya Mukawa wrote:
> >> [My solution]
> >> - Pros
> >> Basic principle of my implementation is not to reinvent the wheel.
> > 
> > Yes, that's a good point. However, it's not that hard as we would have
> > thought in the first time: the tough part that dequeue/enqueue packets
> > from/to vring is actually offloaded to DPDK vhost-user. That means we
> > only need re-implement the control path of virtio-net device, plus the
> > vhost-user frontend. If you have a detailed look of your patchset as
> > well Jianfeng's, you might find that the two patchset are actually with
> > same code size. 
> 
> Yes, I know this.
> So far, the amount of code is almost same, but in the future we may need
> to implement more, if virtio-net specification is revised.

It didn't take too much effort to implement from scratch, and I doubt it
will for future revisions. Besides, the virtio-net spec is unlikely to be
revised, or to be precise, unlikely to be revised often. Therefore, I don't
see big issues here.

> >> We can use a virtio-net device of QEMU implementation, it means we don't
> >> need to maintain virtio-net device by ourselves, and we can use all of
> >> functions supported by QEMU virtio-net device.
> >> - Cons
> >> Need to invoke QEMU process.
> > 
> > Another thing is that it makes the usage a bit harder: look at the
> > long qemu cli options of your example usage. It also has some traps,
> > say, "--enable-kvm" is not allowed, which is a default option used
> > with QEMU.
> 
> Probably a kind of shell script will help the users.

Yeah, that would help. But if we have a choice to make it simpler in the
beginning, why not then? :-)

> > 
> > And judging that we actually don't take too much effort to implement
> > a virtio device emulation, I'd prefer it slightly. I guess something
> > light weight and easier for use is more important here.
> 
> This is very important point.
> If so, we don't need much effort when virtio-spec is changed.

I'd assume so.

> > Actually, I have foreseen another benefit of adding virtio-user device
> > emulation: we now might be able to add a rte_vhost_dequeue/enqueue_burst()
> > unit test case. We simply can't do it before, since we depend on QEMU
> > for testing, which is not acceptable for a unit test case. Making it
> > be a unit test case would help us spotting any bad changes that would
> > introduce bugs easily and automatically.
> 
> As you mentioned above, QEMU process is not related with
> dequeuing/enqueuing.
> So I guess we may have a testing for rte_vhost_dequeue/enqueue_burst()
> regardless of choice.

Yes, we don't need the dequeue/enqueue part, but we need the vhost-user
initialization part from QEMU vhost-user. Now that we have vhost-user
frontend from virtio-user, we have no dependency on QEMU any more.

> >> Anyway, we can choose one of belows.
> >> 1. Take advantage of invoking less processes.
> >> 2. Take advantage of maintainability of virtio-net device.
> 
> If container usage that DPDK assumes is to invoke hundreds containers in
> one host,

I barely know about container, but I would assume that's not rare.

> we should take Jiangfeng's solution.
> 
> Also, if implementing a new feature and maintaining Jiangfeng's
> virtio-net device are not so hard,

As stated, I would assume so.

	--yliu

> we should take his solution.
> 
> I guess this is the point we need to consider.
> What do you think?
> 
> Thanks,
> Tetsuya
> 
> >>
> >> Honestly, I'm OK if my solution is not merged.
> >> Thus, it should be decided to let DPDK better.
> >>
> >> What do you think?
> >> Which is better for DPDK?
> >>
> >> Thanks,
> >> Tetsuya
> >>
> >>>
> >>> 	--yliu
> >>>
  
Tetsuya Mukawa June 6, 2016, 9:28 a.m. UTC | #12
On 2016/06/06 17:03, Tan, Jianfeng wrote:
> Hi,
> 
> 
> On 6/6/2016 1:10 PM, Tetsuya Mukawa wrote:
>> Hi Yuanhan,
>>
>> Sorry for late replying.
>>
>> On 2016/06/03 13:17, Yuanhan Liu wrote:
>>> On Thu, Jun 02, 2016 at 06:30:18PM +0900, Tetsuya Mukawa wrote:
>>>> Hi Yuanhan,
>>>>
>>>> On 2016/06/02 16:31, Yuanhan Liu wrote:
>>>>> But still, I'd ask do we really need 2 virtio for container solutions?
>>>> I appreciate your comments.
>>> No, I appreciate your effort for contributing to DPDK! vhost-pmd stuff
>>> is just brilliant!
>>>
>>>> Let me have time to discuss it with our team.
>>> I'm wondering could we have one solution only. IMO, the drawback of
>>> having two (quite different) solutions might outweighs the benefit
>>> it takes. Say, it might just confuse user.
>> I agree with this.
>> If we have 2 solutions, it would confuse the DPDK users.
>>
>>> OTOH, I'm wondering could you adapt to Jianfeng's solution? If not,
>>> what's the missing parts, and could we fix it? I'm thinking having
>>> one unified solution will keep ours energy/focus on one thing, making
>>> it better and better! Having two just splits the energy; it also
>>> introduces extra burden for maintaining.
>> Of course, I adopt Jiangeng's solution basically.
>> Actually, his solution is almost similar I tried to implement at first.
>>
>> I guess here is pros/cons of 2 solutions.
>>
>> [Jianfeng's solution]
>> - Pros
>> Don't need to invoke QEMU process.
>> - Cons
>> If virtio-net specification is changed, we need to implement it by
>> ourselves.
> 
> It will barely introduce any change when virtio-net specification is
> changed as far as I can see. The only part we care is the how desc,
> avail, used distribute on memory, which is a very small part.

That's good news, because we won't have to pay much effort to follow the
latest virtio-net specification.

> 
> It's true that my solution now seriously depend on vhost-user protocol,
> which is defined in QEMU. I cannot see a big problem there so far.
> 
>>   Also, LSC interrupt and control queue functions are not
>> supported yet.
>> I agree both functions may not be so important, and if we need it
>> we can implement them, but we need to pay energy to implement them.
> 
> LSC is really less important than rxq interrupt (IMO). We don't know how
> long will rxq interrupt of virtio be available for QEMU, but we can
> accelerate it if we avoid using QEMU.
> 
> Actually, if the vhost backend is vhost-user (the main use case),
> current qemu have limited control queue support, because it needs the
> support from the vhost user backend.
> 
> Add one more con of my solution:
> - Need to write another logic to support other virtio device (say
> virtio-scsi), if it's easier of Tetsuya's solution to do that?
> 

Probably, my solution makes it easier to do that.
My solution has enough facilities to access the I/O ports and PCI
configuration space of a QEMU virtio-scsi device.
So, if you invoke QEMU with virtio-scsi, all you need to do is
change the PCI interface of the current virtio-scsi PMD.
(I just assume we currently have a virtio-scsi PMD.)
If the virtio-scsi PMD works on QEMU, the same code should work with only
the PCI interface changed.

>>
>> [My solution]
>> - Pros
>> Basic principle of my implementation is not to reinvent the wheel.
>> We can use a virtio-net device of QEMU implementation, it means we don't
>> need to maintain virtio-net device by ourselves, and we can use all of
>> functions supported by QEMU virtio-net device.
>> - Cons
>> Need to invoke QEMU process.
> 
> Two more possible cons:
> a) This solution also needs to maintain qtest utility, right?

But the spec of qtest should be more stable than that of virtio-net.

> b) There's still address arrange restriction, right? Although we can use
> "--base-virtaddr=0x400000000" to relieve this question, but how about if
> there are 2 or more devices? (By the way, is there still address arrange
> requirement for 32 bit system)

Our solutions are virtio-net drivers, and a vhost-user backend driver
needs to access the memory allocated by the virtio-net driver.
If an application has 2 devices, it means 2 vhost-user backend PMDs need
to access the same application memory, right?
Also, currently each virtio-net device has one QEMU process.
So, I am not sure what the problem would be if we had 2 devices.

BTW, the 44-bit limitation comes from the current QEMU implementation itself.
(Actually, if a modern virtio device is used, we should be able to remove
the restriction.)

> c) Actually, IMO this solution is sensitive to any virtio spec change
> (io port, pci configuration space).

In this case, the virtio-net PMD itself will need to be fixed.
Then, my implementation will also be fixed in the same way.
The current implementation only has the PCI abstraction that Yuanhan
introduced, so you may think my solution depends on the above things, but
actually, my implementation only depends on how to access the I/O ports and
PCI configuration space. This is what "qtest.h" provides.

Thanks,
Tetsuya

> 
>>
>>
>> Anyway, we can choose one of belows.
>> 1. Take advantage of invoking less processes.
>> 2. Take advantage of maintainability of virtio-net device.
>>
>> Honestly, I'm OK if my solution is not merged.
>> Thus, it should be decided to let DPDK better.
> 
> Yes, agreed.
> 
> Thanks,
> Jianfeng
> 
>>
>> What do you think?
>> Which is better for DPDK?
>>
>> Thanks,
>> Tetsuya
>>
>>>     --yliu
>>>
>
  
Tetsuya Mukawa June 6, 2016, 9:30 a.m. UTC | #13
On 2016/06/06 17:49, Yuanhan Liu wrote:
> On Mon, Jun 06, 2016 at 05:33:31PM +0900, Tetsuya Mukawa wrote:
>>>> [My solution]
>>>> - Pros
>>>> Basic principle of my implementation is not to reinvent the wheel.
>>>
>>> Yes, that's a good point. However, it's not that hard as we would have
>>> thought in the first time: the tough part that dequeue/enqueue packets
>>> from/to vring is actually offloaded to DPDK vhost-user. That means we
>>> only need re-implement the control path of virtio-net device, plus the
>>> vhost-user frontend. If you have a detailed look of your patchset as
>>> well Jianfeng's, you might find that the two patchset are actually with
>>> same code size. 
>>
>> Yes, I know this.
>> So far, the amount of code is almost same, but in the future we may need
>> to implement more, if virtio-net specification is revised.
> 
> It didn't take too much effort to implement from scratch, I doubt it
> will for future revise. And, virtio-net spec is unlikely revised, or
> to be precisely, unlikely revised quite often. Therefore, I don't see
> big issues here.
> 
>>>> We can use a virtio-net device of QEMU implementation, it means we don't
>>>> need to maintain virtio-net device by ourselves, and we can use all of
>>>> functions supported by QEMU virtio-net device.
>>>> - Cons
>>>> Need to invoke QEMU process.
>>>
>>> Another thing is that it makes the usage a bit harder: look at the
>>> long qemu cli options of your example usage. It also has some traps,
>>> say, "--enable-kvm" is not allowed, which is a default option used
>>> with QEMU.
>>
>> Probably a kind of shell script will help the users.
> 
> Yeah, that would help. But if we have a choice to make it simpler in the
> beginning, why not then? :-)
> 
>>>
>>> And judging that we actually don't take too much effort to implement
>>> a virtio device emulation, I'd prefer it slightly. I guess something
>>> light weight and easier for use is more important here.
>>
>> This is very important point.
>> If so, we don't need much effort when virtio-spec is changed.
> 
> I'd assume so.
> 
>>> Actually, I have foreseen another benefit of adding virtio-user device
>>> emulation: we now might be able to add a rte_vhost_dequeue/enqueue_burst()
>>> unit test case. We simply can't do it before, since we depend on QEMU
>>> for testing, which is not acceptable for a unit test case. Making it
>>> be a unit test case would help us spotting any bad changes that would
>>> introduce bugs easily and automatically.
>>
>> As you mentioned above, QEMU process is not related with
>> dequeuing/enqueuing.
>> So I guess we may have a testing for rte_vhost_dequeue/enqueue_burst()
>> regardless of choice.
> 
> Yes, we don't need the dequeue/enqueue part, but we need the vhost-user
> initialization part from QEMU vhost-user. Now that we have vhost-user
> frontend from virtio-user, we have no dependency on QEMU any more.
> 
>>>> Anyway, we can choose one of belows.
>>>> 1. Take advantage of invoking less processes.
>>>> 2. Take advantage of maintainability of virtio-net device.
>>
>> If container usage that DPDK assumes is to invoke hundreds containers in
>> one host,
> 
> I barely know about container, but I would assume that's not rare.

Hi Yuanhan,

It's great to hear that it's not so hard to maintain Jianfeng's virtio-net
device features.

Please let me make sure how we can invoke many DPDK applications in
hundreds of containers.
(Do we have a way to do this? Or will we have one in the future?)

Thanks,
Tetsuya

> 
>> we should take Jiangfeng's solution.
>>
>> Also, if implementing a new feature and maintaining Jiangfeng's
>> virtio-net device are not so hard,
> 
> As stated, I would assume so.




> 
> 	--yliu
> 
>> we should take his solution.
>>
>> I guess this is the point we need to consider.
>> What do you think?
>>
>> Thanks,
>> Tetsuya
>>
>>>>
>>>> Honestly, I'm OK if my solution is not merged.
>>>> Thus, it should be decided to let DPDK better.
>>>>
>>>> What do you think?
>>>> Which is better for DPDK?
>>>>
>>>> Thanks,
>>>> Tetsuya
>>>>
>>>>>
>>>>> 	--yliu
>>>>>
  
Yuanhan Liu June 6, 2016, 9:58 a.m. UTC | #14
On Mon, Jun 06, 2016 at 06:30:00PM +0900, Tetsuya Mukawa wrote:
> On 2016/06/06 17:49, Yuanhan Liu wrote:
> > On Mon, Jun 06, 2016 at 05:33:31PM +0900, Tetsuya Mukawa wrote:
> >>>> [My solution]
> >>>> - Pros
> >>>> Basic principle of my implementation is not to reinvent the wheel.
> >>>
> >>> Yes, that's a good point. However, it's not that hard as we would have
> >>> thought in the first time: the tough part that dequeue/enqueue packets
> >>> from/to vring is actually offloaded to DPDK vhost-user. That means we
> >>> only need re-implement the control path of virtio-net device, plus the
> >>> vhost-user frontend. If you have a detailed look of your patchset as
> >>> well Jianfeng's, you might find that the two patchset are actually with
> >>> same code size. 
> >>
> >> Yes, I know this.
> >> So far, the amount of code is almost same, but in the future we may need
> >> to implement more, if virtio-net specification is revised.
> > 
> > It didn't take too much effort to implement from scratch, I doubt it
> > will for future revise. And, virtio-net spec is unlikely revised, or
> > to be precisely, unlikely revised quite often. Therefore, I don't see
> > big issues here.
> > 
> >>>> We can use a virtio-net device of QEMU implementation, it means we don't
> >>>> need to maintain virtio-net device by ourselves, and we can use all of
> >>>> functions supported by QEMU virtio-net device.
> >>>> - Cons
> >>>> Need to invoke QEMU process.
> >>>
> >>> Another thing is that it makes the usage a bit harder: look at the
> >>> long qemu cli options of your example usage. It also has some traps,
> >>> say, "--enable-kvm" is not allowed, which is a default option used
> >>> with QEMU.
> >>
> >> Probably a kind of shell script will help the users.
> > 
> > Yeah, that would help. But if we have a choice to make it simpler in the
> > beginning, why not then? :-)
> > 
> >>>
> >>> And judging that we actually don't take too much effort to implement
> >>> a virtio device emulation, I'd prefer it slightly. I guess something
> >>> light weight and easier for use is more important here.
> >>
> >> This is very important point.
> >> If so, we don't need much effort when virtio-spec is changed.
> > 
> > I'd assume so.
> > 
> >>> Actually, I have foreseen another benefit of adding virtio-user device
> >>> emulation: we now might be able to add a rte_vhost_dequeue/enqueue_burst()
> >>> unit test case. We simply can't do it before, since we depend on QEMU
> >>> for testing, which is not acceptable for a unit test case. Making it
> >>> be a unit test case would help us spotting any bad changes that would
> >>> introduce bugs easily and automatically.
> >>
> >> As you mentioned above, QEMU process is not related with
> >> dequeuing/enqueuing.
> >> So I guess we may have a testing for rte_vhost_dequeue/enqueue_burst()
> >> regardless of choice.
> > 
> > Yes, we don't need the dequeue/enqueue part, but we need the vhost-user
> > initialization part from QEMU vhost-user. Now that we have vhost-user
> > frontend from virtio-user, we have no dependency on QEMU any more.
> > 
> >>>> Anyway, we can choose one of belows.
> >>>> 1. Take advantage of invoking less processes.
> >>>> 2. Take advantage of maintainability of virtio-net device.
> >>
> >> If container usage that DPDK assumes is to invoke hundreds containers in
> >> one host,
> > 
> > I barely know about container, but I would assume that's not rare.
> 
> Hi Yuanhan,
> 
> It's great to hear it's not so hard to maintain Jiangfeng's virtio-net
> device features.
> 
> Please let me make sure how we can invoke many DPDK applications in
> hundreds containers.
> (Do we have a way to do? Or, will we have it in the future?)

One thing that I have thought of is that we should remove the hugepage
dependency of the current usage: hugepages would be a very limited resource.

Note that I don't mean to remove hugepage support; DPDK supports
that by default and supports it well, after all. What I mean is to make
it work for the non-hugepage case as well, so that it could fit
the hundreds-of-containers case.

	--yliu
  
Jianfeng Tan June 6, 2016, 10:35 a.m. UTC | #15
Hi,

On 6/6/2016 5:28 PM, Tetsuya Mukawa wrote:
> On 2016/06/06 17:03, Tan, Jianfeng wrote:
>> Hi,
>>
>>
>> On 6/6/2016 1:10 PM, Tetsuya Mukawa wrote:
>>> Hi Yuanhan,
>>>
>>> Sorry for late replying.
>>>
>>> On 2016/06/03 13:17, Yuanhan Liu wrote:
>>>> On Thu, Jun 02, 2016 at 06:30:18PM +0900, Tetsuya Mukawa wrote:
>>>>> Hi Yuanhan,
>>>>>
>>>>> On 2016/06/02 16:31, Yuanhan Liu wrote:
>>>>>> But still, I'd ask do we really need 2 virtio for container solutions?
>>>>> I appreciate your comments.
>>>> No, I appreciate your effort for contributing to DPDK! vhost-pmd stuff
>>>> is just brilliant!
>>>>
>>>>> Let me have time to discuss it with our team.
>>>> I'm wondering could we have one solution only. IMO, the drawback of
>>>> having two (quite different) solutions might outweighs the benefit
>>>> it takes. Say, it might just confuse user.
>>> I agree with this.
>>> If we have 2 solutions, it would confuse the DPDK users.
>>>
>>>> OTOH, I'm wondering could you adapt to Jianfeng's solution? If not,
>>>> what's the missing parts, and could we fix it? I'm thinking having
>>>> one unified solution will keep ours energy/focus on one thing, making
>>>> it better and better! Having two just splits the energy; it also
>>>> introduces extra burden for maintaining.
>>> Of course, I adopt Jiangeng's solution basically.
>>> Actually, his solution is almost similar I tried to implement at first.
>>>
>>> I guess here is pros/cons of 2 solutions.
>>>
>>> [Jianfeng's solution]
>>> - Pros
>>> Don't need to invoke QEMU process.
>>> - Cons
>>> If virtio-net specification is changed, we need to implement it by
>>> ourselves.
>> It will barely introduce any change when virtio-net specification is
>> changed as far as I can see. The only part we care is the how desc,
>> avail, used distribute on memory, which is a very small part.
> It's a good news, because we don't pay much effort to follow latest
> virtio-net specification.
>
>> It's true that my solution now seriously depend on vhost-user protocol,
>> which is defined in QEMU. I cannot see a big problem there so far.
>>
>>>    Also, LSC interrupt and control queue functions are not
>>> supported yet.
>>> I agree both functions may not be so important, and if we need it
>>> we can implement them, but we need to pay energy to implement them.
>> LSC is really less important than rxq interrupt (IMO). We don't know how
>> long will rxq interrupt of virtio be available for QEMU, but we can
>> accelerate it if we avoid using QEMU.
>>
>> Actually, if the vhost backend is vhost-user (the main use case),
>> current qemu have limited control queue support, because it needs the
>> support from the vhost user backend.
>>
>> Add one more con of my solution:
>> - Need to write another logic to support other virtio device (say
>> virtio-scsi), if it's easier of Tetsuya's solution to do that?
>>
> Probably, my solution will be easier to do that.
> My solution has enough facility to access to io port and PCI
> configuration space of virtio-scsi device of QEMU.
> So, if you invoke with QEMU with virtio-scsi, only you need to do is
> changing PCI interface of current virtio-scsi PMD.
> (I just assume currently we have virtio-scsi PMD.)
> If the virtio-scsi PMD works on QEMU, same code should work with only
> changing PCI interface.
>
>>> [My solution]
>>> - Pros
>>> Basic principle of my implementation is not to reinvent the wheel.
>>> We can use a virtio-net device of QEMU implementation, it means we don't
>>> need to maintain virtio-net device by ourselves, and we can use all of
>>> functions supported by QEMU virtio-net device.
>>> - Cons
>>> Need to invoke QEMU process.
>> Two more possible cons:
>> a) This solution also needs to maintain qtest utility, right?
> But the spec of qtest will be more stable than virtio-net.
>
>> b) There's still address arrange restriction, right? Although we can use
>> "--base-virtaddr=0x400000000" to relieve this question, but how about if
>> there are 2 or more devices? (By the way, is there still address arrange
>> requirement for 32 bit system)
> Our solutions are a virtio-net driver, and a vhost-user backend driver
> needs to access to memory allocated by virtio-net driver.
> If an application has 2 devices, it means 2 vhost-user backend PMD needs
> to access to the same application memory, right?
> Also, currently each virtio-net device has an one QEMU process.
> So, I am not sure what will be problem if we have 2 devices.

OK, my bad. Multiple devices should have just one 
"--base-virtaddr=0x400000000".

>
> BTW, 44bits limitations comes from current QEMU implementation itself.
> (Actually, if modern virtio device is used, we should be able to remove
> the restriction.)

Good to know.

>
>> c) Actually, IMO this solution is sensitive to any virtio spec change
>> (io port, pci configuration space).
> In this case, virtio-net PMD itself will need to be fixed.
> Then, my implementation will be also fixed with the same way.
> Current implementation has only PCI abstraction that Yuanhan introduced,
> so you may think my solution depends on above things, but actually, my
> implementation depends on only how to access to io port and PCI
> configuration space. This is what "qtest.h" provides.

Gotcha.

Thanks,
Jianfeng
  
Jianfeng Tan June 6, 2016, 10:50 a.m. UTC | #16
Hi,


On 6/6/2016 5:30 PM, Tetsuya Mukawa wrote:
> On 2016/06/06 17:49, Yuanhan Liu wrote:
>> On Mon, Jun 06, 2016 at 05:33:31PM +0900, Tetsuya Mukawa wrote:
>>>>> [My solution]
>>>>> - Pros
>>>>> Basic principle of my implementation is not to reinvent the wheel.
>>>> Yes, that's a good point. However, it's not that hard as we would have
>>>> thought in the first time: the tough part that dequeue/enqueue packets
>>>> from/to vring is actually offloaded to DPDK vhost-user. That means we
>>>> only need re-implement the control path of virtio-net device, plus the
>>>> vhost-user frontend. If you have a detailed look of your patchset as
>>>> well Jianfeng's, you might find that the two patchset are actually with
>>>> same code size.
>>> Yes, I know this.
>>> So far, the amount of code is almost same, but in the future we may need
>>> to implement more, if virtio-net specification is revised.
>> It didn't take too much effort to implement from scratch, I doubt it
>> will for future revise. And, virtio-net spec is unlikely revised, or
>> to be precisely, unlikely revised quite often. Therefore, I don't see
>> big issues here.
>>
>>>>> We can use a virtio-net device of QEMU implementation, it means we don't
>>>>> need to maintain virtio-net device by ourselves, and we can use all of
>>>>> functions supported by QEMU virtio-net device.
>>>>> - Cons
>>>>> Need to invoke QEMU process.
>>>> Another thing is that it makes the usage a bit harder: look at the
>>>> long qemu cli options of your example usage. It also has some traps,
>>>> say, "--enable-kvm" is not allowed, which is a default option used
>>>> with QEMU.
>>> Probably a kind of shell script will help the users.
>> Yeah, that would help. But if we have a choice to make it simpler in the
>> beginning, why not then? :-)
>>
>>>> And judging that we actually don't take too much effort to implement
>>>> a virtio device emulation, I'd prefer it slightly. I guess something
>>>> light weight and easier for use is more important here.
>>> This is very important point.
>>> If so, we don't need much effort when virtio-spec is changed.
>> I'd assume so.
>>
>>>> Actually, I have foreseen another benefit of adding virtio-user device
>>>> emulation: we now might be able to add a rte_vhost_dequeue/enqueue_burst()
>>>> unit test case. We simply can't do it before, since we depend on QEMU
>>>> for testing, which is not acceptable for a unit test case. Making it
>>>> be a unit test case would help us spotting any bad changes that would
>>>> introduce bugs easily and automatically.
>>> As you mentioned above, QEMU process is not related with
>>> dequeuing/enqueuing.
>>> So I guess we may have a testing for rte_vhost_dequeue/enqueue_burst()
>>> regardless of choice.
>> Yes, we don't need the dequeue/enqueue part, but we need the vhost-user
>> initialization part from QEMU vhost-user. Now that we have vhost-user
>> frontend from virtio-user, we have no dependency on QEMU any more.
>>
>>>>> Anyway, we can choose one of belows.
>>>>> 1. Take advantage of invoking less processes.
>>>>> 2. Take advantage of maintainability of virtio-net device.
>>> If container usage that DPDK assumes is to invoke hundreds containers in
>>> one host,
>> I barely know about container, but I would assume that's not rare.
> Hi Yuanhan,
>
> It's great to hear it's not so hard to maintain Jiangfeng's virtio-net
> device features.
>
> Please let me make sure how we can invoke many DPDK applications in
> hundreds containers.
> (Do we have a way to do? Or, will we have it in the future?)

Just to add some options here: we cannot say no to that kind of use case.
To have many instances, we can:

(1) add a "cpu share" restriction on each instance, relying on the kernel
to schedule.
(2) enable interrupt mode, so that one instance can go to sleep when it
has no packets to receive and be woken by the vhost backend when packets come.

Option 2 is my choice.

Thanks,
Jianfeng

>
> Thanks,
> Tetsuya
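
A minimal sketch of the interrupt-mode receive loop from option (2) above,
assuming the Rx-interrupt APIs used by DPDK's l3fwd-power example
(rte_eth_dev_rx_intr_ctl_q(), rte_eth_dev_rx_intr_enable()/_disable() and
rte_epoll_wait()) and a port configured with intr_conf.rxq = 1:

#include <rte_ethdev.h>
#include <rte_interrupts.h>

/* Poll the queue; when it is empty, arm the Rx interrupt and sleep
 * until the vhost backend signals that packets have arrived. */
static void
rx_loop_with_sleep(uint8_t port, uint16_t queue)
{
	struct rte_mbuf *pkts[32];
	struct rte_epoll_event ev;
	uint16_t nb;

	/* Add the Rx queue interrupt to this lcore's epoll set. */
	rte_eth_dev_rx_intr_ctl_q(port, queue, RTE_EPOLL_PER_THREAD,
				  RTE_INTR_EVENT_ADD, NULL);

	for (;;) {
		nb = rte_eth_rx_burst(port, queue, pkts, 32);
		if (nb == 0) {
			/* Queue empty: arm the interrupt and block. */
			rte_eth_dev_rx_intr_enable(port, queue);
			rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, -1);
			rte_eth_dev_rx_intr_disable(port, queue);
			continue;
		}
		/* ... process and free the 'nb' received mbufs ... */
	}
}

With this pattern an idle instance consumes no CPU until the vhost backend
kicks the queue's interrupt eventfd.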
  
Tetsuya Mukawa June 7, 2016, 7:12 a.m. UTC | #17
On 2016/06/06 19:50, Tan, Jianfeng wrote:
>> Please let me make sure how we can invoke many DPDK applications in
>> hundreds of containers.
>> (Do we have a way to do that? Or will we have one in the future?)
> 
> Just to add some options here: we cannot say no to that kind of use case.
> To have many instances, we can:
> 
> (1) add a "cpu share" restriction on each instance, relying on the kernel
> to schedule them.
> (2) enable interrupt mode, so that an instance can go to sleep when it
> has no packets to receive and be woken by the vhost backend when packets come.
> 
> Option 2 is my choice.

Hi Yuanhan and Jianfeng,

Thanks for your descriptions of how you will invoke many DPDK
applications in containers.
I guess we have discussed almost everything we need to consider to choose
one of the container implementations.

We may have one conclusion about this choice.
If we can easily maintain the virtio device implementation, and if we have
a use case for invoking hundreds of DPDK applications in containers, I
guess Jianfeng's implementation will be nice.
Anyway, we will just follow the virtio maintainers' choice.

Thanks,
Tetsuya

> 
> Thanks,
> Jianfeng
> 
>>
>> Thanks,
>> Tetsuya
> 
>
  
Yuanhan Liu June 7, 2016, 7:33 a.m. UTC | #18
On Tue, Jun 07, 2016 at 04:12:28PM +0900, Tetsuya Mukawa wrote:
> On 2016/06/06 19:50, Tan, Jianfeng wrote:
> >> Please let me make sure how we can invoke many DPDK applications in
> >> hundreds of containers.
> >> (Do we have a way to do that? Or will we have one in the future?)
> > 
> > Just to add some options here: we cannot say no to that kind of use case.
> > To have many instances, we can:
> > 
> > (1) add a "cpu share" restriction on each instance, relying on the kernel
> > to schedule them.
> > (2) enable interrupt mode, so that an instance can go to sleep when it
> > has no packets to receive and be woken by the vhost backend when packets come.
> > 
> > Option 2 is my choice.
> 
> Hi Yuanhan and Jianfeng,
> 
> Thanks for your descriptions of how you will invoke many DPDK
> applications in containers.
> I guess we have discussed almost everything we need to consider to choose
> one of the container implementations.
> 
> We may have one conclusion about this choice.
> If we can easily maintain the virtio device implementation,

AFAIK, yes.

> and if we have
> a use case for invoking hundreds of DPDK applications in containers, I

Don't know yet, but it seems easier to achieve that with Jianfeng's
solution.

> guess Jianfeng's implementation will be nice.

I'm afraid that's what I'm seeing.

> Anyway, we will just follow the virtio maintainers' choice.

Thanks, and of course, contributions are hugely welcome so that we can
have a better container solution!

	--yliu
  

Patch

diff --git a/drivers/net/virtio/qtest.h b/drivers/net/virtio/qtest.h
index 46b9ee6..421e62c 100644
--- a/drivers/net/virtio/qtest.h
+++ b/drivers/net/virtio/qtest.h
@@ -35,5 +35,60 @@ 
 #define _VIRTIO_QTEST_H_
 
 #define QTEST_DRV_NAME		        "eth_qtest_virtio"
+#define QTEST_DEVICE_NUM                3
+
+#include <linux/pci_regs.h>
+
+/* Device information */
+#define VIRTIO_NET_DEVICE_ID            0x1000
+#define VIRTIO_NET_VENDOR_ID            0x1af4
+#define VIRTIO_NET_IRQ_NUM              10
+#define IVSHMEM_DEVICE_ID               0x1110
+#define IVSHMEM_VENDOR_ID               0x1af4
+#define PIIX3_DEVICE_ID                 0x7000
+#define PIIX3_VENDOR_ID                 0x8086
+
+/* ------------------------------------------------------------
+ * IO port mapping of qtest guest
+ * ------------------------------------------------------------
+ * 0x0000 - 0xbfff : not used
+ * 0xc000 - 0xc03f : virtio-net(BAR0)
+ * 0xc040 - 0xffff : not used
+ *
+ * ------------------------------------------------------------
+ * Memory mapping of qtest guest
+ * ------------------------------------------------------------
+ * 0x00000000_00000000 - 0x00000000_3fffffff : not used
+ * 0x00000000_40000000 - 0x00000000_40000fff : virtio-net(BAR1)
+ * 0x00000000_40001000 - 0x00000000_40ffffff : not used
+ * 0x00000000_41000000 - 0x00000000_417fffff : virtio-net(BAR4)
+ * 0x00000000_41800000 - 0x00000000_41ffffff : not used
+ * 0x00000000_42000000 - 0x00000000_420000ff : ivshmem(BAR0)
+ * 0x00000000_42000100 - 0x00000000_42ffffff : not used
+ * 0x00000000_80000000 - 0xffffffff_ffffffff : ivshmem(BAR2)
+ *
+ * We can only specify the start address of a region. The region size
+ * will be defined by the device implementation in QEMU.
+ * The size will be a power of 2, according to the PCI specification.
+ * Also, the region start address should be aligned to the region size.
+ *
+ * BAR2 of ivshmem will be used to mmap DPDK application memory.
+ * So this address will change dynamically, but to avoid overlapping the
+ * other regions, it should be mmapped between the above addresses. Such
+ * allocation is done by EAL. See also rte_eal_get_free_region().
+ */
+#define VIRTIO_NET_IO_START             0xc000
+#define VIRTIO_NET_MEMORY1_START	0x40000000
+#define VIRTIO_NET_MEMORY2_START	0x41000000
+#define IVSHMEM_MEMORY_START            0x42000000
+
+static inline struct rte_pci_id
+qtest_get_pci_id_of_virtio_net(void)
+{
+	struct rte_pci_id id = {VIRTIO_NET_VENDOR_ID,
+		VIRTIO_NET_DEVICE_ID, PCI_ANY_ID, PCI_ANY_ID};
+
+	return id;
+}
 
 #endif /* _VIRTIO_QTEST_H_ */
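
For illustration, the BAR rule described in the layout comment above (a
region size is rounded up to a power of two and its start address must be
aligned to that size) amounts to checks like the following; the helper
names are hypothetical and not taken from the patch. The same rule is why
the EAL memory exposed through ivshmem BAR2 must be a power of two in size.

#include <stdint.h>

/* Round a BAR length up to the power-of-two region size QEMU will use. */
static inline uint64_t
qtest_bar_region_size(uint64_t len)
{
	uint64_t size = 1;

	while (size < len)
		size <<= 1;
	return size;
}

/* A region start address is usable only if it is aligned to that size. */
static inline int
qtest_bar_start_aligned(uint64_t start, uint64_t size)
{
	return (start & (size - 1)) == 0;
}
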
diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 747596d..4e454db 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -60,6 +60,10 @@ 
 #include "virtqueue.h"
 #include "virtio_rxtx.h"
 
+#ifdef RTE_VIRTIO_VDEV_QTEST
+#include "qtest.h"
+#include "qtest_utils.h"
+#endif
 
 static int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
 static int eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev);
@@ -387,7 +391,7 @@  int virtio_dev_queue_setup(struct rte_eth_dev *dev,
 			return -ENOMEM;
 		}
 	}
-#ifdef RTE_VIRTIO_VDEV
+#if defined(RTE_VIRTIO_VDEV) || defined(RTE_VIRTIO_VDEV_QTEST)
 	else
 		vq->vq_ring_mem = (phys_addr_t)mz->addr; /* Use vaddr!!! */
 #endif
@@ -431,7 +435,7 @@  int virtio_dev_queue_setup(struct rte_eth_dev *dev,
 
 		if (virtio_dev_check(dev, RTE_ETH_DEV_PCI, NULL, 0))
 			vq->virtio_net_hdr_mem = mz->phys_addr;
-#ifdef RTE_VIRTIO_VDEV
+#if defined(RTE_VIRTIO_VDEV) || defined(RTE_VIRTIO_VDEV_QTEST)
 		else
 			vq->virtio_net_hdr_mem = (phys_addr_t)mz->addr;
 #endif
@@ -441,7 +445,7 @@  int virtio_dev_queue_setup(struct rte_eth_dev *dev,
 
 	if (virtio_dev_check(dev, RTE_ETH_DEV_PCI, NULL, 0))
 		vq->offset = offsetof(struct rte_mbuf, buf_physaddr);
-#ifdef RTE_VIRTIO_VDEV
+#if defined(RTE_VIRTIO_VDEV) || defined(RTE_VIRTIO_VDEV_QTEST)
 	else
 		vq->offset = offsetof(struct rte_mbuf, buf_addr);
 #endif
@@ -999,6 +1003,23 @@  virtio_interrupt_handler(__rte_unused struct rte_intr_handle *handle,
 	isr = vtpci_isr(hw);
 	PMD_DRV_LOG(INFO, "interrupt status = %#x", isr);
 
+#ifdef RTE_VIRTIO_VDEV_QTEST
+	if (virtio_dev_check(dev, RTE_ETH_DEV_VIRTUAL, QTEST_DRV_NAME, 0)) {
+		if (qtest_intr_enable(hw->qsession) < 0)
+			PMD_DRV_LOG(ERR, "interrupt enable failed");
+		/*
+		 * If the last qtest message is an interrupt, 'isr' will be 0
+		 * because the socket has already been closed.
+		 * But we still want to notify EAL of this event,
+		 * so just ignore the isr value.
+		 */
+		if (virtio_dev_link_update(dev, 0) == 0)
+			_rte_eth_dev_callback_process(dev,
+					RTE_ETH_EVENT_INTR_LSC);
+		return;
+	}
+#endif
+
 	if (virtio_dev_check(dev, RTE_ETH_DEV_PCI, NULL, 0))
 		if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0)
 			PMD_DRV_LOG(ERR, "interrupt enable failed");
@@ -1058,6 +1079,13 @@  eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		if (vtpci_init(eth_dev, hw) < 0)
 			return -1;
 	}
+#ifdef RTE_VIRTIO_VDEV_QTEST
+	else if (virtio_dev_check(eth_dev,
+				RTE_ETH_DEV_VIRTUAL, QTEST_DRV_NAME, 0)) {
+		if (vtpci_init(eth_dev, hw) < 0)
+			return -1;
+	}
+#endif
 
 	/* Reset the device although not necessary at startup */
 	vtpci_reset(hw);
@@ -1077,6 +1105,13 @@  eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 
 		rte_eth_copy_pci_info(eth_dev, pci_dev);
 	}
+#ifdef RTE_VIRTIO_VDEV_QTEST
+	else if (virtio_dev_check(eth_dev,
+				RTE_ETH_DEV_VIRTUAL, QTEST_DRV_NAME, 0)) {
+		if (!vtpci_with_feature(hw, VIRTIO_NET_F_STATUS))
+			pci_dev->driver->drv_flags &= ~RTE_PCI_DRV_INTR_LSC;
+	}
+#endif
 
 	rx_func_get(eth_dev);
 
@@ -1165,6 +1200,26 @@  eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 						   virtio_interrupt_handler,
 						   eth_dev);
 	}
+#ifdef RTE_VIRTIO_VDEV_QTEST
+	else if (virtio_dev_check(eth_dev, RTE_ETH_DEV_VIRTUAL,
+				QTEST_DRV_NAME, 0)) {
+		struct rte_pci_id id;
+
+		id = qtest_get_pci_id_of_virtio_net();
+		RTE_SET_USED(id);
+
+		PMD_INIT_LOG(DEBUG, "port %d vendorID=0x%x deviceID=0x%x",
+				eth_dev->data->port_id,
+				id.vendor_id, id.device_id);
+
+		/* Setup interrupt callback  */
+		if (virtio_dev_check(eth_dev, RTE_ETH_DEV_VIRTUAL,
+					NULL, RTE_ETH_DEV_INTR_LSC))
+			qtest_intr_callback_register(hw->qsession,
+					virtio_interrupt_handler, eth_dev);
+	}
+#endif
+
 	virtio_dev_cq_start(eth_dev);
 
 	return 0;
@@ -1202,7 +1257,15 @@  eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 					     virtio_interrupt_handler,
 					     eth_dev);
 
-	rte_eal_pci_unmap_device(pci_dev);
+#ifdef RTE_VIRTIO_VDEV_QTEST
+	else if (virtio_dev_check(eth_dev, RTE_ETH_DEV_VIRTUAL,
+				QTEST_DRV_NAME, RTE_ETH_DEV_INTR_LSC))
+		qtest_intr_callback_unregister(hw->qsession,
+				virtio_interrupt_handler, eth_dev);
+#endif
+
+	if (virtio_dev_check(eth_dev, RTE_ETH_DEV_PCI, NULL, 0))
+		rte_eal_pci_unmap_device(pci_dev);
 
 	PMD_INIT_LOG(DEBUG, "dev_uninit completed");
 
@@ -1284,16 +1347,34 @@  virtio_dev_start(struct rte_eth_dev *dev)
 
 	/* check if lsc interrupt feature is enabled */
 	if (dev->data->dev_conf.intr_conf.lsc) {
-		if (!virtio_dev_check(dev, RTE_ETH_DEV_PCI,
-					NULL, RTE_PCI_DRV_INTR_LSC)) {
+		int pdev_has_lsc = 0, vdev_has_lsc = 0;
+
+		pdev_has_lsc = virtio_dev_check(dev, RTE_ETH_DEV_PCI,
+				NULL, RTE_PCI_DRV_INTR_LSC);
+#ifdef RTE_VIRTIO_VDEV_QTEST
+		vdev_has_lsc = virtio_dev_check(dev, RTE_ETH_DEV_VIRTUAL,
+				QTEST_DRV_NAME, RTE_ETH_DEV_INTR_LSC);
+#endif
+
+		if ((!pdev_has_lsc) && (!vdev_has_lsc)) {
 			PMD_DRV_LOG(ERR, "link status not supported by host");
 			return -ENOTSUP;
 		}
 
-		if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0) {
-			PMD_DRV_LOG(ERR, "interrupt enable failed");
-			return -EIO;
+		if (pdev_has_lsc) {
+			if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0) {
+				PMD_DRV_LOG(ERR, "interrupt enable failed");
+				return -EIO;
+			}
 		}
+#ifdef RTE_VIRTIO_VDEV_QTEST
+		else if (vdev_has_lsc) {
+			if (qtest_intr_enable(hw->qsession) < 0) {
+				PMD_DRV_LOG(ERR, "interrupt enable failed");
+				return -EIO;
+			}
+		}
+#endif
 	}
 
 	/* Initialize Link state */
@@ -1387,11 +1468,20 @@  static void
 virtio_dev_stop(struct rte_eth_dev *dev)
 {
 	struct rte_eth_link link;
+	struct virtio_hw *hw = dev->data->dev_private;
 
 	PMD_INIT_LOG(DEBUG, "stop");
+	RTE_SET_USED(hw);
 
-	if (dev->data->dev_conf.intr_conf.lsc)
-		rte_intr_disable(&dev->pci_dev->intr_handle);
+	if (dev->data->dev_conf.intr_conf.lsc) {
+		if (virtio_dev_check(dev, RTE_ETH_DEV_PCI, NULL, 0))
+			rte_intr_disable(&dev->pci_dev->intr_handle);
+#ifdef RTE_VIRTIO_VDEV_QTEST
+		else if (virtio_dev_check(dev, RTE_ETH_DEV_VIRTUAL,
+					QTEST_DRV_NAME, 0))
+			qtest_intr_disable(hw->qsession);
+#endif
+	}
 
 	memset(&link, 0, sizeof(link));
 	virtio_dev_atomic_write_link_status(dev, &link);
@@ -1628,3 +1718,348 @@  static struct rte_driver rte_cvio_driver = {
 PMD_REGISTER_DRIVER(rte_cvio_driver);
 
 #endif
+
+#ifdef RTE_VIRTIO_VDEV_QTEST
+
+#define ETH_VIRTIO_NET_ARG_QTEST_PATH           "qtest"
+#define ETH_VIRTIO_NET_ARG_IVSHMEM_PATH         "ivshmem"
+#define ETH_VIRTIO_NET_ARG_VIRTIO_NET_ADDR      "virtio-net-addr"
+#define ETH_VIRTIO_NET_ARG_IVSHMEM_ADDR         "ivshmem-addr"
+#define ETH_VIRTIO_NET_ARG_PIIX3_ADDR           "piix3-addr"
+
+static const char *valid_qtest_args[] = {
+	ETH_VIRTIO_NET_ARG_QTEST_PATH,
+	ETH_VIRTIO_NET_ARG_IVSHMEM_PATH,
+	ETH_VIRTIO_NET_ARG_VIRTIO_NET_ADDR,
+	ETH_VIRTIO_NET_ARG_IVSHMEM_ADDR,
+	ETH_VIRTIO_NET_ARG_PIIX3_ADDR,
+	NULL
+};
+
+static int
+get_socket_path_arg(const char *key __rte_unused,
+		const char *value, void *extra_args)
+{
+	char **p;
+
+	if ((value == NULL) || (extra_args == NULL))
+		return -EINVAL;
+
+	p = extra_args;
+	*p = strdup(value);
+
+	if (*p == NULL)
+		return -1;
+
+	return 0;
+}
+
+static int
+get_pci_addr_arg(const char *key __rte_unused,
+		const char *value, void *extra_args)
+{
+	struct rte_pci_addr *addr = extra_args;
+
+	if ((value == NULL) || (extra_args == NULL))
+		return -EINVAL;
+
+	if (eal_parse_pci_DomBDF(value, addr) != 0)
+		return -1;
+
+	if (addr->domain != 0)
+		return -1;
+
+	return 0;
+}
+
+static int
+virtio_net_eth_dev_free(struct rte_eth_dev *eth_dev)
+{
+	struct virtio_hw *hw;
+	int ret;
+
+	ret = rte_eth_dev_release_port(eth_dev);
+	if (ret < 0) {
+		PMD_INIT_LOG(ERR, "cannot release a port\n");
+		return -1;
+	}
+
+	hw = eth_dev->data->dev_private;
+	rte_free(hw);
+
+	return 0;
+}
+
+static struct rte_eth_dev *
+virtio_net_eth_dev_alloc(const char *name)
+{
+	struct rte_eth_dev *eth_dev;
+	struct rte_eth_dev_data *data;
+	struct virtio_hw *hw;
+	int ret;
+
+	eth_dev = rte_eth_dev_allocate(name, RTE_ETH_DEV_VIRTUAL);
+	if (eth_dev == NULL) {
+		PMD_INIT_LOG(ERR, "cannot alloc a port\n");
+		return NULL;
+	}
+
+	data = eth_dev->data;
+
+	hw = rte_zmalloc(NULL, sizeof(*hw), 0);
+	if (hw == NULL) {
+		PMD_INIT_LOG(ERR, "malloc virtio_hw failed\n");
+		ret = rte_eth_dev_release_port(eth_dev);
+		if (ret < 0)
+			rte_panic("cannot release a port");
+		return NULL;
+	}
+
+	data->dev_private = hw;
+	eth_dev->driver = &rte_virtio_pmd;
+	return eth_dev;
+}
+
+static int
+virtio_net_eth_pmd_parse_socket_path(struct rte_kvargs *kvlist,
+		const char *option, char **path)
+{
+	int ret;
+
+	if (rte_kvargs_count(kvlist, option) == 1) {
+		ret = rte_kvargs_process(kvlist, option,
+				&get_socket_path_arg, path);
+		if (ret != 0) {
+			PMD_INIT_LOG(ERR,
+					"Failed to connect to %s socket", option);
+			return -1;
+		}
+	} else {
+		PMD_INIT_LOG(ERR, "No argument specified for %s", option);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+virtio_net_eth_pmd_parse_pci_addr(struct rte_kvargs *kvlist,
+		const char *option, struct rte_pci_addr *addr,
+		struct rte_pci_addr *default_addr)
+{
+	int ret;
+
+	if (rte_kvargs_count(kvlist, option) == 1) {
+		ret = rte_kvargs_process(kvlist, option,
+				&get_pci_addr_arg, addr);
+		if (ret != 0) {
+			PMD_INIT_LOG(ERR,
+					"Specified invalid address in '%s'", option);
+			return -1;
+		}
+	} else
+		/* copy default pci address */
+		*addr = *default_addr;
+
+	return 0;
+}
+
+static int
+virtio_prepare_target_devices(struct qtest_pci_device *devices,
+			struct rte_kvargs *kvlist)
+{
+	struct qtest_pci_device *virtio_net, *ivshmem, *piix3;
+	struct rte_pci_addr default_addr;
+	const struct rte_memseg *ms;
+	int ret;
+
+	ms = rte_eal_get_physmem_layout();
+	/* if EAL memory size isn't pow of 2, ivshmem will refuse it */
+	if ((ms[0].len & (ms[0].len - 1)) != 0) {
+		PMD_DRV_LOG(ERR, "memory size must be power of 2\n");
+		return -1;
+	}
+
+	virtio_net = &devices[0];
+	ivshmem = &devices[1];
+	piix3 = &devices[2];
+
+	virtio_net->name = "virtio-net";
+	virtio_net->device_id = VIRTIO_NET_DEVICE_ID;
+	virtio_net->vendor_id = VIRTIO_NET_VENDOR_ID;
+	virtio_net->init = qtest_init_pci_device;
+	virtio_net->bar[0].addr = PCI_BASE_ADDRESS_0;
+	virtio_net->bar[0].type = QTEST_PCI_BAR_IO;
+	virtio_net->bar[0].region_start = VIRTIO_NET_IO_START;
+	virtio_net->bar[1].addr = PCI_BASE_ADDRESS_1;
+	virtio_net->bar[1].type = QTEST_PCI_BAR_MEMORY_32;
+	virtio_net->bar[1].region_start = VIRTIO_NET_MEMORY1_START;
+	virtio_net->bar[4].addr = PCI_BASE_ADDRESS_4;
+	virtio_net->bar[4].type = QTEST_PCI_BAR_MEMORY_64;
+	virtio_net->bar[4].region_start = VIRTIO_NET_MEMORY2_START;
+
+	ivshmem->name = "ivshmem";
+	ivshmem->device_id = IVSHMEM_DEVICE_ID;
+	ivshmem->vendor_id = IVSHMEM_VENDOR_ID;
+	ivshmem->init = qtest_init_pci_device;
+	ivshmem->bar[0].addr = PCI_BASE_ADDRESS_0;
+	ivshmem->bar[0].type = QTEST_PCI_BAR_MEMORY_32;
+	ivshmem->bar[0].region_start = IVSHMEM_MEMORY_START;
+	ivshmem->bar[2].addr = PCI_BASE_ADDRESS_2;
+	ivshmem->bar[2].type = QTEST_PCI_BAR_MEMORY_64;
+	/* In host mode, only one memory segment is valid */
+	ivshmem->bar[2].region_start = (uint64_t)ms[0].addr;
+
+	/* piix3 is needed to route irqs from virtio-net to ioapic */
+	piix3->name = "piix3";
+	piix3->device_id = PIIX3_DEVICE_ID;
+	piix3->vendor_id = PIIX3_VENDOR_ID;
+	piix3->init = qtest_init_piix3_device;
+
+	/*
+	 * Set the pci addresses specified on the command line.
+	 * QTest utils will only check the specified pci addresses.
+	 * If an address is wrong, the target device won't be found.
+	 */
+	default_addr.domain = 0;
+	default_addr.bus = 0;
+	default_addr.function = 0;
+
+	default_addr.devid = 3;
+	ret = virtio_net_eth_pmd_parse_pci_addr(kvlist,
+			ETH_VIRTIO_NET_ARG_VIRTIO_NET_ADDR,
+			&virtio_net->specified_addr, &default_addr);
+	if (ret < 0)
+		return -1;
+
+	default_addr.devid = 4;
+	ret = virtio_net_eth_pmd_parse_pci_addr(kvlist,
+			ETH_VIRTIO_NET_ARG_IVSHMEM_ADDR,
+			&ivshmem->specified_addr, &default_addr);
+	if (ret < 0)
+		return -1;
+
+	default_addr.devid = 1;
+	ret = virtio_net_eth_pmd_parse_pci_addr(kvlist,
+			ETH_VIRTIO_NET_ARG_PIIX3_ADDR,
+			&piix3->specified_addr, &default_addr);
+	if (ret < 0)
+		return -1;
+
+	return 0;
+}
+/*
+ * Initialization when "CONFIG_RTE_VIRTIO_VDEV_QTEST" is enabled.
+ */
+static int
+rte_qtest_virtio_pmd_init(const char *name, const char *params)
+{
+	struct rte_kvargs *kvlist;
+	struct virtio_hw *hw = NULL;
+	struct rte_eth_dev *eth_dev = NULL;
+	char *qtest_path = NULL, *ivshmem_path = NULL;
+	struct qtest_pci_device devices[QTEST_DEVICE_NUM];
+	int ret;
+
+	if (params == NULL || params[0] == '\0')
+		return -EINVAL;
+
+	kvlist = rte_kvargs_parse(params, valid_qtest_args);
+	if (kvlist == NULL) {
+		PMD_INIT_LOG(ERR, "error when parsing param");
+		return -EFAULT;
+	}
+
+	ret = virtio_net_eth_pmd_parse_socket_path(kvlist,
+			ETH_VIRTIO_NET_ARG_IVSHMEM_PATH, &ivshmem_path);
+	if (ret < 0)
+		goto error;
+
+	ret = virtio_net_eth_pmd_parse_socket_path(kvlist,
+			ETH_VIRTIO_NET_ARG_QTEST_PATH, &qtest_path);
+	if (ret < 0)
+		goto error;
+
+	ret = virtio_prepare_target_devices(devices, kvlist);
+	if (ret < 0)
+		goto error;
+
+	eth_dev = virtio_net_eth_dev_alloc(name);
+	if (eth_dev == NULL)
+		goto error;
+
+	hw = eth_dev->data->dev_private;
+	hw->qsession = qtest_vdev_init(qtest_path, ivshmem_path,
+			VIRTIO_NET_IRQ_NUM, devices, QTEST_DEVICE_NUM);
+	if (hw->qsession == NULL)
+		goto error;
+
+	/* normally, this would be called from rte_eal_pci_probe() */
+	ret = eth_virtio_dev_init(eth_dev);
+	if (ret < 0)
+		goto error;
+
+	eth_dev->driver = NULL;
+	eth_dev->data->dev_flags |= RTE_ETH_DEV_DETACHABLE;
+	eth_dev->data->kdrv = RTE_KDRV_NONE;
+	eth_dev->data->drv_name = QTEST_DRV_NAME;
+
+	free(qtest_path);
+	free(ivshmem_path);
+	rte_kvargs_free(kvlist);
+	return 0;
+
+error:
+	if (hw != NULL && hw->qsession != NULL)
+		qtest_vdev_uninit(hw->qsession);
+	if (eth_dev)
+		virtio_net_eth_dev_free(eth_dev);
+	if (qtest_path)
+		free(qtest_path);
+	if (ivshmem_path)
+		free(ivshmem_path);
+	rte_kvargs_free(kvlist);
+	return -EFAULT;
+}
+
+/*
+ * Finalization when "CONFIG_RTE_VIRTIO_VDEV_QTEST" is enabled.
+ */
+static int
+rte_qtest_virtio_pmd_uninit(const char *name)
+{
+	struct rte_eth_dev *eth_dev;
+	struct virtio_hw *hw;
+	int ret;
+
+	if (name == NULL)
+		return -EINVAL;
+
+	/* find the ethdev entry */
+	eth_dev = rte_eth_dev_allocated(name);
+	if (eth_dev == NULL)
+		return -ENODEV;
+
+	ret = eth_virtio_dev_uninit(eth_dev);
+	if (ret != 0)
+		return -EFAULT;
+
+	hw = eth_dev->data->dev_private;
+	qtest_vdev_uninit(hw->qsession);
+
+	ret = virtio_net_eth_dev_free(eth_dev);
+	if (ret != 0)
+		return -EFAULT;
+
+	return 0;
+}
+
+static struct rte_driver rte_qtest_virtio_driver = {
+	.name   = QTEST_DRV_NAME,
+	.type   = PMD_VDEV,
+	.init   = rte_qtest_virtio_pmd_init,
+	.uninit = rte_qtest_virtio_pmd_uninit,
+};
+
+PMD_REGISTER_DRIVER(rte_qtest_virtio_driver);
+#endif /* RTE_VIRTIO_VDEV_QTEST */
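
As a usage note, the vdev registered above can also be attached at runtime
rather than through the --vdev EAL option. A minimal sketch, assuming the
rte_eth_dev_attach() hotplug API available in this DPDK release; the socket
paths are placeholders and must match the ones given to QEMU:

#include <rte_ethdev.h>

/* Attach one qtest virtio port at runtime. The "qtest" and "ivshmem"
 * kvargs must point to the sockets the QEMU process was started with;
 * the paths below are placeholders. */
static int
attach_qtest_virtio_port(void)
{
	uint8_t port_id;
	const char *devargs = "eth_qtest_virtio0,"
		"qtest=/tmp/qtest0.sock,ivshmem=/tmp/ivshmem0.sock";

	if (rte_eth_dev_attach(devargs, &port_id) != 0)
		return -1;

	return port_id;
}

Detaching would go through rte_eth_dev_detach(), which ends up calling
rte_qtest_virtio_pmd_uninit() above.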