[dpdk-dev,RFC,5/5] virtio: Extend virtio-net PMD to support container environment

Message ID 1453374478-30996-6-git-send-email-mukawa@igel.co.jp (mailing list archive)
State Changes Requested, archived
Headers

Commit Message

Tetsuya Mukawa Jan. 21, 2016, 11:07 a.m. UTC
  virtio: Extend virtio-net PMD to support container environment

The patch adds a new virtio-net PMD configuration that allows the PMD to
work on the host as if it were running in a VM.
Here is the new configuration for the virtio-net PMD.
 - CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE
To use this mode, EAL needs physically contiguous memory. To allocate
such memory, add the "--shm" option to the application command line.

To prepare a virtio-net device on the host, the user needs to invoke a
QEMU process in the special qtest mode. This mode is mainly used for
testing QEMU devices from an external process. In this mode, no guest runs.
Here is the QEMU command line.

 $ qemu-system-x86_64 \
             -machine pc-i440fx-1.4,accel=qtest \
             -display none -qtest-log /dev/null \
             -qtest unix:/tmp/socket,server \
             -netdev type=tap,script=/etc/qemu-ifup,id=net0,queues=1\
             -device virtio-net-pci,netdev=net0,mq=on \
             -chardev socket,id=chr1,path=/tmp/ivshmem,server \
             -device ivshmem,size=1G,chardev=chr1,vectors=1

 * One QEMU process is needed per port.
 * Virtio-1.0 devices are supported.
 * In most cases, just using the above command is enough.
 * Vhost backends like vhost-net and vhost-user can be specified.
 * Only the "pc-i440fx-1.4" machine has been checked, but it may work
   with other machines, as long as the machine has a piix3 south bridge.
   Without it, the virtio-net PMD cannot receive status-change interrupts.
 * Do not add "--enable-kvm" to the QEMU command line.

After invoking QEMU, the PMD can connect to the QEMU process using unix
domain sockets. Over these sockets, the virtio-net, ivshmem and piix3
devices in QEMU are probed by the PMD.
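For reference, qtest exposes a simple line-based text protocol over the
socket. A PCI config read, for instance, could look roughly like the
exchange below (illustrative values; see QEMU's qtest documentation for
the exact command set and reply format):

 PMD:  outl 0xcf8 0x80001000     <- select bus 0, dev 2, func 0, reg 0
 QEMU: OK
 PMD:  inl 0xcfc
 QEMU: OK 0x10001af4             <- device ID 0x1000, vendor ID 0x1af4
 QEMU: IRQ raise 10              <- asynchronous interrupt notification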
Here is an example command line.

 $ testpmd -c f -n 1 -m 1024 --shm \
      --vdev="eth_virtio_net0,qtest=/tmp/socket,ivshmem=/tmp/ivshmem"\
      -- --disable-hw-vlan --txqflags=0xf00 -i

Please specify the same unix domain sockets and memory size in both the
QEMU and DPDK command lines, as above.
The shared memory size should be a power of 2, because ivshmem only
accepts such memory sizes.

Signed-off-by: Tetsuya Mukawa <mukawa@igel.co.jp>
---
 config/common_linuxapp             |    1 +
 drivers/net/virtio/Makefile        |    4 +
 drivers/net/virtio/qtest.c         | 1237 ++++++++++++++++++++++++++++++++++++
 drivers/net/virtio/virtio_ethdev.c |  450 ++++++++++---
 drivers/net/virtio/virtio_ethdev.h |   12 +
 drivers/net/virtio/virtio_pci.c    |  190 +++++-
 drivers/net/virtio/virtio_pci.h    |   16 +
 drivers/net/virtio/virtio_rxtx.c   |    3 +-
 drivers/net/virtio/virtqueue.h     |    9 +-
 9 files changed, 1845 insertions(+), 77 deletions(-)
 create mode 100644 drivers/net/virtio/qtest.c
  

Comments

Huawei Xie Jan. 22, 2016, 8:14 a.m. UTC | #1
On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
> virtio: Extend virtio-net PMD to support container environment
>
> The patch adds a new virtio-net PMD configuration that allows the PMD to
> work on host as if the PMD is in VM.
> Here is new configuration for virtio-net PMD.
>  - CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE
> To use this mode, EAL needs physically contiguous memory. To allocate
> such memory, add "--shm" option to application command line.
>
> To prepare virtio-net device on host, the users need to invoke QEMU
> process in special qtest mode. This mode is mainly used for testing QEMU
> devices from outer process. In this mode, no guest runs.
> Here is QEMU command line.
>
>  $ qemu-system-x86_64 \
>              -machine pc-i440fx-1.4,accel=qtest \
>              -display none -qtest-log /dev/null \
>              -qtest unix:/tmp/socket,server \
>              -netdev type=tap,script=/etc/qemu-ifup,id=net0,queues=1\
>              -device virtio-net-pci,netdev=net0,mq=on \
>              -chardev socket,id=chr1,path=/tmp/ivshmem,server \
>              -device ivshmem,size=1G,chardev=chr1,vectors=1
>
>  * QEMU process is needed per port.

Does qtest support hot-plugging a virtio-net pci device, so that we could
run one QEMU process on the host, which provisions the virtio-net virtual
devices for the container?
  
Tetsuya Mukawa Jan. 22, 2016, 10:37 a.m. UTC | #2
On 2016/01/22 17:14, Xie, Huawei wrote:
> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>> virtio: Extend virtio-net PMD to support container environment
>>
>> The patch adds a new virtio-net PMD configuration that allows the PMD to
>> work on host as if the PMD is in VM.
>> Here is new configuration for virtio-net PMD.
>>  - CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE
>> To use this mode, EAL needs physically contiguous memory. To allocate
>> such memory, add "--shm" option to application command line.
>>
>> To prepare virtio-net device on host, the users need to invoke QEMU
>> process in special qtest mode. This mode is mainly used for testing QEMU
>> devices from outer process. In this mode, no guest runs.
>> Here is QEMU command line.
>>
>>  $ qemu-system-x86_64 \
>>              -machine pc-i440fx-1.4,accel=qtest \
>>              -display none -qtest-log /dev/null \
>>              -qtest unix:/tmp/socket,server \
>>              -netdev type=tap,script=/etc/qemu-ifup,id=net0,queues=1\
>>              -device virtio-net-pci,netdev=net0,mq=on \
>>              -chardev socket,id=chr1,path=/tmp/ivshmem,server \
>>              -device ivshmem,size=1G,chardev=chr1,vectors=1
>>
>>  * QEMU process is needed per port.
> Does qtest supports hot plug virtio-net pci device, so that we could run
> one QEMU process in host, which provisions the virtio-net virtual
> devices for the container?

Theoretically, we can use hot plugging in some cases.
But I guess we have 3 concerns here.

1. Security.
If we share a QEMU process between multiple DPDK applications, this QEMU
process will have all the fds of the applications in different containers.
In some cases, this will be a security concern.
So, I guess we need to support the current 1:1 configuration at least.

2. Shared memory.
Currently, QEMU and the DPDK application map shared memory using the same
virtual address.
So if multiple DPDK applications connect to one QEMU process, each DPDK
application would need a different address for the shared memory. I guess
this will be a big limitation.

3. PCI bridge.
So far, QEMU has one PCI bridge, so we can connect roughly 10 PCI devices
to QEMU.
(I forget the exact number, but it's roughly 10, because some slots are
reserved by QEMU.)
A DPDK application needs both a virtio-net and an ivshmem device, so I
guess roughly 5 DPDK applications can connect to one QEMU process, so far.
Adding more PCI bridges would solve this.
But we would need a lot of additional implementation to support cascaded
PCI bridges and PCI devices.
(Also, we would need to solve the 2nd concern above.)

Anyway, if we use the virtio-net PMD and the vhost-user PMD, the QEMU
process will not do anything after initialization.
(QEMU will try to read the qtest socket, then block because there are no
messages after initialization.)
So I guess we can ignore the overhead of these QEMU processes.
If someone cannot ignore it, I guess this is one of the cases where it's
nice to use your lightweight container implementation.

Thanks,
Tetsuya
  
Huawei Xie Jan. 25, 2016, 10:15 a.m. UTC | #3
On 1/22/2016 6:38 PM, Tetsuya Mukawa wrote:
> On 2016/01/22 17:14, Xie, Huawei wrote:
>> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>>> virtio: Extend virtio-net PMD to support container environment
>>>
>>> The patch adds a new virtio-net PMD configuration that allows the PMD to
>>> work on host as if the PMD is in VM.
>>> Here is new configuration for virtio-net PMD.
>>>  - CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE
>>> To use this mode, EAL needs physically contiguous memory. To allocate
>>> such memory, add "--shm" option to application command line.
>>>
>>> To prepare virtio-net device on host, the users need to invoke QEMU
>>> process in special qtest mode. This mode is mainly used for testing QEMU
>>> devices from outer process. In this mode, no guest runs.
>>> Here is QEMU command line.
>>>
>>>  $ qemu-system-x86_64 \
>>>              -machine pc-i440fx-1.4,accel=qtest \
>>>              -display none -qtest-log /dev/null \
>>>              -qtest unix:/tmp/socket,server \
>>>              -netdev type=tap,script=/etc/qemu-ifup,id=net0,queues=1\
>>>              -device virtio-net-pci,netdev=net0,mq=on \
>>>              -chardev socket,id=chr1,path=/tmp/ivshmem,server \
>>>              -device ivshmem,size=1G,chardev=chr1,vectors=1
>>>
>>>  * QEMU process is needed per port.
>> Does qtest supports hot plug virtio-net pci device, so that we could run
>> one QEMU process in host, which provisions the virtio-net virtual
>> devices for the container?
> Theoretically, we can use hot plug in some cases.
> But I guess we have 3 concerns here.
>
> 1. Security.
> If we share QEMU process between multiple DPDK applications, this QEMU
> process will have all fds of  the applications on different containers.
> In some cases, it will be security concern.
> So, I guess we need to support current 1:1 configuration at least.
>
> 2. shared memory.
> Currently, QEMU and DPDK application will map shared memory using same
> virtual address.
> So if multiple DPDK application connects to one QEMU process, each DPDK
> application should have different address for shared memory. I guess
> this will be a big limitation.
>
> 3. PCI bridge.
> So far, QEMU has one PCI bridge, so we can connect almost 10 PCI devices
> to QEMU.
> (I forget correct number, but it's almost 10, because some slots are
> reserved by QEMU)
> A DPDK application needs both virtio-net and ivshmem device, so I guess
> almost 5 DPDK applications can connect to one QEMU process, so far.
> To add more PCI bridges solves this.
> But we need to add a lot of implementation to support cascaded PCI
> bridges and PCI devices.
> (Also we need to solve above "2nd" concern.)
>
> Anyway, if we use virtio-net PMD and vhost-user PMD, QEMU process will
> not do anything after initialization.
> (QEMU will try to read a qtest socket, then be stopped because there is
> no message after initialization)
> So I guess we can ignore overhead of these QEMU processes.
> If someone cannot ignore it, I guess this is the one of cases that it's
> nice to use your light weight container implementation.

Thanks for the explanation. Also, in your opinion, where is the best
place to run the QEMU instance? If we run the QEMU instances on the host,
for vhost-kernel support we could get rid of the root privilege issue.

Another issue: do you plan to support multiple virtio devices in a
container? Currently I find the code assumes only one virtio-net device
in QEMU, right?

Btw, I have read most of your qtest code. No obvious issues found so far,
just quite a few nits. You must have spent a lot of time on this.
It is great work!

> Thanks,
> Tetsuya
>
  
Huawei Xie Jan. 25, 2016, 10:17 a.m. UTC | #4
On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
> +static void
> +qtest_handle_one_message(struct qtest_session *s, char *buf)
> +{
> +	int ret;
> +
> +	if (strncmp(buf, interrupt_message, strlen(interrupt_message)) == 0) {
> +		if (rte_atomic16_read(&s->enable_intr) == 0)
> +			return;
> +
> +		/* relay interrupt to pipe */
> +		ret = write(s->irqfds.writefd, "1", 1);

How about the interrupt latency? Seems it is quite long.

> +		if (ret < 0)
> +			rte_panic("cannot relay interrupt\n");
> +	} else {
> +		/* relay normal message to pipe */
> +		ret = qtest_raw_send(s->msgfds.writefd, buf, strlen(buf));
> +		if (ret < 0)
> +			rte_panic("cannot relay normal message\n");
> +	}
> +}
  
Huawei Xie Jan. 25, 2016, 10:29 a.m. UTC | #5
On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
> +#define PCI_CONFIG_ADDR(_bus, _device, _function, _offset) ( \
> +	(1 << 31) | ((_bus) & 0xff) << 16 | ((_device) & 0x1f) << 11 | \
> +	((_function) & 0xf) << 8 | ((_offset) & 0xfc))

(_function) & 0x7 ?
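
(In the standard CONFIG_ADDRESS layout the function number occupies bits
10:8, i.e. 3 bits, so the corrected macro would presumably be:)

	#define PCI_CONFIG_ADDR(_bus, _device, _function, _offset) ( \
		(1 << 31) | ((_bus) & 0xff) << 16 | ((_device) & 0x1f) << 11 | \
		((_function) & 0x7) << 8 | ((_offset) & 0xfc))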
  
Tetsuya Mukawa Jan. 26, 2016, 2:58 a.m. UTC | #6
On 2016/01/25 19:15, Xie, Huawei wrote:
> On 1/22/2016 6:38 PM, Tetsuya Mukawa wrote:
>> On 2016/01/22 17:14, Xie, Huawei wrote:
>>> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>>>> virtio: Extend virtio-net PMD to support container environment
>>>>
>>>> The patch adds a new virtio-net PMD configuration that allows the PMD to
>>>> work on host as if the PMD is in VM.
>>>> Here is new configuration for virtio-net PMD.
>>>>  - CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE
>>>> To use this mode, EAL needs physically contiguous memory. To allocate
>>>> such memory, add "--shm" option to application command line.
>>>>
>>>> To prepare virtio-net device on host, the users need to invoke QEMU
>>>> process in special qtest mode. This mode is mainly used for testing QEMU
>>>> devices from outer process. In this mode, no guest runs.
>>>> Here is QEMU command line.
>>>>
>>>>  $ qemu-system-x86_64 \
>>>>              -machine pc-i440fx-1.4,accel=qtest \
>>>>              -display none -qtest-log /dev/null \
>>>>              -qtest unix:/tmp/socket,server \
>>>>              -netdev type=tap,script=/etc/qemu-ifup,id=net0,queues=1\
>>>>              -device virtio-net-pci,netdev=net0,mq=on \
>>>>              -chardev socket,id=chr1,path=/tmp/ivshmem,server \
>>>>              -device ivshmem,size=1G,chardev=chr1,vectors=1
>>>>
>>>>  * QEMU process is needed per port.
>>> Does qtest supports hot plug virtio-net pci device, so that we could run
>>> one QEMU process in host, which provisions the virtio-net virtual
>>> devices for the container?
>> Theoretically, we can use hot plug in some cases.
>> But I guess we have 3 concerns here.
>>
>> 1. Security.
>> If we share QEMU process between multiple DPDK applications, this QEMU
>> process will have all fds of  the applications on different containers.
>> In some cases, it will be security concern.
>> So, I guess we need to support current 1:1 configuration at least.
>>
>> 2. shared memory.
>> Currently, QEMU and DPDK application will map shared memory using same
>> virtual address.
>> So if multiple DPDK application connects to one QEMU process, each DPDK
>> application should have different address for shared memory. I guess
>> this will be a big limitation.
>>
>> 3. PCI bridge.
>> So far, QEMU has one PCI bridge, so we can connect almost 10 PCI devices
>> to QEMU.
>> (I forget correct number, but it's almost 10, because some slots are
>> reserved by QEMU)
>> A DPDK application needs both virtio-net and ivshmem device, so I guess
>> almost 5 DPDK applications can connect to one QEMU process, so far.
>> To add more PCI bridges solves this.
>> But we need to add a lot of implementation to support cascaded PCI
>> bridges and PCI devices.
>> (Also we need to solve above "2nd" concern.)
>>
>> Anyway, if we use virtio-net PMD and vhost-user PMD, QEMU process will
>> not do anything after initialization.
>> (QEMU will try to read a qtest socket, then be stopped because there is
>> no message after initialization)
>> So I guess we can ignore overhead of these QEMU processes.
>> If someone cannot ignore it, I guess this is the one of cases that it's
>> nice to use your light weight container implementation.
> Thanks for the explanation, and also in your opinion where is the best
> place to run the QEMU instance? If we run QEMU instances in host, for
> vhost-kernel support, we could get rid of the root privilege issue.

Do you mean the following?
If we deploy the QEMU instance on the host, we can start a container
without root privilege.
(But on the host, the QEMU instance still needs the privilege to access
vhost-kernel.)

If so, I agree that deploying the QEMU instance on the host or in another
privileged container would be nice.
In the case of vhost-user, deploying on the host or in a non-privileged
container would be good.

>
> Another issue is do you plan to support multiple virtio devices in
> container? Currently i find the code assuming only one virtio-net device
> in QEMU, right?

Yes, so far, 1 port needs 1 QEMU instance.
So if you need multiple virtio devices, you need to invoke multiple QEMU
instances.

Do you want to deploy 1 QEMU instance for each DPDK application, even if
the application has multiple virtio-net ports?

So far, I am not sure whether we need it, because this type of DPDK
application will need only one port in most cases.
But if you need this, yes, I can implement it using the QEMU PCI hotplug
feature.
(But probably we can only attach roughly 10 ports. This will be a
limitation.)

>
> Btw, i have read most of your qtest code. No obvious issues found so far
> but quite a couple of nits. You must have spent a lot of time on this.
> It is great work!

I appreciate your review!

BTW, my container implementation needed a QEMU patch in the case of
vhost-user.
But the patch has been merged into upstream QEMU, so we don't have this
limitation any more.

Thanks,
Tetsuya
  
Tetsuya Mukawa Jan. 26, 2016, 2:58 a.m. UTC | #7
On 2016/01/25 19:17, Xie, Huawei wrote:
> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>> +static void
>> +qtest_handle_one_message(struct qtest_session *s, char *buf)
>> +{
>> +	int ret;
>> +
>> +	if (strncmp(buf, interrupt_message, strlen(interrupt_message)) == 0) {
>> +		if (rte_atomic16_read(&s->enable_intr) == 0)
>> +			return;
>> +
>> +		/* relay interrupt to pipe */
>> +		ret = write(s->irqfds.writefd, "1", 1);
> How about the interrupt latency? Seems it is quite long.

Yes, I agree with that.
Probably using eventfd or removing this read/write mechanism for handling
interrupts would be nice.
Let me check it further.
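
A minimal sketch of the eventfd idea (untested; the field name is a
placeholder for the current irqfds pipe):

	#include <sys/eventfd.h>

	/* at session setup, instead of creating the pipe */
	s->irq_efd = eventfd(0, EFD_CLOEXEC);

	/* in qtest_handle_one_message(), instead of write(writefd, "1", 1) */
	eventfd_write(s->irq_efd, 1);

	/* in the interrupt handler, instead of reading the pipe */
	eventfd_t val;
	eventfd_read(s->irq_efd, &val);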

Tetsuya
  
Tetsuya Mukawa Jan. 26, 2016, 2:58 a.m. UTC | #8
On 2016/01/25 19:29, Xie, Huawei wrote:
> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>> +#define PCI_CONFIG_ADDR(_bus, _device, _function, _offset) ( \
>> +	(1 << 31) | ((_bus) & 0xff) << 16 | ((_device) & 0x1f) << 11 | \
>> +	((_function) & 0xf) << 8 | ((_offset) & 0xfc))
> (_function) & 0x7 ?

Yes, you are correct.
I will fix it.

Thanks,
Tetsuya
  
Huawei Xie Jan. 27, 2016, 9:39 a.m. UTC | #9
On 1/26/2016 10:58 AM, Tetsuya Mukawa wrote:
> On 2016/01/25 19:15, Xie, Huawei wrote:
>> On 1/22/2016 6:38 PM, Tetsuya Mukawa wrote:
>>> On 2016/01/22 17:14, Xie, Huawei wrote:
>>>> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>>>>> virtio: Extend virtio-net PMD to support container environment
>>>>>
>>>>> The patch adds a new virtio-net PMD configuration that allows the PMD to
>>>>> work on host as if the PMD is in VM.
>>>>> Here is new configuration for virtio-net PMD.
>>>>>  - CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE
>>>>> To use this mode, EAL needs physically contiguous memory. To allocate
>>>>> such memory, add "--shm" option to application command line.
>>>>>
>>>>> To prepare virtio-net device on host, the users need to invoke QEMU
>>>>> process in special qtest mode. This mode is mainly used for testing QEMU
>>>>> devices from outer process. In this mode, no guest runs.
>>>>> Here is QEMU command line.
>>>>>
>>>>>  $ qemu-system-x86_64 \
>>>>>              -machine pc-i440fx-1.4,accel=qtest \
>>>>>              -display none -qtest-log /dev/null \
>>>>>              -qtest unix:/tmp/socket,server \
>>>>>              -netdev type=tap,script=/etc/qemu-ifup,id=net0,queues=1\
>>>>>              -device virtio-net-pci,netdev=net0,mq=on \
>>>>>              -chardev socket,id=chr1,path=/tmp/ivshmem,server \
>>>>>              -device ivshmem,size=1G,chardev=chr1,vectors=1
>>>>>
>>>>>  * QEMU process is needed per port.
>>>> Does qtest supports hot plug virtio-net pci device, so that we could run
>>>> one QEMU process in host, which provisions the virtio-net virtual
>>>> devices for the container?
>>> Theoretically, we can use hot plug in some cases.
>>> But I guess we have 3 concerns here.
>>>
>>> 1. Security.
>>> If we share QEMU process between multiple DPDK applications, this QEMU
>>> process will have all fds of  the applications on different containers.
>>> In some cases, it will be security concern.
>>> So, I guess we need to support current 1:1 configuration at least.
>>>
>>> 2. shared memory.
>>> Currently, QEMU and DPDK application will map shared memory using same
>>> virtual address.
>>> So if multiple DPDK application connects to one QEMU process, each DPDK
>>> application should have different address for shared memory. I guess
>>> this will be a big limitation.
>>>
>>> 3. PCI bridge.
>>> So far, QEMU has one PCI bridge, so we can connect almost 10 PCI devices
>>> to QEMU.
>>> (I forget correct number, but it's almost 10, because some slots are
>>> reserved by QEMU)
>>> A DPDK application needs both virtio-net and ivshmem device, so I guess
>>> almost 5 DPDK applications can connect to one QEMU process, so far.
>>> To add more PCI bridges solves this.
>>> But we need to add a lot of implementation to support cascaded PCI
>>> bridges and PCI devices.
>>> (Also we need to solve above "2nd" concern.)
>>>
>>> Anyway, if we use virtio-net PMD and vhost-user PMD, QEMU process will
>>> not do anything after initialization.
>>> (QEMU will try to read a qtest socket, then be stopped because there is
>>> no message after initialization)
>>> So I guess we can ignore overhead of these QEMU processes.
>>> If someone cannot ignore it, I guess this is the one of cases that it's
>>> nice to use your light weight container implementation.
>> Thanks for the explanation, and also in your opinion where is the best
>> place to run the QEMU instance? If we run QEMU instances in host, for
>> vhost-kernel support, we could get rid of the root privilege issue.
> Do you mean below?
> If we deploy QEMU instance on host, we can start a container without the
> root privilege.
> (But on host, still QEMU instance needs the privilege to access to
> vhost-kernel)

There is no issue running the QEMU instance with root privilege on the
host, but I think it is not acceptable to grant the container root
privilege.

>
> If so, I agree to deploy QEMU instance on host or other privileged
> container will be nice.
> In the case of vhost-user, to deploy on host or non-privileged container
> will be good.
>
>> Another issue is do you plan to support multiple virtio devices in
>> container? Currently i find the code assuming only one virtio-net device
>> in QEMU, right?
> Yes, so far, 1 port needs 1 QEMU instance.
> So if you need multiple virtio devices, you need to invoke multiple QEMU
> instances.
>
> Do you want to deploy 1 QEMU instance for each DPDK application, even if
> the application has multiple virtio-net ports?
>
> So far, I am not sure whether we need it, because this type of DPDK
> application will need only one port in most cases.
> But if you need this, yes, I can implement using QEMU PCI hotplug feature.
> (But probably we can only attach almost 10 ports. This will be limitation.)

I am OK with supporting one virtio device for the first version.

>
>> Btw, i have read most of your qtest code. No obvious issues found so far
>> but quite a couple of nits. You must have spent a lot of time on this.
>> It is great work!
> I appreciate your reviewing!
>
> BTW, my container implementation needed a QEMU patch in the case of
> vhost-user.
> But the patch has been merged in upstream QEMU, so we don't have this
> limitation any more.

Great. Better to put the QEMU dependency information in the commit message.
>
> Thanks,
> Tetsuya
>
  
Huawei Xie Jan. 27, 2016, 10:03 a.m. UTC | #10
On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
> +	/* Set BAR region */
> +	for (i = 0; i < NB_BAR; i++) {
> +		switch (dev->bar[i].type) {
> +		case QTEST_PCI_BAR_IO:
> +		case QTEST_PCI_BAR_MEMORY_UNDER_1MB:
> +		case QTEST_PCI_BAR_MEMORY_32:
> +			qtest_pci_outl(s, bus, device, 0, dev->bar[i].addr,
> +				dev->bar[i].region_start);
> +			PMD_DRV_LOG(INFO, "Set BAR of %s device: 0x%lx - 0x%lx\n",
> +				dev->name, dev->bar[i].region_start,
> +				dev->bar[i].region_start + dev->bar[i].region_size);
> +			break;
> +		case QTEST_PCI_BAR_MEMORY_64:
> +			qtest_pci_outq(s, bus, device, 0, dev->bar[i].addr,
> +				dev->bar[i].region_start);
> +			PMD_DRV_LOG(INFO, "Set BAR of %s device: 0x%lx - 0x%lx\n",
> +				dev->name, dev->bar[i].region_start,
> +				dev->bar[i].region_start + dev->bar[i].region_size);
> +			break;

Hasn't the bar resource already been allocated? Is it the app's
responsibility to allocate the bar resource in qtest mode? The app
couldn't have that knowledge.

> +		case QTEST_PCI_BAR_DISABLE:
> +			break;
> +		}
> +	}
> +
  
Huawei Xie Jan. 27, 2016, 3:58 p.m. UTC | #11
On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
[snip]
> +
> +static int
> +qtest_raw_recv(int fd, char *buf, size_t count)
> +{
> +	size_t len = count;
> +	size_t total_len = 0;
> +	int ret = 0;
> +
> +	while (len > 0) {
> +		ret = read(fd, buf, len);
> +		if (ret == (int)len)
> +			break;
> +		if (*(buf + ret - 1) == '\n')
> +			break;

The above two lines should be put after the below if block.

> +		if (ret == -1) {
> +			if (errno == EINTR)
> +				continue;
> +			return ret;
> +		}
> +		total_len += ret;
> +		buf += ret;
> +		len -= ret;
> +	}
> +	return total_len + ret;
> +}
> +

[snip]

> +
> +static void
> +qtest_handle_one_message(struct qtest_session *s, char *buf)
> +{
> +	int ret;
> +
> +	if (strncmp(buf, interrupt_message, strlen(interrupt_message)) == 0) {
> +		if (rte_atomic16_read(&s->enable_intr) == 0)
> +			return;
> +
> +		/* relay interrupt to pipe */
> +		ret = write(s->irqfds.writefd, "1", 1);
> +		if (ret < 0)
> +			rte_panic("cannot relay interrupt\n");
> +	} else {
> +		/* relay normal message to pipe */
> +		ret = qtest_raw_send(s->msgfds.writefd, buf, strlen(buf));
> +		if (ret < 0)
> +			rte_panic("cannot relay normal message\n");
> +	}
> +}
> +
> +static char *
> +qtest_get_next_message(char *p)
> +{
> +	p = strchr(p, '\n');
> +	if ((p == NULL) || (*(p + 1) == '\0'))
> +		return NULL;
> +	return p + 1;
> +}
> +
> +static void
> +qtest_close_one_socket(int *fd)
> +{
> +	if (*fd > 0) {
> +		close(*fd);
> +		*fd = -1;
> +	}
> +}
> +
> +static void
> +qtest_close_sockets(struct qtest_session *s)
> +{
> +	qtest_close_one_socket(&s->qtest_socket);
> +	qtest_close_one_socket(&s->msgfds.readfd);
> +	qtest_close_one_socket(&s->msgfds.writefd);
> +	qtest_close_one_socket(&s->irqfds.readfd);
> +	qtest_close_one_socket(&s->irqfds.writefd);
> +	qtest_close_one_socket(&s->ivshmem_socket);
> +}
> +
> +/*
> + * This thread relays QTest response using pipe.
> + * The function is needed because we need to separate IRQ message from others.
> + */
> +static void *
> +qtest_event_handler(void *data) {
> +	struct qtest_session *s = (struct qtest_session *)data;
> +	char buf[1024];
> +	char *p;
> +	int ret;
> +
> +	for (;;) {
> +		memset(buf, 0, sizeof(buf));
> +		ret = qtest_raw_recv(s->qtest_socket, buf, sizeof(buf));
> +		if (ret < 0) {
> +			qtest_close_sockets(s);
> +			return NULL;
> +		}
> +
> +		/* may receive multiple messages at the same time */

From the qtest_raw_recv implementation, if at some point one message is
split across two qtest_raw_recv calls, is that message discarded?
We could save the last incomplete message in a buffer, and combine it
with the message received next time.

> +		p = buf;
> +		do {
> +			qtest_handle_one_message(s, p);
> +		} while ((p = qtest_get_next_message(p)) != NULL);
> +	}
> +	return NULL;
> +}
> +
  
Huawei Xie Jan. 27, 2016, 4:45 p.m. UTC | #12
On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
> +qtest_find_pci_device(struct qtest_session *s, uint16_t bus, uint8_t device)
> +{
> +	struct qtest_pci_device *dev;
> +	uint32_t val;
> +
> +	val = qtest_pci_inl(s, bus, device, 0, 0);
> +	TAILQ_FOREACH(dev, &s->head, next) {
> +		if (val == ((uint32_t)dev->device_id << 16 | dev->vendor_id)) {
> +			dev->bus_addr = bus;
> +			dev->device_addr = device;
> +			return;
> +		}
> +
> +	}
> +}
> +
> +static int
> +qtest_init_pci_devices(struct qtest_session *s)
> +{
> +	struct qtest_pci_device *dev;
> +	uint16_t bus;
> +	uint8_t device;
> +	int ret;
> +
> +	/* Find devices */
> +	bus = 0;
> +	do {
> +		device = 0;
> +		do {
> +			qtest_find_pci_device(s, bus, device);
> +		} while (device++ != NB_DEVICE - 1);
> +	} while (bus++ != NB_BUS - 1);

This scan of all the pci devices seems to be a very time-consuming
operation, and each scan involves socket communication.
Did you measure how long the pci device initialization takes?

> +
> +	/* Initialize devices */
> +	TAILQ_FOREACH(dev, &s->head, next) {
> +		ret = dev->init(s, dev);
> +		if (ret != 0)
> +			return ret;
> +	}
> +
> +	return 0;
  
Tetsuya Mukawa Jan. 28, 2016, 2:33 a.m. UTC | #13
On 2016/01/27 18:39, Xie, Huawei wrote:
> On 1/26/2016 10:58 AM, Tetsuya Mukawa wrote:
>> On 2016/01/25 19:15, Xie, Huawei wrote:
>>
>> BTW, my container implementation needed a QEMU patch in the case of
>> vhost-user.
>> But the patch has been merged in upstream QEMU, so we don't have this
>> limitation any more.
> Great, better put the QEMU dependency information in the commit message

Thanks for all your comments and careful reviewing.

So far, I am not sure what the next QEMU version will be.
But I will add it after QEMU releases a new one.

Tetsuya
  
Tetsuya Mukawa Jan. 28, 2016, 2:44 a.m. UTC | #14
On 2016/01/27 19:03, Xie, Huawei wrote:
> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>> +	/* Set BAR region */
>> +	for (i = 0; i < NB_BAR; i++) {
>> +		switch (dev->bar[i].type) {
>> +		case QTEST_PCI_BAR_IO:
>> +		case QTEST_PCI_BAR_MEMORY_UNDER_1MB:
>> +		case QTEST_PCI_BAR_MEMORY_32:
>> +			qtest_pci_outl(s, bus, device, 0, dev->bar[i].addr,
>> +				dev->bar[i].region_start);
>> +			PMD_DRV_LOG(INFO, "Set BAR of %s device: 0x%lx - 0x%lx\n",
>> +				dev->name, dev->bar[i].region_start,
>> +				dev->bar[i].region_start + dev->bar[i].region_size);
>> +			break;
>> +		case QTEST_PCI_BAR_MEMORY_64:
>> +			qtest_pci_outq(s, bus, device, 0, dev->bar[i].addr,
>> +				dev->bar[i].region_start);
>> +			PMD_DRV_LOG(INFO, "Set BAR of %s device: 0x%lx - 0x%lx\n",
>> +				dev->name, dev->bar[i].region_start,
>> +				dev->bar[i].region_start + dev->bar[i].region_size);
>> +			break;
> Hasn't the bar resource already been allocated? Is it the app's
> responsibility to allocate the bar resource in qtest mode? The app
> couldn't have that knowledge.

Yes. In qtest mode, the app should register the above values.
(Without that, the default values are 0.)
Usually, this is done by the BIOS or UEFI, but in qtest mode, these
will not be invoked.
So we need to define the above values, and also need to enable the PCI
devices.

In this release, I just register hard-coded values except for one of the
ivshmem BARs.
In the next release, I will describe the memory map in a comment.

Tetsuya
  
Tetsuya Mukawa Jan. 28, 2016, 2:47 a.m. UTC | #15
On 2016/01/28 0:58, Xie, Huawei wrote:
> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
> [snip]
>> +
>> +static int
>> +qtest_raw_recv(int fd, char *buf, size_t count)
>> +{
>> +	size_t len = count;
>> +	size_t total_len = 0;
>> +	int ret = 0;
>> +
>> +	while (len > 0) {
>> +		ret = read(fd, buf, len);
>> +		if (ret == (int)len)
>> +			break;
>> +		if (*(buf + ret - 1) == '\n')
>> +			break;
> The above two lines should be put after the below if block.

Yes, it should be so.
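
For the record, the reordered loop would look roughly like this (error
handling otherwise unchanged, with a small guard for ret == 0):

	while (len > 0) {
		ret = read(fd, buf, len);
		if (ret == -1) {
			if (errno == EINTR)
				continue;
			return ret;
		}
		if (ret == (int)len)
			break;
		if (ret > 0 && *(buf + ret - 1) == '\n')
			break;
		total_len += ret;
		buf += ret;
		len -= ret;
	}
	return total_len + ret;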

>
>> +		if (ret == -1) {
>> +			if (errno == EINTR)
>> +				continue;
>> +			return ret;
>> +		}
>> +		total_len += ret;
>> +		buf += ret;
>> +		len -= ret;
>> +	}
>> +	return total_len + ret;
>> +}
>> +
> [snip]
>
>> +
>> +static void
>> +qtest_handle_one_message(struct qtest_session *s, char *buf)
>> +{
>> +	int ret;
>> +
>> +	if (strncmp(buf, interrupt_message, strlen(interrupt_message)) == 0) {
>> +		if (rte_atomic16_read(&s->enable_intr) == 0)
>> +			return;
>> +
>> +		/* relay interrupt to pipe */
>> +		ret = write(s->irqfds.writefd, "1", 1);
>> +		if (ret < 0)
>> +			rte_panic("cannot relay interrupt\n");
>> +	} else {
>> +		/* relay normal message to pipe */
>> +		ret = qtest_raw_send(s->msgfds.writefd, buf, strlen(buf));
>> +		if (ret < 0)
>> +			rte_panic("cannot relay normal message\n");
>> +	}
>> +}
>> +
>> +static char *
>> +qtest_get_next_message(char *p)
>> +{
>> +	p = strchr(p, '\n');
>> +	if ((p == NULL) || (*(p + 1) == '\0'))
>> +		return NULL;
>> +	return p + 1;
>> +}
>> +
>> +static void
>> +qtest_close_one_socket(int *fd)
>> +{
>> +	if (*fd > 0) {
>> +		close(*fd);
>> +		*fd = -1;
>> +	}
>> +}
>> +
>> +static void
>> +qtest_close_sockets(struct qtest_session *s)
>> +{
>> +	qtest_close_one_socket(&s->qtest_socket);
>> +	qtest_close_one_socket(&s->msgfds.readfd);
>> +	qtest_close_one_socket(&s->msgfds.writefd);
>> +	qtest_close_one_socket(&s->irqfds.readfd);
>> +	qtest_close_one_socket(&s->irqfds.writefd);
>> +	qtest_close_one_socket(&s->ivshmem_socket);
>> +}
>> +
>> +/*
>> + * This thread relays QTest response using pipe.
>> + * The function is needed because we need to separate IRQ message from others.
>> + */
>> +static void *
>> +qtest_event_handler(void *data) {
>> +	struct qtest_session *s = (struct qtest_session *)data;
>> +	char buf[1024];
>> +	char *p;
>> +	int ret;
>> +
>> +	for (;;) {
>> +		memset(buf, 0, sizeof(buf));
>> +		ret = qtest_raw_recv(s->qtest_socket, buf, sizeof(buf));
>> +		if (ret < 0) {
>> +			qtest_close_sockets(s);
>> +			return NULL;
>> +		}
>> +
>> +		/* may receive multiple messages at the same time */
> From the qtest_raw_recv implementation, if at some point one message is
> received by two qtest_raw_recv calls, then is that message discarded?
> We could save the last incomplete message in buffer, and combine the
> message received next time together.

I guess we don't lose replies from QEMU.
Please let me describe it in more detail.

According to the qtest specification, after sending a message, we need
to receive a reply like below.
APP: ---command---> QEMU
APP: <-----------OK---- QEMU

But, to handle interrupt messages, we need to take care of the below case.
APP: ---command---> QEMU
APP: <---interrupt---- QEMU
APP: <-----------OK---- QEMU

Also, we need to handle the case where multiple threads try to send a
qtest message.
Anyway, here is the current implementation.

So far, we have 3 types of sockets.
1. socket for qtest messaging.
2. socket for relaying normal messages.
3. socket for relaying interrupt messages.

About the read direction:
The qtest socket is only read by "qtest_event_handler". The handler may
receive multiple messages at once.
In that case, the handler splits the messages and sends them to the
normal message socket or the interrupt message socket.

About the write direction:
The qtest socket is written to by the below functions.
 - qtest_raw_in/out
 - qtest_raw_read/write
But all functions that use the above functions need to take a mutex
before sending messages.
So no messaging will be overlapped, and only one thread will read the
socket for relaying normal messages.
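
In other words, every caller of the raw helpers follows a pattern like
this (the lock name is only illustrative):

	pthread_mutex_lock(&s->qtest_session_lock);	/* hypothetical lock */
	qtest_raw_send(s->qtest_socket, cmd, strlen(cmd));
	/* the reply is relayed by qtest_event_handler to the message pipe */
	qtest_raw_recv(s->msgfds.readfd, reply, sizeof(reply));
	pthread_mutex_unlock(&s->qtest_session_lock);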

Tetsuya
  
Tetsuya Mukawa Jan. 28, 2016, 2:47 a.m. UTC | #16
On 2016/01/28 1:45, Xie, Huawei wrote:
> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>> +qtest_find_pci_device(struct qtest_session *s, uint16_t bus, uint8_t device)
>> +{
>> +	struct qtest_pci_device *dev;
>> +	uint32_t val;
>> +
>> +	val = qtest_pci_inl(s, bus, device, 0, 0);
>> +	TAILQ_FOREACH(dev, &s->head, next) {
>> +		if (val == ((uint32_t)dev->device_id << 16 | dev->vendor_id)) {
>> +			dev->bus_addr = bus;
>> +			dev->device_addr = device;
>> +			return;
>> +		}
>> +
>> +	}
>> +}
>> +
>> +static int
>> +qtest_init_pci_devices(struct qtest_session *s)
>> +{
>> +	struct qtest_pci_device *dev;
>> +	uint16_t bus;
>> +	uint8_t device;
>> +	int ret;
>> +
>> +	/* Find devices */
>> +	bus = 0;
>> +	do {
>> +		device = 0;
>> +		do {
>> +			qtest_find_pci_device(s, bus, device);
>> +		} while (device++ != NB_DEVICE - 1);
>> +	} while (bus++ != NB_BUS - 1);
> Seems this scan of all the pci devices is very time consuming operation,
> and each scan involves socket communication.
> Do you measure how long it takes to do the pci devices initialization?

I measured it, and it seems to take 0.35 seconds in my environment.
This is done only once, when the port is initialized. Probably it's
not so heavy.

Tetsuya
  
Huawei Xie Jan. 28, 2016, 6:15 a.m. UTC | #17
On 1/28/2016 10:48 AM, Tetsuya Mukawa wrote:
> I measured it, and seems it takes 0.35 seconds in my environment.
> This will be done only once when the port is initialized. Probably it's
> not so heady.

There is a 256 x 32 loop of PCI scanning. That is too long if we
dynamically start/tear down containers; otherwise it is OK. Some people
are struggling to reduce VM boot time from seconds to milliseconds to
compete with container technology. Let us consider whether we could
optimize this.
For example, QEMU supports specifying the bus/dev for a device on its
command line, so could we assign a fixed bus for the virtio-net and
ivshmem devices? And for piix3, is it on bus 0/1?
  
Tetsuya Mukawa Jan. 28, 2016, 6:29 a.m. UTC | #18
On 2016/01/28 15:15, Xie, Huawei wrote:
> On 1/28/2016 10:48 AM, Tetsuya Mukawa wrote:
>> I measured it, and seems it takes 0.35 seconds in my environment.
>> This will be done only once when the port is initialized. Probably it's
>> not so heady.
> There are 256 x 32 loop of pci scan. That is too long if we dynamically
> start/tear down the container, otherwise it is ok. Some people are
> struggling reducing the VM booting time from seconds to milliseconds to
> compete with container technology. Let us consider if we could optimize
> this.
> For example, QEMU supports specifying bus/dev for a device in its
> commandline, so could we assign fixed bus for virtio-net and ivshm
> device? And for piix3, is it on bus 0/1?
>

OK, I understand the necessity. Let's consider it.
So far, the user doesn't need to specify a pci address on the QEMU
command line or in the DPDK vdev option.
But let's change this, then we can remove this loop.

Probably specifying the pci address in the vdev option will not be
mandatory; if it isn't specified, just using a default value is fine.
I will fix it like the above in the next release.
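
For example, the devices could be pinned to fixed slots on the QEMU
command line (illustrative addresses):

	-device virtio-net-pci,netdev=net0,mq=on,addr=3 \
	-device ivshmem,size=1G,chardev=chr1,vectors=1,addr=4

Then the PMD would only probe those known slots (or take the addresses
from the vdev option) instead of scanning the whole bus.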

Tetsuya
  
Huawei Xie Jan. 28, 2016, 9:48 a.m. UTC | #19
On 1/28/2016 10:47 AM, Tetsuya Mukawa wrote:
> On 2016/01/28 0:58, Xie, Huawei wrote:
>> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>> [snip]
>>> +
>>> +static int
>>> +qtest_raw_recv(int fd, char *buf, size_t count)
>>> +{
>>> +	size_t len = count;
>>> +	size_t total_len = 0;
>>> +	int ret = 0;
>>> +
>>> +	while (len > 0) {
>>> +		ret = read(fd, buf, len);
>>> +		if (ret == (int)len)
>>> +			break;
>>> +		if (*(buf + ret - 1) == '\n')
>>> +			break;
>> The above two lines should be put after the below if block.
> Yes, it should be so.
>
>>> +		if (ret == -1) {
>>> +			if (errno == EINTR)
>>> +				continue;
>>> +			return ret;
>>> +		}
>>> +		total_len += ret;
>>> +		buf += ret;
>>> +		len -= ret;
>>> +	}
>>> +	return total_len + ret;
>>> +}
>>> +
>> [snip]
>>
>>> +
>>> +static void
>>> +qtest_handle_one_message(struct qtest_session *s, char *buf)
>>> +{
>>> +	int ret;
>>> +
>>> +	if (strncmp(buf, interrupt_message, strlen(interrupt_message)) == 0) {
>>> +		if (rte_atomic16_read(&s->enable_intr) == 0)
>>> +			return;
>>> +
>>> +		/* relay interrupt to pipe */
>>> +		ret = write(s->irqfds.writefd, "1", 1);
>>> +		if (ret < 0)
>>> +			rte_panic("cannot relay interrupt\n");
>>> +	} else {
>>> +		/* relay normal message to pipe */
>>> +		ret = qtest_raw_send(s->msgfds.writefd, buf, strlen(buf));
>>> +		if (ret < 0)
>>> +			rte_panic("cannot relay normal message\n");
>>> +	}
>>> +}
>>> +
>>> +static char *
>>> +qtest_get_next_message(char *p)
>>> +{
>>> +	p = strchr(p, '\n');
>>> +	if ((p == NULL) || (*(p + 1) == '\0'))
>>> +		return NULL;
>>> +	return p + 1;
>>> +}
>>> +
>>> +static void
>>> +qtest_close_one_socket(int *fd)
>>> +{
>>> +	if (*fd > 0) {
>>> +		close(*fd);
>>> +		*fd = -1;
>>> +	}
>>> +}
>>> +
>>> +static void
>>> +qtest_close_sockets(struct qtest_session *s)
>>> +{
>>> +	qtest_close_one_socket(&s->qtest_socket);
>>> +	qtest_close_one_socket(&s->msgfds.readfd);
>>> +	qtest_close_one_socket(&s->msgfds.writefd);
>>> +	qtest_close_one_socket(&s->irqfds.readfd);
>>> +	qtest_close_one_socket(&s->irqfds.writefd);
>>> +	qtest_close_one_socket(&s->ivshmem_socket);
>>> +}
>>> +
>>> +/*
>>> + * This thread relays QTest response using pipe.
>>> + * The function is needed because we need to separate IRQ message from others.
>>> + */
>>> +static void *
>>> +qtest_event_handler(void *data) {
>>> +	struct qtest_session *s = (struct qtest_session *)data;
>>> +	char buf[1024];
>>> +	char *p;
>>> +	int ret;
>>> +
>>> +	for (;;) {
>>> +		memset(buf, 0, sizeof(buf));
>>> +		ret = qtest_raw_recv(s->qtest_socket, buf, sizeof(buf));
>>> +		if (ret < 0) {
>>> +			qtest_close_sockets(s);
>>> +			return NULL;
>>> +		}
>>> +
>>> +		/* may receive multiple messages at the same time */
>> From the qtest_raw_recv implementation, if at some point one message is
>> received by two qtest_raw_recv calls, then is that message discarded?
>> We could save the last incomplete message in buffer, and combine the
>> message received next time together.
> I guess we don't lose replies from QEMU.
> Please let me describe more.
>
> According to the qtest specification, after sending a message, we need
> to receive a reply like below.
> APP: ---command---> QEMU
> APP: <-----------OK---- QEMU
>
> But, to handle interrupt message, we need to take care below case.
> APP: ---command---> QEMU
> APP: <---interrupt---- QEMU
> APP: <-----------OK---- QEMU
>
> Also, we need to handle a case like multiple threads tries to send a
> qtest message.
> Anyway, here is current implementation.
>
> So far, we have 3 types of sockets.
> 1. socket for qtest messaging.
> 2. socket for relaying normal message.
> 3. socket for relaying interrupt message.
>
> About read direction:
> The qtest socket is only read by "qtest_event_handler". The handler may
> receive multiple messages at once.

I think there are two assumptions here: that all messages end with "\n",
and that sizeof(buf) can hold the maximum total length of all the
messages QEMU could send at one time.
Otherwise, in the last read call of qtest_raw_recv, you might receive
only part of a message.
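
A minimal sketch of the buffering idea (hypothetical helper, bounds
checks omitted): keep the tail after the last '\n' in a per-session carry
buffer and prepend it to the next read:

	#include <string.h>
	#include <unistd.h>

	static char carry[1024];	/* per-session in real code */
	static size_t carry_len;	/* bytes left from the previous read */

	static int
	qtest_recv_messages(int fd, char *buf, size_t size)
	{
		char *last_nl;
		int ret;

		memcpy(buf, carry, carry_len);		/* prepend the leftover */
		ret = read(fd, buf + carry_len, size - carry_len - 1);
		if (ret <= 0)
			return ret;
		ret += carry_len;
		buf[ret] = '\0';

		last_nl = strrchr(buf, '\n');
		if (last_nl == NULL) {
			/* no complete message yet, keep it all for next time */
			memcpy(carry, buf, ret);
			carry_len = ret;
			return 0;
		}
		/* keep the incomplete tail, hand back only complete messages */
		carry_len = ret - (last_nl + 1 - buf);
		memcpy(carry, last_nl + 1, carry_len);
		*(last_nl + 1) = '\0';
		return last_nl + 1 - buf;
	}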

> In the case,  the handler split messages, and send it to normal message
> socket or interrupt message socket.
>
> About write direction:
> The qtest socket will be written by below functions.
>  - qtest_raw_in/out
>  - qtest_raw_read/write
> But all functions that use above functions need to have mutex before
> sending messages.
> So all messaging will not be overlapped, then only one thread will read
> the socket for relaying normal message.
>
> Tetsuya
>
  
Tetsuya Mukawa Jan. 28, 2016, 9:53 a.m. UTC | #20
On 2016/01/28 18:48, Xie, Huawei wrote:
> On 1/28/2016 10:47 AM, Tetsuya Mukawa wrote:
>> On 2016/01/28 0:58, Xie, Huawei wrote:
>>> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>>> [snip]
>>>> +
>>>> +static int
>>>> +qtest_raw_recv(int fd, char *buf, size_t count)
>>>> +{
>>>> +	size_t len = count;
>>>> +	size_t total_len = 0;
>>>> +	int ret = 0;
>>>> +
>>>> +	while (len > 0) {
>>>> +		ret = read(fd, buf, len);
>>>> +		if (ret == (int)len)
>>>> +			break;
>>>> +		if (*(buf + ret - 1) == '\n')
>>>> +			break;
>>> The above two lines should be put after the below if block.
>> Yes, it should be so.
>>
>>>> +		if (ret == -1) {
>>>> +			if (errno == EINTR)
>>>> +				continue;
>>>> +			return ret;
>>>> +		}
>>>> +		total_len += ret;
>>>> +		buf += ret;
>>>> +		len -= ret;
>>>> +	}
>>>> +	return total_len + ret;
>>>> +}
>>>> +
>>> [snip]
>>>
>>>> +
>>>> +static void
>>>> +qtest_handle_one_message(struct qtest_session *s, char *buf)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	if (strncmp(buf, interrupt_message, strlen(interrupt_message)) == 0) {
>>>> +		if (rte_atomic16_read(&s->enable_intr) == 0)
>>>> +			return;
>>>> +
>>>> +		/* relay interrupt to pipe */
>>>> +		ret = write(s->irqfds.writefd, "1", 1);
>>>> +		if (ret < 0)
>>>> +			rte_panic("cannot relay interrupt\n");
>>>> +	} else {
>>>> +		/* relay normal message to pipe */
>>>> +		ret = qtest_raw_send(s->msgfds.writefd, buf, strlen(buf));
>>>> +		if (ret < 0)
>>>> +			rte_panic("cannot relay normal message\n");
>>>> +	}
>>>> +}
>>>> +
>>>> +static char *
>>>> +qtest_get_next_message(char *p)
>>>> +{
>>>> +	p = strchr(p, '\n');
>>>> +	if ((p == NULL) || (*(p + 1) == '\0'))
>>>> +		return NULL;
>>>> +	return p + 1;
>>>> +}
>>>> +
>>>> +static void
>>>> +qtest_close_one_socket(int *fd)
>>>> +{
>>>> +	if (*fd > 0) {
>>>> +		close(*fd);
>>>> +		*fd = -1;
>>>> +	}
>>>> +}
>>>> +
>>>> +static void
>>>> +qtest_close_sockets(struct qtest_session *s)
>>>> +{
>>>> +	qtest_close_one_socket(&s->qtest_socket);
>>>> +	qtest_close_one_socket(&s->msgfds.readfd);
>>>> +	qtest_close_one_socket(&s->msgfds.writefd);
>>>> +	qtest_close_one_socket(&s->irqfds.readfd);
>>>> +	qtest_close_one_socket(&s->irqfds.writefd);
>>>> +	qtest_close_one_socket(&s->ivshmem_socket);
>>>> +}
>>>> +
>>>> +/*
>>>> + * This thread relays QTest response using pipe.
>>>> + * The function is needed because we need to separate IRQ message from others.
>>>> + */
>>>> +static void *
>>>> +qtest_event_handler(void *data) {
>>>> +	struct qtest_session *s = (struct qtest_session *)data;
>>>> +	char buf[1024];
>>>> +	char *p;
>>>> +	int ret;
>>>> +
>>>> +	for (;;) {
>>>> +		memset(buf, 0, sizeof(buf));
>>>> +		ret = qtest_raw_recv(s->qtest_socket, buf, sizeof(buf));
>>>> +		if (ret < 0) {
>>>> +			qtest_close_sockets(s);
>>>> +			return NULL;
>>>> +		}
>>>> +
>>>> +		/* may receive multiple messages at the same time */
>>> From the qtest_raw_recv implementation, if at some point one message is
>>> received by two qtest_raw_recv calls, then is that message discarded?
>>> We could save the last incomplete message in buffer, and combine the
>>> message received next time together.
>> I guess we don't lose replies from QEMU.
>> Please let me describe more.
>>
>> According to the qtest specification, after sending a message, we need
>> to receive a reply like below.
>> APP: ---command---> QEMU
>> APP: <-----------OK---- QEMU
>>
>> But, to handle interrupt message, we need to take care below case.
>> APP: ---command---> QEMU
>> APP: <---interrupt---- QEMU
>> APP: <-----------OK---- QEMU
>>
>> Also, we need to handle a case like multiple threads tries to send a
>> qtest message.
>> Anyway, here is current implementation.
>>
>> So far, we have 3 types of sockets.
>> 1. socket for qtest messaging.
>> 2. socket for relaying normal message.
>> 3. socket for relaying interrupt message.
>>
>> About read direction:
>> The qtest socket is only read by "qtest_event_handler". The handler may
>> receive multiple messages at once.
> I think there are two assumptions that all messages are ended with "\n"
> and the sizeof(buf) could hold the maximum length of sum of all multiple
> messages that QEMU could send at one time.
> Otherwise in the last read call of qtest_raw_receive, you might receive
> only part of the a message.

I've got your point. I will fix the above case.

Thanks,
Tetsuya
  
Huawei Xie Jan. 29, 2016, 8:56 a.m. UTC | #21
On 1/28/2016 10:44 AM, Tetsuya Mukawa wrote:
> On 2016/01/27 19:03, Xie, Huawei wrote:
>> On 1/21/2016 7:09 PM, Tetsuya Mukawa wrote:
>>> +	/* Set BAR region */
>>> +	for (i = 0; i < NB_BAR; i++) {
>>> +		switch (dev->bar[i].type) {
>>> +		case QTEST_PCI_BAR_IO:
>>> +		case QTEST_PCI_BAR_MEMORY_UNDER_1MB:
>>> +		case QTEST_PCI_BAR_MEMORY_32:
>>> +			qtest_pci_outl(s, bus, device, 0, dev->bar[i].addr,
>>> +				dev->bar[i].region_start);
>>> +			PMD_DRV_LOG(INFO, "Set BAR of %s device: 0x%lx - 0x%lx\n",
>>> +				dev->name, dev->bar[i].region_start,
>>> +				dev->bar[i].region_start + dev->bar[i].region_size);
>>> +			break;
>>> +		case QTEST_PCI_BAR_MEMORY_64:
>>> +			qtest_pci_outq(s, bus, device, 0, dev->bar[i].addr,
>>> +				dev->bar[i].region_start);
>>> +			PMD_DRV_LOG(INFO, "Set BAR of %s device: 0x%lx - 0x%lx\n",
>>> +				dev->name, dev->bar[i].region_start,
>>> +				dev->bar[i].region_start + dev->bar[i].region_size);
>>> +			break;
>> Hasn't the bar resource already been allocated? Is it the app's
>> responsibility to allocate the bar resource in qtest mode? The app
>> couldn't have that knowledge.
> Yes. In qtest mode, the app should register above values.
> (Without it, default values are 0)
> Usually, this will be done by BIOS or uEFI. But in qtest mode, these
> will not be invoked.
> So we need to define above values, and also need to enable PCI devices.
>
> In this release, I just register hard coded values except for one of
> ivshmem BAR.
> In next release, I will describe memory map in comment.

I think ideally this app should do the whole PCI system initialization
(and also the north/south bridge) using a DFS algorithm on behalf of the
BIOS, to allocate resources for all bridges and devices.
Otherwise, since QEMU follows the hardware platform's behavior to route
MMIO/IO accesses, if we only allocate resources for part of the devices,
the transactions could not be routed correctly.
Anyway, we shouldn't worry about that. It is OK as long as it works.


>
> Tetsuya
>
>
  
Yuanhan Liu Jan. 29, 2016, 8:57 a.m. UTC | #22
On Thu, Jan 21, 2016 at 08:07:58PM +0900, Tetsuya Mukawa wrote:
> +static int
> +virt_read_pci_cfg(struct virtio_hw *hw, void *buf, size_t len, off_t offset)
> +{
> +	qtest_read_pci_cfg(hw, "virtio-net", buf, len, offset);
> +	return 0;
> +}
> +
> +static void *
> +virt_get_mapped_addr(struct virtio_hw *hw, uint8_t bar,
> +		     uint32_t offset, uint32_t length)
> +{
> +	uint64_t base;
> +	uint64_t size;
> +
> +	if (qtest_get_bar_size(hw, "virtio-net", bar, &size) < 0) {
> +		PMD_INIT_LOG(ERR, "invalid bar: %u", bar);
> +		return NULL;
> +	}
> +
> +	if (offset + length < offset) {
> +		PMD_INIT_LOG(ERR, "offset(%u) + lenght(%u) overflows",
> +			offset, length);
> +		return NULL;
> +	}
> +
> +	if (offset + length > size) {
> +		PMD_INIT_LOG(ERR,
> +			"invalid cap: overflows bar space: %u > %"PRIu64,
> +			offset + length, size);
> +		return NULL;
> +	}
> +
> +	if (qtest_get_bar_addr(hw, "virtio-net", bar, &base) < 0) {
> +		PMD_INIT_LOG(ERR, "invalid bar: %u", bar);
> +		return NULL;
> +	}

So, I understand the usage now, and the cfg_ops abstraction doesn't look
good or necessary to me.  For an EAL-managed pci device, the bar length
and addr are stored in mem_resource[], and for your case, they come from
qtest. And given that it's a compile-time decision, I'd like it to be:

    #ifdef /* RTE_LIBRTE_VIRTIO_HOST_MODE */
    
    static uint32_t
    get_bar_size(...)
    {
    	return qtest_get_bar_size(..);
    }
    
    static uint64-t
    get_bar_addr(...)
    {
    	return qtest_get_bar_addr(..);
    }
    
    ...
    ...
    
    #else
    
    static uint32_t
    get_bar_size(...)
    {
    	return dev->mem_resource[bar].len;
    }
    
    ...
    
    }
    #endif


And then you just need to do the related changes at virtio_read_caps() and
get_cfg_addr(). That'd be much simpler, without introducing duplicate
code or unnecessary complexity.

What do you think of that?

	--yliu

> +
> +	return (void *)(base + offset);
> +}
> +
> +static const struct virtio_pci_cfg_ops virt_cfg_ops = {
> +	.map			= virt_map_pci_cfg,
> +	.unmap			= virt_unmap_pci_cfg,
> +	.get_mapped_addr	= virt_get_mapped_addr,
> +	.read			= virt_read_pci_cfg,
> +};
> +#endif /* RTE_LIBRTE_VIRTIO_HOST_MODE */
  
Yuanhan Liu Jan. 29, 2016, 9:13 a.m. UTC | #23
On Fri, Jan 29, 2016 at 04:57:23PM +0800, Yuanhan Liu wrote:
> On Thu, Jan 21, 2016 at 08:07:58PM +0900, Tetsuya Mukawa wrote:
> > +static int
> > +virt_read_pci_cfg(struct virtio_hw *hw, void *buf, size_t len, off_t offset)
> > +{
> > +	qtest_read_pci_cfg(hw, "virtio-net", buf, len, offset);
> > +	return 0;
> > +}
> > +
> > +static void *
> > +virt_get_mapped_addr(struct virtio_hw *hw, uint8_t bar,
> > +		     uint32_t offset, uint32_t length)
> > +{
> > +	uint64_t base;
> > +	uint64_t size;
> > +
> > +	if (qtest_get_bar_size(hw, "virtio-net", bar, &size) < 0) {
> > +		PMD_INIT_LOG(ERR, "invalid bar: %u", bar);
> > +		return NULL;
> > +	}
> > +
> > +	if (offset + length < offset) {
> > +		PMD_INIT_LOG(ERR, "offset(%u) + lenght(%u) overflows",
> > +			offset, length);
> > +		return NULL;
> > +	}
> > +
> > +	if (offset + length > size) {
> > +		PMD_INIT_LOG(ERR,
> > +			"invalid cap: overflows bar space: %u > %"PRIu64,
> > +			offset + length, size);
> > +		return NULL;
> > +	}
> > +
> > +	if (qtest_get_bar_addr(hw, "virtio-net", bar, &base) < 0) {
> > +		PMD_INIT_LOG(ERR, "invalid bar: %u", bar);
> > +		return NULL;
> > +	}
> 
> So, I understood the usage now, and the cfg_ops abstraction doesn't look
> good yet necessary to me.  For EAL managed pci device, bar length and
> addr are stored at memory_resources[], and for your case, it's from the
> qtest. And judging that it's compile time decision, I'd like it to be:
> 
>     #ifdef /* RTE_LIBRTE_VIRTIO_HOST_MODE */

Oops, sorry, I was wrong. Your code could co-exist with the
traditional virtio pmd driver, thus we can't do that.

But still, I think a dynamic "if ... else ..." would be better:
there are just a few places (maybe 4: bar_size, bar length, map
device, read config) that need it.


On the other hand, if you really want to do that abstraction,
you should go for finer granularity, such as the following
methods I proposed, instead of the big one: get_cfg_addr(). In
that way, we could avoid duplicate code.

	--yliu

>     
>     static uint32_t
>     get_bar_size(...)
>     {
>     	return qtest_get_bar_size(..);
>     }
>     
>     static uint64-t
>     get_bar_addr(...)
>     {
>     	return qtest_get_bar_addr(..);
>     }
>     
>     ...
>     ...
>     
>     #else
>     
>     static  uint32_t
>     get_bar_size(...)
>     {
>     	return dev->mem_resource[bar].addr;
>     }
>     
>     ...
>     
>     }
>     #endif
> 
> 
> And then you just need do related changes at virtio_read_caps() and
> get_cfg_addr(). That'd be much simpler, without introducing duplicate
> code and uncessary complex.
> 
> What do you think of that?
> 
> 	--yliu
> 
> > +
> > +	return (void *)(base + offset);
> > +}
> > +
> > +static const struct virtio_pci_cfg_ops virt_cfg_ops = {
> > +	.map			= virt_map_pci_cfg,
> > +	.unmap			= virt_unmap_pci_cfg,
> > +	.get_mapped_addr	= virt_get_mapped_addr,
> > +	.read			= virt_read_pci_cfg,
> > +};
> > +#endif /* RTE_LIBRTE_VIRTIO_HOST_MODE */
  
Tetsuya Mukawa Feb. 1, 2016, 1:49 a.m. UTC | #24
On 2016/01/29 18:13, Yuanhan Liu wrote:
> On Fri, Jan 29, 2016 at 04:57:23PM +0800, Yuanhan Liu wrote:
>> On Thu, Jan 21, 2016 at 08:07:58PM +0900, Tetsuya Mukawa wrote:
>>> +static int
>>> +virt_read_pci_cfg(struct virtio_hw *hw, void *buf, size_t len, off_t offset)
>>> +{
>>> +	qtest_read_pci_cfg(hw, "virtio-net", buf, len, offset);
>>> +	return 0;
>>> +}
>>> +
>>> +static void *
>>> +virt_get_mapped_addr(struct virtio_hw *hw, uint8_t bar,
>>> +		     uint32_t offset, uint32_t length)
>>> +{
>>> +	uint64_t base;
>>> +	uint64_t size;
>>> +
>>> +	if (qtest_get_bar_size(hw, "virtio-net", bar, &size) < 0) {
>>> +		PMD_INIT_LOG(ERR, "invalid bar: %u", bar);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	if (offset + length < offset) {
>>> +		PMD_INIT_LOG(ERR, "offset(%u) + lenght(%u) overflows",
>>> +			offset, length);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	if (offset + length > size) {
>>> +		PMD_INIT_LOG(ERR,
>>> +			"invalid cap: overflows bar space: %u > %"PRIu64,
>>> +			offset + length, size);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	if (qtest_get_bar_addr(hw, "virtio-net", bar, &base) < 0) {
>>> +		PMD_INIT_LOG(ERR, "invalid bar: %u", bar);
>>> +		return NULL;
>>> +	}
>> So, I understood the usage now, and the cfg_ops abstraction doesn't look
>> good yet necessary to me.  For EAL managed pci device, bar length and
>> addr are stored at memory_resources[], and for your case, it's from the
>> qtest. And judging that it's compile time decision, I'd like it to be:
>>
>>     #ifdef /* RTE_LIBRTE_VIRTIO_HOST_MODE */
> Oops, sorry, I was wrong. Your code could be co-exist with the
> traditional virtio pmd driver, thus we can't do that.
>
> But still, I think dynamic "if ... else ..." should be better:
> there are just few places (maybe 4: bar_size, bar length, map
> device, read config) need that.

Thanks for the comments.
I will use "if ... else ..." instead of introducing a cfg_ops.

Tetsuya


>
> On the other hand, if you really want to do that abstraction,
> you should go it with more fine granularity, such as the following
> methods I proposed, instead of the big one: get_cfg_addr(). In
> that way, we could avoid duplicate code.
>
> 	--yliu
>
>>     
>>     static uint32_t
>>     get_bar_size(...)
>>     {
>>     	return qtest_get_bar_size(..);
>>     }
>>     
>>     static uint64-t
>>     get_bar_addr(...)
>>     {
>>     	return qtest_get_bar_addr(..);
>>     }
>>     
>>     ...
>>     ...
>>     
>>     #else
>>     
>>     static  uint32_t
>>     get_bar_size(...)
>>     {
>>     	return dev->mem_resource[bar].addr;
>>     }
>>     
>>     ...
>>     
>>     }
>>     #endif
>>
>>
>> And then you just need to do related changes at virtio_read_caps() and
>> get_cfg_addr(). That'd be much simpler, without introducing duplicate
>> code and unnecessary complexity.
>>
>> What do you think of that?
>>
>> 	--yliu
>>
>>> +
>>> +	return (void *)(base + offset);
>>> +}
>>> +
>>> +static const struct virtio_pci_cfg_ops virt_cfg_ops = {
>>> +	.map			= virt_map_pci_cfg,
>>> +	.unmap			= virt_unmap_pci_cfg,
>>> +	.get_mapped_addr	= virt_get_mapped_addr,
>>> +	.read			= virt_read_pci_cfg,
>>> +};
>>> +#endif /* RTE_LIBRTE_VIRTIO_HOST_MODE */
  
Tetsuya Mukawa Feb. 10, 2016, 3:40 a.m. UTC | #25
The patches work on top of the patch series below.
 - [PATCH v2 0/5] virtio support for container
 - [PATCH 0/4] rework ioport access for virtio

[Changes]
v2 changes:
 - Rebase on the above patch series.
 - Rebase on master
 - Add "--qtest-virtio" EAL option.
 - Fixes in qtest.c
  - Fix error handling for the case where the qtest connection is closed.
  - Use eventfd for interrupt messaging.
  - Use linux header for PCI register definitions.
  - Fix qtest_raw_send/recv to handle error correctly.
  - Fix bit mask of PCI_CONFIG_ADDR.
  - Describe memory and ioport usage of qtest guest in qtest.c
  - Remove loop that is for finding PCI devices.


[Abstraction]

Normally, the virtio-net PMD only works in a VM, because there is no virtio-net device on the host.
This patch series extends the virtio-net PMD so that it can also work on the host as a virtual PMD.
However, we didn't implement a virtio-net device as part of the virtio-net PMD itself.
To prepare a virtio-net device for the PMD, start a QEMU process in the special QTest mode, then connect to it from the virtio-net PMD through a unix domain socket.

The PMD can connect to any backend the QEMU virtio-net device can.
For example, the PMD can connect to the vhost-net kernel module or a vhost-user backend application.
As with the virtio-net PMD in a VM, the memory of the application that uses the virtio-net PMD is shared with the vhost backend application.
The vhost backend application's memory, however, is not shared.

The main targets of this PMD are containers such as docker, rkt and lxc.
The related processes (the virtio-net PMD process, QEMU and the vhost-user backend process) can be isolated by containers.
However, a shared directory is needed so that they can communicate through the unix domain sockets.


[How to use]

 Please use QEMU-2.5.1 or above.
 (So far, QEMU-2.5.1 hasn't been released yet, so please check out master from the QEMU repository.)

 - Compile
 Set "CONFIG_RTE_VIRTIO_VDEV_QTEST=y" in config/common_linux.
 Then compile it.

 - Start QEMU like below.
 $ qemu-system-x86_64 \
              -machine pc-i440fx-1.4,accel=qtest \
              -display none -qtest-log /dev/null \
              -qtest unix:/tmp/socket,server \
              -netdev type=tap,script=/etc/qemu-ifup,id=net0,queues=1 \
              -device virtio-net-pci,netdev=net0,mq=on,disable-modern=false,addr=3 \
              -chardev socket,id=chr1,path=/tmp/ivshmem,server \
              -device ivshmem,size=1G,chardev=chr1,vectors=1,addr=4

 - Start DPDK application like below
 $ testpmd -c f -n 1 -m 1024 --no-pci --single-file --qtest-virtio \
             --vdev="eth_qtest_virtio0,qtest=/tmp/socket,ivshmem=/tmp/ivshmem"\
             -- --disable-hw-vlan --txqflags=0xf00 -i

(*1) Please specify the same memory size in the QEMU and DPDK command lines.
(*2) Use qemu-2.5.1 or above.
(*3) One QEMU process is needed per port.
(*4) Only virtio-1.0 devices are supported.
(*5) The vhost backends like vhost-net and vhost-user can be specified.
(*6) In most cases, just using the above command is enough, but you can also
     specify other QEMU virtio-net options.
(*7) Only the "pc-i440fx-1.4" machine has been checked, but others may work.
     It depends on whether the machine has a piix3 south bridge.
     If it doesn't, the virtio-net PMD cannot receive status changed
     interrupts.
(*8) Do not add "--enable-kvm" to the QEMU command line.


[Detailed Description]

 - virtio-net device implementation
The PMD uses the QEMU virtio-net device. To do that, the QEMU QTest functionality is used.
QTest is a test framework for QEMU devices. It allows us to implement a device driver outside of QEMU.
With QTest, we can run the DPDK application and virtio-net PMD as a standalone process on the host.
When QEMU is invoked in QTest mode, no guest code runs.
To know more about QTest, see below.
http://wiki.qemu.org/Features/QTest
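
For example, probing the virtio-net device in PCI slot 3 over the qtest socket is roughly the following exchange (the replies are what qtest_raw_in()/qtest_raw_read() parse; see qemu/qtest.c for the exact grammar):

  outl 0xcf8 0x80001800      <- select bus 0, device 3, function 0, offset 0
  OK                         <- reply from QEMU
  inl 0xcfc                  <- read CONFIG_DATA
  OK 0x10001af4              <- device id 0x1000 / vendor id 0x1af4 (virtio-net)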

 - probing devices
QTest provides a unix domain socket. Through this socket, the driver process can access the I/O ports and memory of the QEMU virtual machine.
The PMD sends I/O port accesses over this socket to probe the PCI devices.
If the virtio-net and ivshmem devices are found, they are initialized.
Also, the I/O port accesses issued by the virtio-net PMD are sent through the socket, so the virtio-net PMD can initialize the virtio-net device on QEMU correctly.
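
In the PMD this is the classic 0xcf8/0xcfc PCI configuration mechanism driven over the qtest socket. A trimmed-down view of what qtest_find_pci_device() in qtest.c does (locking and error handling omitted):

  /* select bus/device, function 0, offset 0, then read 32 bits of config space */
  qtest_raw_out(s, 0xcf8, PCI_CONFIG_ADDR(bus, device, 0, 0), 'l');
  val = qtest_raw_in(s, 0xcfc, 'l');
  if (val == ((uint32_t)VIRTIO_NET_DEVICE_ID << 16 | VIRTIO_NET_VENDOR_ID)) {
          /* virtio-net found: remember where it lives */
          dev->bus_addr = bus;
          dev->device_addr = device;
  }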

 - ivshmem device to share memory
To share the memory that the virtio-net PMD process uses, the ivshmem device is used.
Because the ivshmem device can only handle one file descriptor, the shared memory should consist of one file.
To allocate such memory, EAL has a new option called "--single-file".
Also, the hugepages should be mapped between "1 << 31" and "1 << 44".
To map them like that, EAL has one more new option called "--qtest-virtio".
While initializing the ivshmem device, we can set its BAR (Base Address Register).
It determines at which address the QEMU vcpu accesses this shared memory.
We specify the host virtual address of the shared memory as this address.
This is very useful because we don't need to patch QEMU to calculate an address offset.
(For example, if the virtio-net PMD process allocates memory from the shared memory and then writes its virtual address to a virtio-net register, the QEMU virtio-net device can use it without calculating an address offset.)
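
In qtest.c this amounts to roughly the following when the ivshmem device is registered (see qtest_register_target_devices()):

  /* point BAR2 of ivshmem at the host virtual address of the single
   * EAL memory segment, so no address translation is needed */
  ms = rte_eal_get_physmem_layout();
  ivshmem->bar[2].addr = REG_ADDR_BAR2;
  ivshmem->bar[2].type = QTEST_PCI_BAR_MEMORY_64;
  ivshmem->bar[2].region_start = (uint64_t)ms[0].addr;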

Tetsuya Mukawa (5):
  virtio: Retrieve driver name from eth_dev
  EAL: Add new EAL "--qtest-virtio" option
  vhost: Add a function to check virtio device type
  virtio: Add support for qtest virtio-net PMD
  docs: add release note for qtest virtio container support

 config/common_linuxapp                     |    1 +
 doc/guides/rel_notes/release_2_3.rst       |    3 +
 drivers/net/virtio/Makefile                |    4 +
 drivers/net/virtio/qtest.c                 | 1342 ++++++++++++++++++++++++++++
 drivers/net/virtio/qtest.h                 |   65 ++
 drivers/net/virtio/virtio_ethdev.c         |  433 ++++++++-
 drivers/net/virtio/virtio_ethdev.h         |   32 +
 drivers/net/virtio/virtio_pci.c            |  364 +++++++-
 drivers/net/virtio/virtio_pci.h            |    5 +-
 lib/librte_eal/common/eal_common_options.c |   10 +
 lib/librte_eal/common/eal_internal_cfg.h   |    1 +
 lib/librte_eal/common/eal_options.h        |    2 +
 lib/librte_eal/linuxapp/eal/eal_memory.c   |   81 +-
 13 files changed, 2274 insertions(+), 69 deletions(-)
 create mode 100644 drivers/net/virtio/qtest.c
 create mode 100644 drivers/net/virtio/qtest.h
  

Patch

diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74bc515..04682f6 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -269,6 +269,7 @@  CONFIG_RTE_LIBRTE_PMD_SZEDATA2=n
 # Compile burst-oriented VIRTIO PMD driver
 #
 CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
+CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE=y
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_INIT=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
 CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 43835ba..697e629 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -52,6 +52,10 @@  SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
 
+ifeq ($(CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE),y)
+	SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += qtest.c
+endif
+
 # this lib depends upon:
 DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
 DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_mempool lib/librte_mbuf
diff --git a/drivers/net/virtio/qtest.c b/drivers/net/virtio/qtest.c
new file mode 100644
index 0000000..717bee9
--- /dev/null
+++ b/drivers/net/virtio/qtest.c
@@ -0,0 +1,1237 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2015 IGEL Co., Ltd. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of IGEL Co., Ltd. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <sys/queue.h>
+#include <signal.h>
+#include <pthread.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+#include <rte_memory.h>
+#include <rte_malloc.h>
+#include <rte_common.h>
+#include <rte_interrupts.h>
+
+#include "virtio_pci.h"
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+
+#define NB_BUS                          256
+#define NB_DEVICE                       32
+#define NB_BAR                          6
+
+/* PCI common configuration registers */
+#define REG_ADDR_VENDOR_ID              0x0
+#define REG_ADDR_DEVICE_ID              0x2
+#define REG_ADDR_COMMAND                0x4
+#define REG_ADDR_STATUS                 0x6
+#define REG_ADDR_REVISION_ID            0x8
+#define REG_ADDR_CLASS_CODE             0x9
+#define REG_ADDR_CACHE_LINE_S           0xc
+#define REG_ADDR_LAT_TIMER              0xd
+#define REG_ADDR_HEADER_TYPE            0xe
+#define REG_ADDR_BIST                   0xf
+#define REG_ADDR_BAR0                   0x10
+#define REG_ADDR_BAR1                   0x14
+#define REG_ADDR_BAR2                   0x18
+#define REG_ADDR_BAR3                   0x1c
+#define REG_ADDR_BAR4                   0x20
+#define REG_ADDR_BAR5                   0x24
+
+/* PCI common configuration register values */
+#define REG_VAL_COMMAND_IO              0x1
+#define REG_VAL_COMMAND_MEMORY          0x2
+#define REG_VAL_COMMAND_MASTER          0x4
+#define REG_VAL_HEADER_TYPE_ENDPOINT    0x0
+#define REG_VAL_BAR_MEMORY              0x0
+#define REG_VAL_BAR_IO                  0x1
+#define REG_VAL_BAR_LOCATE_32           0x0
+#define REG_VAL_BAR_LOCATE_UNDER_1MB    0x2
+#define REG_VAL_BAR_LOCATE_64           0x4
+
+/* PIIX3 configuration registers */
+#define PIIX3_REG_ADDR_PIRQA            0x60
+#define PIIX3_REG_ADDR_PIRQB            0x61
+#define PIIX3_REG_ADDR_PIRQC            0x62
+#define PIIX3_REG_ADDR_PIRQD            0x63
+
+/* Device information */
+#define VIRTIO_NET_DEVICE_ID            0x1000
+#define VIRTIO_NET_VENDOR_ID            0x1af4
+#define VIRTIO_NET_IO_START             0xc000
+#define VIRTIO_NET_MEMORY1_START	0x1000000000
+#define VIRTIO_NET_MEMORY2_START	0x2000000000
+#define VIRTIO_NET_IRQ_NUM              10
+#define IVSHMEM_DEVICE_ID               0x1110
+#define IVSHMEM_VENDOR_ID               0x1af4
+#define IVSHMEM_MEMORY_START            0x3000000000
+#define IVSHMEM_PROTOCOL_VERSION        0
+#define PIIX3_DEVICE_ID                 0x7000
+#define PIIX3_VENDOR_ID                 0x8086
+
+#define PCI_CONFIG_ADDR(_bus, _device, _function, _offset) ( \
+	(1 << 31) | ((_bus) & 0xff) << 16 | ((_device) & 0x1f) << 11 | \
+	((_function) & 0x7) << 8 | ((_offset) & 0xfc))
+
+static char interrupt_message[32];
+
+enum qtest_pci_bar_type {
+	QTEST_PCI_BAR_DISABLE = 0,
+	QTEST_PCI_BAR_IO,
+	QTEST_PCI_BAR_MEMORY_UNDER_1MB,
+	QTEST_PCI_BAR_MEMORY_32,
+	QTEST_PCI_BAR_MEMORY_64
+};
+
+struct qtest_pci_bar {
+	enum qtest_pci_bar_type type;
+	uint8_t addr;
+	uint64_t region_start;
+	uint64_t region_size;
+};
+
+struct qtest_session;
+TAILQ_HEAD(qtest_pci_device_list, qtest_pci_device);
+struct qtest_pci_device {
+	TAILQ_ENTRY(qtest_pci_device) next;
+	const char *name;
+	uint16_t device_id;
+	uint16_t vendor_id;
+	uint8_t bus_addr;
+	uint8_t device_addr;
+	struct qtest_pci_bar bar[NB_BAR];
+	int (*init)(struct qtest_session *s, struct qtest_pci_device *dev);
+};
+
+union qtest_pipefds {
+	struct {
+		int pipefd[2];
+	};
+	struct {
+		int readfd;
+		int writefd;
+	};
+};
+
+struct qtest_session {
+	int qtest_socket;
+	pthread_mutex_t qtest_session_lock;
+
+	struct qtest_pci_device_list head;
+	int ivshmem_socket;
+
+	pthread_t event_th;
+	union qtest_pipefds msgfds;
+
+	pthread_t intr_th;
+	union qtest_pipefds irqfds;
+	rte_atomic16_t enable_intr;
+	rte_intr_callback_fn cb;
+	void *cb_arg;
+};
+
+static int
+qtest_raw_send(int fd, char *buf, size_t count)
+{
+	size_t len = count;
+	size_t total_len = 0;
+	int ret = 0;
+
+	while (len > 0) {
+		ret = write(fd, buf, len);
+		if (ret == (int)len)
+			break;
+		if (ret == -1) {
+			if (errno == EINTR)
+				continue;
+			return ret;
+		}
+		total_len += ret;
+		buf += ret;
+		len -= ret;
+	}
+	return total_len + ret;
+}
+
+static int
+qtest_raw_recv(int fd, char *buf, size_t count)
+{
+	size_t len = count;
+	size_t total_len = 0;
+	int ret = 0;
+
+	while (len > 0) {
+		ret = read(fd, buf, len);
+		if (ret == -1) {
+			if (errno == EINTR)
+				continue;
+			return ret;
+		}
+		if (ret == 0)
+			break;
+		if (ret == (int)len || *(buf + ret - 1) == '\n')
+			break;
+		total_len += ret;
+		buf += ret;
+		len -= ret;
+	}
+	return total_len + ret;
+}
+
+/*
+ * For the QTest protocol specification, see the QEMU source code below.
+ *  - qemu/qtest.c
+ * If qtest socket is closed, qtest_raw_in and qtest_raw_read will return 0.
+ */
+static uint32_t
+qtest_raw_in(struct qtest_session *s, uint16_t addr, char type)
+{
+	char buf[1024];
+	int ret;
+
+	if ((type != 'l') && (type != 'w') && (type != 'b'))
+		rte_panic("Invalid value\n");
+
+	snprintf(buf, sizeof(buf), "in%c 0x%x\n", type, addr);
+	/* write to qtest socket */
+	ret = qtest_raw_send(s->qtest_socket, buf, strlen(buf));
+	/* read reply from event handler */
+	ret = qtest_raw_recv(s->msgfds.readfd, buf, sizeof(buf));
+	if (ret < 0)
+		return 0;
+
+	buf[ret] = '\0';
+	return strtoul(buf + strlen("OK "), NULL, 16);
+}
+
+static void
+qtest_raw_out(struct qtest_session *s, uint16_t addr, uint32_t val, char type)
+{
+	char buf[1024];
+	int ret __rte_unused;
+
+	if ((type != 'l') && (type != 'w') && (type != 'b'))
+		rte_panic("Invalid value\n");
+
+	snprintf(buf, sizeof(buf), "out%c 0x%x 0x%x\n", type, addr, val);
+	/* write to qtest socket */
+	ret = qtest_raw_send(s->qtest_socket, buf, strlen(buf));
+	/* read reply from event handler */
+	ret = qtest_raw_recv(s->msgfds.readfd, buf, sizeof(buf));
+}
+
+static uint32_t
+qtest_raw_read(struct qtest_session *s, uint64_t addr, char type)
+{
+	char buf[1024];
+	int ret;
+
+	if ((type != 'l') && (type != 'w') && (type != 'b'))
+		rte_panic("Invalid value\n");
+
+	snprintf(buf, sizeof(buf), "read%c 0x%lx\n", type, addr);
+	/* write to qtest socket */
+	ret = qtest_raw_send(s->qtest_socket, buf, strlen(buf));
+	/* read reply from event handler */
+	ret = qtest_raw_recv(s->msgfds.readfd, buf, sizeof(buf));
+	if (ret < 0)
+		return 0;
+
+	buf[ret] = '\0';
+	return strtoul(buf + strlen("OK "), NULL, 16);
+}
+
+static void
+qtest_raw_write(struct qtest_session *s, uint64_t addr, uint32_t val, char type)
+{
+	char buf[1024];
+	int ret __rte_unused;
+
+	if ((type != 'l') && (type != 'w') && (type != 'b'))
+		rte_panic("Invalid value\n");
+
+	snprintf(buf, sizeof(buf), "write%c 0x%lx 0x%x\n", type, addr, val);
+	/* write to qtest socket */
+	ret = qtest_raw_send(s->qtest_socket, buf, strlen(buf));
+	/* read reply from event handler */
+	ret = qtest_raw_recv(s->msgfds.readfd, buf, sizeof(buf));
+}
+
+/*
+ * qtest_pci_inX/outX are used for accessing PCI configuration space.
+ * The functions are implemented based on PCI configuration space
+ * specification.
+ * According to the spec, the access size of read()/write() should be 4 bytes.
+ */
+static int
+qtest_pci_inb(struct qtest_session *s, uint8_t bus, uint8_t device,
+		uint8_t function, uint8_t offset)
+{
+	uint32_t tmp;
+
+	tmp = PCI_CONFIG_ADDR(bus, device, function, offset);
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	qtest_raw_out(s, 0xcf8, tmp, 'l');
+	tmp = qtest_raw_in(s, 0xcfc, 'l');
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+
+	return (tmp >> ((offset & 0x3) * 8)) & 0xff;
+}
+
+static void
+qtest_pci_outb(struct qtest_session *s, uint8_t bus, uint8_t device,
+		uint8_t function, uint8_t offset, uint8_t value)
+{
+	uint32_t addr, tmp, pos;
+
+	addr = PCI_CONFIG_ADDR(bus, device, function, offset);
+	pos = (offset % 4) * 8;
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	qtest_raw_out(s, 0xcf8, addr, 'l');
+	tmp = qtest_raw_in(s, 0xcfc, 'l');
+	tmp = (tmp & ~(0xff << pos)) | (value << pos);
+
+	qtest_raw_out(s, 0xcf8, addr, 'l');
+	qtest_raw_out(s, 0xcfc, tmp, 'l');
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+}
+
+static uint32_t
+qtest_pci_inl(struct qtest_session *s, uint8_t bus, uint8_t device,
+		uint8_t function, uint8_t offset)
+{
+	uint32_t tmp;
+
+	tmp = PCI_CONFIG_ADDR(bus, device, function, offset);
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	qtest_raw_out(s, 0xcf8, tmp, 'l');
+	tmp = qtest_raw_in(s, 0xcfc, 'l');
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+
+	return tmp;
+}
+
+static void
+qtest_pci_outl(struct qtest_session *s, uint8_t bus, uint8_t device,
+		uint8_t function, uint8_t offset, uint32_t value)
+{
+	uint32_t tmp;
+
+	tmp = PCI_CONFIG_ADDR(bus, device, function, offset);
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	qtest_raw_out(s, 0xcf8, tmp, 'l');
+	qtest_raw_out(s, 0xcfc, value, 'l');
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+}
+
+static uint64_t
+qtest_pci_inq(struct qtest_session *s, uint8_t bus, uint8_t device,
+		uint8_t function, uint8_t offset)
+{
+	uint32_t tmp;
+	uint64_t val;
+
+	tmp = PCI_CONFIG_ADDR(bus, device, function, offset);
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	qtest_raw_out(s, 0xcf8, tmp, 'l');
+	val = (uint64_t)qtest_raw_in(s, 0xcfc, 'l');
+
+	tmp = PCI_CONFIG_ADDR(bus, device, function, offset + 4);
+
+	qtest_raw_out(s, 0xcf8, tmp, 'l');
+	val |= (uint64_t)qtest_raw_in(s, 0xcfc, 'l') << 32;
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+
+	return val;
+}
+
+static void
+qtest_pci_outq(struct qtest_session *s, uint8_t bus, uint8_t device,
+		uint8_t function, uint8_t offset, uint64_t value)
+{
+	uint32_t tmp;
+
+	tmp = PCI_CONFIG_ADDR(bus, device, function, offset);
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	qtest_raw_out(s, 0xcf8, tmp, 'l');
+	qtest_raw_out(s, 0xcfc, (uint32_t)(value & 0xffffffff), 'l');
+
+	tmp = PCI_CONFIG_ADDR(bus, device, function, offset + 4);
+
+	qtest_raw_out(s, 0xcf8, tmp, 'l');
+	qtest_raw_out(s, 0xcfc, (uint32_t)(value >> 32), 'l');
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+}
+
+/*
+ * qtest_in/out are used for accessing ioport of qemu guest.
+ * qtest_read/write are used for accessing memory of qemu guest.
+ */
+uint32_t
+qtest_in(struct virtio_hw *hw, uint16_t addr, char type)
+{
+	struct qtest_session *s = (struct qtest_session *)hw->qsession;
+	uint32_t val;
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	val = qtest_raw_in(s, addr, type);
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+
+	return val;
+}
+
+void
+qtest_out(struct virtio_hw *hw, uint16_t addr, uint64_t val, char type)
+{
+	struct qtest_session *s = (struct qtest_session *)hw->qsession;
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	qtest_raw_out(s, addr, val, type);
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+}
+
+uint32_t
+qtest_read(struct virtio_hw *hw, uint64_t addr, char type)
+{
+	struct qtest_session *s = (struct qtest_session *)hw->qsession;
+	uint32_t val;
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	val = qtest_raw_read(s, addr, type);
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+
+	return val;
+}
+
+void
+qtest_write(struct virtio_hw *hw, uint64_t addr, uint64_t val, char type)
+{
+	struct qtest_session *s = (struct qtest_session *)hw->qsession;
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	qtest_raw_write(s, addr, val, type);
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+}
+
+static struct qtest_pci_device *
+qtest_find_device(struct qtest_session *s, const char *name)
+{
+	struct qtest_pci_device *dev;
+
+	TAILQ_FOREACH(dev, &s->head, next) {
+		if (strcmp(dev->name, name) == 0)
+			return dev;
+	}
+	return NULL;
+}
+
+/*
+ * The function is used for reading the PCI configuration space of the specified device.
+ */
+int
+qtest_read_pci_cfg(struct virtio_hw *hw, const char *name,
+		void *buf, size_t len, off_t offset)
+{
+	struct qtest_session *s = (struct qtest_session *)hw->qsession;
+	struct qtest_pci_device *dev;
+	uint32_t i;
+	uint8_t *p = buf;
+
+	dev = qtest_find_device(s, name);
+	if (dev == NULL) {
+		PMD_DRV_LOG(ERR, "Cannot find specified device: %s\n", name);
+		return -1;
+	}
+
+	for (i = 0; i < len; i++) {
+		*(p + i) = qtest_pci_inb(s,
+				dev->bus_addr, dev->device_addr, 0, offset + i);
+	}
+
+	return 0;
+}
+
+static struct qtest_pci_bar *
+qtest_get_bar(struct virtio_hw *hw, const char *name, uint8_t bar)
+{
+	struct qtest_session *s = (struct qtest_session *)hw->qsession;
+	struct qtest_pci_device *dev;
+
+	if (bar >= NB_BAR) {
+		PMD_DRV_LOG(ERR, "Invalid bar is specified: %u\n", bar);
+		return NULL;
+	}
+
+	dev = qtest_find_device(s, name);
+	if (dev == NULL) {
+		PMD_DRV_LOG(ERR, "Cannot find specified device: %s\n", name);
+		return NULL;
+	}
+
+	if (dev->bar[bar].type == QTEST_PCI_BAR_DISABLE) {
+		PMD_DRV_LOG(ERR, "Cannot find valid BAR(%s): %u\n", name, bar);
+		return NULL;
+	}
+
+	return &dev->bar[bar];
+}
+
+int
+qtest_get_bar_addr(struct virtio_hw *hw, const char *name,
+		uint8_t bar, uint64_t *addr)
+{
+	struct qtest_pci_bar *bar_ptr;
+
+	bar_ptr = qtest_get_bar(hw, name, bar);
+	if (bar_ptr == NULL)
+		return -1;
+
+	*addr = bar_ptr->region_start;
+	return 0;
+}
+
+int
+qtest_get_bar_size(struct virtio_hw *hw, const char *name,
+		uint8_t bar, uint64_t *size)
+{
+	struct qtest_pci_bar *bar_ptr;
+
+	bar_ptr = qtest_get_bar(hw, name, bar);
+	if (bar_ptr == NULL)
+		return -1;
+
+	*size = bar_ptr->region_size;
+	return 0;
+}
+
+int
+qtest_intr_enable(void *data)
+{
+	struct virtio_hw *hw = ((struct rte_eth_dev_data *)data)->dev_private;
+	struct qtest_session *s;
+
+	s = (struct qtest_session *)hw->qsession;
+	rte_atomic16_set(&s->enable_intr, 1);
+
+	return 0;
+}
+
+int
+qtest_intr_disable(void *data)
+{
+	struct virtio_hw *hw = ((struct rte_eth_dev_data *)data)->dev_private;
+	struct qtest_session *s;
+
+	s = (struct qtest_session *)hw->qsession;
+	rte_atomic16_set(&s->enable_intr, 0);
+
+	return 0;
+}
+
+void
+qtest_intr_callback_register(void *data,
+		rte_intr_callback_fn cb, void *cb_arg)
+{
+	struct virtio_hw *hw = ((struct rte_eth_dev_data *)data)->dev_private;
+	struct qtest_session *s;
+
+	s = (struct qtest_session *)hw->qsession;
+	s->cb = cb;
+	s->cb_arg = cb_arg;
+	rte_atomic16_set(&s->enable_intr, 1);
+}
+
+void
+qtest_intr_callback_unregister(void *data,
+		rte_intr_callback_fn cb __rte_unused,
+		void *cb_arg __rte_unused)
+{
+	struct virtio_hw *hw = ((struct rte_eth_dev_data *)data)->dev_private;
+	struct qtest_session *s;
+
+	s = (struct qtest_session *)hw->qsession;
+	rte_atomic16_set(&s->enable_intr, 0);
+	s->cb = NULL;
+	s->cb_arg = NULL;
+}
+
+static void *
+qtest_intr_handler(void *data) {
+	struct qtest_session *s = (struct qtest_session *)data;
+	char buf[1];
+	int ret;
+
+	for (;;) {
+		ret = qtest_raw_recv(s->irqfds.readfd, buf, sizeof(buf));
+		if (ret < 0)
+			return NULL;
+		s->cb(NULL, s->cb_arg);
+	}
+	return NULL;
+}
+
+static int
+qtest_intr_initialize(void *data)
+{
+	struct virtio_hw *hw = ((struct rte_eth_dev_data *)data)->dev_private;
+	struct qtest_session *s;
+	char buf[1024];
+	int ret;
+
+	s = (struct qtest_session *)hw->qsession;
+
+	/* This message will come when interrupt occurs */
+	snprintf(interrupt_message, sizeof(interrupt_message),
+			"IRQ raise %d", VIRTIO_NET_IRQ_NUM);
+
+	snprintf(buf, sizeof(buf), "irq_intercept_in ioapic\n");
+
+	if (pthread_mutex_lock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot lock mutex\n");
+
+	/* To enable interrupt, send "irq_intercept_in" message to QEMU */
+	ret = qtest_raw_send(s->qtest_socket, buf, strlen(buf));
+	if (ret < 0) {
+		pthread_mutex_unlock(&s->qtest_session_lock);
+		return -1;
+	}
+
+	/* just ignore QEMU response */
+	ret = qtest_raw_recv(s->msgfds.readfd, buf, sizeof(buf));
+	if (ret < 0) {
+		pthread_mutex_unlock(&s->qtest_session_lock);
+		return -1;
+	}
+
+	if (pthread_mutex_unlock(&s->qtest_session_lock) < 0)
+		rte_panic("Cannot unlock mutex\n");
+
+	return 0;
+}
+
+static void
+qtest_handle_one_message(struct qtest_session *s, char *buf)
+{
+	int ret;
+
+	if (strncmp(buf, interrupt_message, strlen(interrupt_message)) == 0) {
+		if (rte_atomic16_read(&s->enable_intr) == 0)
+			return;
+
+		/* relay interrupt to pipe */
+		ret = write(s->irqfds.writefd, "1", 1);
+		if (ret < 0)
+			rte_panic("cannot relay interrupt\n");
+	} else {
+		/* relay normal message to pipe */
+		ret = qtest_raw_send(s->msgfds.writefd, buf, strlen(buf));
+		if (ret < 0)
+			rte_panic("cannot relay normal message\n");
+	}
+}
+
+static char *
+qtest_get_next_message(char *p)
+{
+	p = strchr(p, '\n');
+	if ((p == NULL) || (*(p + 1) == '\0'))
+		return NULL;
+	return p + 1;
+}
+
+static void
+qtest_close_one_socket(int *fd)
+{
+	if (*fd > 0) {
+		close(*fd);
+		*fd = -1;
+	}
+}
+
+static void
+qtest_close_sockets(struct qtest_session *s)
+{
+	qtest_close_one_socket(&s->qtest_socket);
+	qtest_close_one_socket(&s->msgfds.readfd);
+	qtest_close_one_socket(&s->msgfds.writefd);
+	qtest_close_one_socket(&s->irqfds.readfd);
+	qtest_close_one_socket(&s->irqfds.writefd);
+	qtest_close_one_socket(&s->ivshmem_socket);
+}
+
+/*
+ * This thread relays QTest response using pipe.
+ * The function is needed because we need to separate IRQ message from others.
+ */
+static void *
+qtest_event_handler(void *data) {
+	struct qtest_session *s = (struct qtest_session *)data;
+	char buf[1024];
+	char *p;
+	int ret;
+
+	for (;;) {
+		memset(buf, 0, sizeof(buf));
+		ret = qtest_raw_recv(s->qtest_socket, buf, sizeof(buf));
+		if (ret < 0) {
+			qtest_close_sockets(s);
+			return NULL;
+		}
+
+		/* may receive multiple messages at the same time */
+		p = buf;
+		do {
+			qtest_handle_one_message(s, p);
+		} while ((p = qtest_get_next_message(p)) != NULL);
+	}
+	return NULL;
+}
+
+static int
+qtest_init_piix3_device(struct qtest_session *s, struct qtest_pci_device *dev)
+{
+	uint8_t bus, device, virtio_net_slot = 0;
+	struct qtest_pci_device *tmpdev;
+	uint8_t pcislot2regaddr[] = {	0xff,
+					0xff,
+					0xff,
+					PIIX3_REG_ADDR_PIRQC,
+					PIIX3_REG_ADDR_PIRQD,
+					PIIX3_REG_ADDR_PIRQA,
+					PIIX3_REG_ADDR_PIRQB};
+
+	bus = dev->bus_addr;
+	device = dev->device_addr;
+
+	PMD_DRV_LOG(INFO,
+		"Found %s on virtual PCI bus: %04x:%02x:00.0\n",
+		dev->name, bus, device);
+
+	/* Get slot id that is connected to virtio-net */
+	TAILQ_FOREACH(tmpdev, &s->head, next) {
+		if (strcmp(tmpdev->name, "virtio-net") == 0) {
+			virtio_net_slot = tmpdev->device_addr;
+			break;
+		}
+	}
+
+	if (virtio_net_slot == 0)
+		return -1;
+
+	/*
+	 * Set interrupt routing for virtio-net device.
+	 * Here are the i440fx/piix3 connection settings
+	 * ---------------------------------------
+	 * PCI Slot3 -> PIRQC
+	 * PCI Slot4 -> PIRQD
+	 * PCI Slot5 -> PIRQA
+	 * PCI Slot6 -> PIRQB
+	 */
+	if (pcislot2regaddr[virtio_net_slot] != 0xff) {
+		qtest_pci_outb(s, bus, device, 0,
+				pcislot2regaddr[virtio_net_slot],
+				VIRTIO_NET_IRQ_NUM);
+	}
+
+	return 0;
+}
+
+/*
+ * Common initialization of PCI device.
+ * To know detail, see pci specification.
+ */
+static int
+qtest_init_pci_device(struct qtest_session *s, struct qtest_pci_device *dev)
+{
+	uint8_t i, bus, device;
+	uint32_t val;
+	uint64_t val64;
+
+	bus = dev->bus_addr;
+	device = dev->device_addr;
+
+	PMD_DRV_LOG(INFO,
+		"Found %s on virtual PCI bus: %04x:%02x:00.0\n",
+		dev->name, bus, device);
+
+	/* Check header type */
+	val = qtest_pci_inb(s, bus, device, 0, REG_ADDR_HEADER_TYPE);
+	if (val != REG_VAL_HEADER_TYPE_ENDPOINT) {
+		PMD_DRV_LOG(ERR, "Unexpected header type %d\n", val);
+		return -1;
+	}
+
+	/* Check BAR type */
+	for (i = 0; i < NB_BAR; i++) {
+		val = qtest_pci_inl(s, bus, device, 0, dev->bar[i].addr);
+
+		switch (dev->bar[i].type) {
+		case QTEST_PCI_BAR_IO:
+			if ((val & 0x1) != REG_VAL_BAR_IO)
+				dev->bar[i].type = QTEST_PCI_BAR_DISABLE;
+			break;
+		case QTEST_PCI_BAR_MEMORY_UNDER_1MB:
+			if ((val & 0x1) != REG_VAL_BAR_MEMORY)
+				dev->bar[i].type = QTEST_PCI_BAR_DISABLE;
+			if ((val & 0x6) != REG_VAL_BAR_LOCATE_UNDER_1MB)
+				dev->bar[i].type = QTEST_PCI_BAR_DISABLE;
+			break;
+		case QTEST_PCI_BAR_MEMORY_32:
+			if ((val & 0x1) != REG_VAL_BAR_MEMORY)
+				dev->bar[i].type = QTEST_PCI_BAR_DISABLE;
+			if ((val & 0x6) != REG_VAL_BAR_LOCATE_32)
+				dev->bar[i].type = QTEST_PCI_BAR_DISABLE;
+			break;
+		case QTEST_PCI_BAR_MEMORY_64:
+
+			if ((val & 0x1) != REG_VAL_BAR_MEMORY)
+				dev->bar[i].type = QTEST_PCI_BAR_DISABLE;
+			if ((val & 0x6) != REG_VAL_BAR_LOCATE_64)
+				dev->bar[i].type = QTEST_PCI_BAR_DISABLE;
+			break;
+		case QTEST_PCI_BAR_DISABLE:
+			break;
+		}
+	}
+
+	/* Enable device */
+	val = qtest_pci_inl(s, bus, device, 0, REG_ADDR_COMMAND);
+	val |= REG_VAL_COMMAND_IO | REG_VAL_COMMAND_MEMORY | REG_VAL_COMMAND_MASTER;
+	qtest_pci_outl(s, bus, device, 0, REG_ADDR_COMMAND, val);
+
+	/* Calculate BAR size */
+	for (i = 0; i < NB_BAR; i++) {
+		switch (dev->bar[i].type) {
+		case QTEST_PCI_BAR_IO:
+		case QTEST_PCI_BAR_MEMORY_UNDER_1MB:
+		case QTEST_PCI_BAR_MEMORY_32:
+			qtest_pci_outl(s, bus, device, 0,
+					dev->bar[i].addr, 0xffffffff);
+			val = qtest_pci_inl(s, bus, device,
+					0, dev->bar[i].addr);
+			dev->bar[i].region_size = ~(val & 0xfffffff0) + 1;
+			break;
+		case QTEST_PCI_BAR_MEMORY_64:
+			qtest_pci_outq(s, bus, device, 0,
+					dev->bar[i].addr, 0xffffffffffffffff);
+			val64 = qtest_pci_inq(s, bus, device,
+					0, dev->bar[i].addr);
+			dev->bar[i].region_size =
+					~(val64 & 0xfffffffffffffff0) + 1;
+			break;
+		case QTEST_PCI_BAR_DISABLE:
+			break;
+		}
+	}
+
+	/* Set BAR region */
+	for (i = 0; i < NB_BAR; i++) {
+		switch (dev->bar[i].type) {
+		case QTEST_PCI_BAR_IO:
+		case QTEST_PCI_BAR_MEMORY_UNDER_1MB:
+		case QTEST_PCI_BAR_MEMORY_32:
+			qtest_pci_outl(s, bus, device, 0, dev->bar[i].addr,
+				dev->bar[i].region_start);
+			PMD_DRV_LOG(INFO, "Set BAR of %s device: 0x%lx - 0x%lx\n",
+				dev->name, dev->bar[i].region_start,
+				dev->bar[i].region_start + dev->bar[i].region_size);
+			break;
+		case QTEST_PCI_BAR_MEMORY_64:
+			qtest_pci_outq(s, bus, device, 0, dev->bar[i].addr,
+				dev->bar[i].region_start);
+			PMD_DRV_LOG(INFO, "Set BAR of %s device: 0x%lx - 0x%lx\n",
+				dev->name, dev->bar[i].region_start,
+				dev->bar[i].region_start + dev->bar[i].region_size);
+			break;
+		case QTEST_PCI_BAR_DISABLE:
+			break;
+		}
+	}
+
+	return 0;
+}
+
+static void
+qtest_find_pci_device(struct qtest_session *s, uint16_t bus, uint8_t device)
+{
+	struct qtest_pci_device *dev;
+	uint32_t val;
+
+	val = qtest_pci_inl(s, bus, device, 0, 0);
+	TAILQ_FOREACH(dev, &s->head, next) {
+		if (val == ((uint32_t)dev->device_id << 16 | dev->vendor_id)) {
+			dev->bus_addr = bus;
+			dev->device_addr = device;
+			return;
+		}
+
+	}
+}
+
+static int
+qtest_init_pci_devices(struct qtest_session *s)
+{
+	struct qtest_pci_device *dev;
+	uint16_t bus;
+	uint8_t device;
+	int ret;
+
+	/* Find devices */
+	bus = 0;
+	do {
+		device = 0;
+		do {
+			qtest_find_pci_device(s, bus, device);
+		} while (device++ != NB_DEVICE - 1);
+	} while (bus++ != NB_BUS - 1);
+
+	/* Initialize devices */
+	TAILQ_FOREACH(dev, &s->head, next) {
+		ret = dev->init(s, dev);
+		if (ret != 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+struct rte_pci_id
+qtest_get_pci_id_of_virtio_net(void)
+{
+	struct rte_pci_id id =  {VIRTIO_NET_DEVICE_ID,
+		VIRTIO_NET_VENDOR_ID, PCI_ANY_ID, PCI_ANY_ID};
+
+	return id;
+}
+
+static int
+qtest_register_target_devices(struct qtest_session *s)
+{
+	struct qtest_pci_device *virtio_net, *ivshmem, *piix3;
+	const struct rte_memseg *ms;
+
+	ms = rte_eal_get_physmem_layout();
+	/* if the EAL memory size isn't a power of 2, ivshmem will refuse it */
+	if ((ms[0].len & (ms[0].len - 1)) != 0) {
+		PMD_DRV_LOG(ERR, "memory size must be power of 2\n");
+		return -1;
+	}
+
+	virtio_net = malloc(sizeof(*virtio_net));
+	if (virtio_net == NULL)
+		return -1;
+
+	ivshmem = malloc(sizeof(*ivshmem));
+	if (ivshmem == NULL)
+		return -1;
+
+	piix3 = malloc(sizeof(*piix3));
+	if (piix3 == NULL)
+		return -1;
+
+	memset(virtio_net, 0, sizeof(*virtio_net));
+	memset(ivshmem, 0, sizeof(*ivshmem));
+
+	TAILQ_INIT(&s->head);
+
+	virtio_net->name = "virtio-net";
+	virtio_net->device_id = VIRTIO_NET_DEVICE_ID;
+	virtio_net->vendor_id = VIRTIO_NET_VENDOR_ID;
+	virtio_net->init = qtest_init_pci_device;
+	virtio_net->bar[0].addr = REG_ADDR_BAR0;
+	virtio_net->bar[0].type = QTEST_PCI_BAR_IO;
+	virtio_net->bar[0].region_start = VIRTIO_NET_IO_START;
+	virtio_net->bar[1].addr = REG_ADDR_BAR1;
+	virtio_net->bar[1].type = QTEST_PCI_BAR_MEMORY_32;
+	virtio_net->bar[1].region_start = VIRTIO_NET_MEMORY1_START;
+	virtio_net->bar[4].addr = REG_ADDR_BAR4;
+	virtio_net->bar[4].type = QTEST_PCI_BAR_MEMORY_64;
+	virtio_net->bar[4].region_start = VIRTIO_NET_MEMORY2_START;
+	TAILQ_INSERT_TAIL(&s->head, virtio_net, next);
+
+	ivshmem->name = "ivshmem";
+	ivshmem->device_id = IVSHMEM_DEVICE_ID;
+	ivshmem->vendor_id = IVSHMEM_VENDOR_ID;
+	ivshmem->init = qtest_init_pci_device;
+	ivshmem->bar[0].addr = REG_ADDR_BAR0;
+	ivshmem->bar[0].type = QTEST_PCI_BAR_MEMORY_32;
+	ivshmem->bar[0].region_start = IVSHMEM_MEMORY_START;
+	ivshmem->bar[2].addr = REG_ADDR_BAR2;
+	ivshmem->bar[2].type = QTEST_PCI_BAR_MEMORY_64;
+	/* In host mode, only one memory segment is valid */
+	ivshmem->bar[2].region_start = (uint64_t)ms[0].addr;
+	TAILQ_INSERT_TAIL(&s->head, ivshmem, next);
+
+	/* piix3 is needed to route irqs from virtio-net to ioapic */
+	piix3->name = "piix3";
+	piix3->device_id = PIIX3_DEVICE_ID;
+	piix3->vendor_id = PIIX3_VENDOR_ID;
+	piix3->init = qtest_init_piix3_device;
+	TAILQ_INSERT_TAIL(&s->head, piix3, next);
+
+	return 0;
+}
+
+static int
+qtest_send_message_to_ivshmem(int sock_fd, uint64_t client_id, int shm_fd)
+{
+	struct iovec iov;
+	struct msghdr msgh;
+	size_t fdsize = sizeof(int);
+	char control[CMSG_SPACE(fdsize)];
+	struct cmsghdr *cmsg;
+	int ret;
+
+	memset(&msgh, 0, sizeof(msgh));
+	iov.iov_base = &client_id;
+	iov.iov_len = sizeof(client_id);
+
+	msgh.msg_iov = &iov;
+	msgh.msg_iovlen = 1;
+
+	if (shm_fd >= 0) {
+		msgh.msg_control = &control;
+		msgh.msg_controllen = sizeof(control);
+		cmsg = CMSG_FIRSTHDR(&msgh);
+		cmsg->cmsg_len = CMSG_LEN(fdsize);
+		cmsg->cmsg_level = SOL_SOCKET;
+		cmsg->cmsg_type = SCM_RIGHTS;
+		memcpy(CMSG_DATA(cmsg), &shm_fd, fdsize);
+	}
+
+	do {
+		ret = sendmsg(sock_fd, &msgh, 0);
+	} while (ret < 0 && errno == EINTR);
+
+	if (ret < 0) {
+		PMD_DRV_LOG(ERR, "sendmsg error\n");
+		return ret;
+	}
+
+	return ret;
+}
+
+static int
+qtest_setup_shared_memory(struct qtest_session *s)
+{
+	int shm_fd, ret;
+
+	rte_memseg_info_get(0, &shm_fd, NULL, NULL);
+
+	/* send our protocol version first */
+	ret = qtest_send_message_to_ivshmem(s->ivshmem_socket,
+			IVSHMEM_PROTOCOL_VERSION, -1);
+	if (ret < 0) {
+		PMD_DRV_LOG(ERR,
+			"Failed to send protocol version to ivshmem\n");
+		return -1;
+	}
+
+	/* send client id */
+	ret = qtest_send_message_to_ivshmem(s->ivshmem_socket, 0, -1);
+	if (ret < 0) {
+		PMD_DRV_LOG(ERR, "Failed to send VMID to ivshmem\n");
+		return -1;
+	}
+
+	/* send message to ivshmem */
+	ret = qtest_send_message_to_ivshmem(s->ivshmem_socket, -1, shm_fd);
+	if (ret < 0) {
+		PMD_DRV_LOG(ERR, "Failed to send file descriptor to ivshmem\n");
+		return -1;
+	}
+
+	/* close EAL memory again */
+	close(shm_fd);
+
+	return 0;
+}
+
+int
+qtest_vdev_init(struct rte_eth_dev_data *data,
+		int qtest_socket, int ivshmem_socket)
+{
+	struct virtio_hw *hw = ((struct rte_eth_dev_data *)data)->dev_private;
+	struct qtest_session *s;
+	int ret;
+
+	s = rte_zmalloc(NULL, sizeof(*s), RTE_CACHE_LINE_SIZE);
+
+	ret = pipe(s->msgfds.pipefd);
+	if (ret != 0) {
+		PMD_DRV_LOG(ERR, "Failed to initialize message pipe\n");
+		return -1;
+	}
+
+	ret = pipe(s->irqfds.pipefd);
+	if (ret != 0) {
+		PMD_DRV_LOG(ERR, "Failed to initialize irq pipe\n");
+		return -1;
+	}
+
+	ret = qtest_register_target_devices(s);
+	if (ret != 0) {
+		PMD_DRV_LOG(ERR, "Failed to initialize qtest session\n");
+		return -1;
+	}
+
+	ret = pthread_mutex_init(&s->qtest_session_lock, NULL);
+	if (ret != 0) {
+		PMD_DRV_LOG(ERR, "Failed to initialize mutex\n");
+		return -1;
+	}
+
+	rte_atomic16_set(&s->enable_intr, 0);
+	s->qtest_socket = qtest_socket;
+	s->ivshmem_socket = ivshmem_socket;
+	hw->qsession = (void *)s;
+
+	ret = pthread_create(&s->event_th, NULL, qtest_event_handler, s);
+	if (ret != 0) {
+		PMD_DRV_LOG(ERR, "Failed to create event handler\n");
+		return -1;
+	}
+
+	ret = pthread_create(&s->intr_th, NULL, qtest_intr_handler, s);
+	if (ret != 0) {
+		PMD_DRV_LOG(ERR, "Failed to create interrupt handler\n");
+		return -1;
+	}
+
+	ret = qtest_intr_initialize(data);
+	if (ret != 0) {
+		PMD_DRV_LOG(ERR, "Failed to initialize interrupt\n");
+		return -1;
+	}
+
+	ret = qtest_setup_shared_memory(s);
+	if (ret != 0) {
+		PMD_DRV_LOG(ERR, "Failed to setup shared memory\n");
+		return -1;
+	}
+
+	ret = qtest_init_pci_devices(s);
+	if (ret != 0) {
+		PMD_DRV_LOG(ERR, "Failed to initialize devices\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+static void
+qtest_remove_target_devices(struct qtest_session *s)
+{
+	struct qtest_pci_device *dev, *next;
+
+	for (dev = TAILQ_FIRST(&s->head); dev != NULL; dev = next) {
+		next = TAILQ_NEXT(dev, next);
+		TAILQ_REMOVE(&s->head, dev, next);
+		free(dev);
+	}
+}
+
+void
+qtest_vdev_uninit(struct rte_eth_dev_data *data)
+{
+	struct virtio_hw *hw = ((struct rte_eth_dev_data *)data)->dev_private;
+	struct qtest_session *s;
+
+	s = (struct qtest_session *)hw->qsession;
+
+	qtest_close_sockets(s);
+
+	pthread_cancel(s->event_th);
+	pthread_join(s->event_th, NULL);
+
+	pthread_cancel(s->intr_th);
+	pthread_join(s->intr_th, NULL);
+
+	pthread_mutex_destroy(&s->qtest_session_lock);
+
+	qtest_remove_target_devices(s);
+
+	rte_free(s);
+}
diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index c477b05..e32f1dd 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -36,6 +36,10 @@ 
 #include <stdio.h>
 #include <errno.h>
 #include <unistd.h>
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+#include <sys/socket.h>
+#include <sys/un.h>
+#endif
 
 #include <rte_ethdev.h>
 #include <rte_memcpy.h>
@@ -52,6 +56,10 @@ 
 #include <rte_memory.h>
 #include <rte_eal.h>
 #include <rte_dev.h>
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+#include <rte_eal_memconfig.h>
+#include <rte_kvargs.h>
+#endif
 
 #include "virtio_ethdev.h"
 #include "virtio_pci.h"
@@ -160,8 +168,7 @@  virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
 	if ((vq->vq_free_cnt < ((uint32_t)pkt_num + 2)) || (pkt_num < 1))
 		return -1;
 
-	memcpy(vq->virtio_net_hdr_mz->addr, ctrl,
-		sizeof(struct virtio_pmd_ctrl));
+	memcpy(vq->virtio_net_hdr_vaddr, ctrl, sizeof(struct virtio_pmd_ctrl));
 
 	/*
 	 * Format is enforced in qemu code:
@@ -170,14 +177,14 @@  virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
 	 * One RX packet for ACK.
 	 */
 	vq->vq_ring.desc[head].flags = VRING_DESC_F_NEXT;
-	vq->vq_ring.desc[head].addr = vq->virtio_net_hdr_mz->phys_addr;
+	vq->vq_ring.desc[head].addr = vq->virtio_net_hdr_mem;
 	vq->vq_ring.desc[head].len = sizeof(struct virtio_net_ctrl_hdr);
 	vq->vq_free_cnt--;
 	i = vq->vq_ring.desc[head].next;
 
 	for (k = 0; k < pkt_num; k++) {
 		vq->vq_ring.desc[i].flags = VRING_DESC_F_NEXT;
-		vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mz->phys_addr
+		vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mem
 			+ sizeof(struct virtio_net_ctrl_hdr)
 			+ sizeof(ctrl->status) + sizeof(uint8_t)*sum;
 		vq->vq_ring.desc[i].len = dlen[k];
@@ -187,7 +194,7 @@  virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
 	}
 
 	vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
-	vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mz->phys_addr
+	vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mem
 			+ sizeof(struct virtio_net_ctrl_hdr);
 	vq->vq_ring.desc[i].len = sizeof(ctrl->status);
 	vq->vq_free_cnt--;
@@ -232,7 +239,7 @@  virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
 	PMD_INIT_LOG(DEBUG, "vq->vq_free_cnt=%d\nvq->vq_desc_head_idx=%d",
 			vq->vq_free_cnt, vq->vq_desc_head_idx);
 
-	memcpy(&result, vq->virtio_net_hdr_mz->addr,
+	memcpy(&result, vq->virtio_net_hdr_vaddr,
 			sizeof(struct virtio_pmd_ctrl));
 
 	return result.status;
@@ -270,6 +277,9 @@  virtio_dev_queue_release(struct virtqueue *vq) {
 		hw = vq->hw;
 		hw->vtpci_ops->del_queue(hw, vq);
 
+		rte_memzone_free(vq->virtio_net_hdr_mz);
+		rte_memzone_free(vq->mz);
+
 		rte_free(vq->sw_ring);
 		rte_free(vq);
 	}
@@ -366,66 +376,81 @@  int virtio_dev_queue_setup(struct rte_eth_dev *dev,
 		}
 	}
 
-	/*
-	 * Virtio PCI device VIRTIO_PCI_QUEUE_PF register is 32bit,
-	 * and only accepts 32 bit page frame number.
-	 * Check if the allocated physical memory exceeds 16TB.
-	 */
-	if ((mz->phys_addr + vq->vq_ring_size - 1) >> (VIRTIO_PCI_QUEUE_ADDR_SHIFT + 32)) {
-		PMD_INIT_LOG(ERR, "vring address shouldn't be above 16TB!");
-		rte_free(vq);
-		return -ENOMEM;
-	}
-
 	memset(mz->addr, 0, sizeof(mz->len));
 	vq->mz = mz;
-	vq->vq_ring_mem = mz->phys_addr;
 	vq->vq_ring_virt_mem = mz->addr;
-	PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem:      0x%"PRIx64, (uint64_t)mz->phys_addr);
-	PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64, (uint64_t)(uintptr_t)mz->addr);
+
+
+	if (dev->dev_type == RTE_ETH_DEV_PCI) {
+		vq->vq_ring_mem = mz->phys_addr;
+
+		/* Virtio PCI device VIRTIO_PCI_QUEUE_PF register is 32bit,
+		 * and only accepts 32 bit page frame number.
+		 * Check if the allocated physical memory exceeds 16TB.
+		 */
+		uint64_t last_physaddr = vq->vq_ring_mem + vq->vq_ring_size - 1;
+		if (last_physaddr >> (VIRTIO_PCI_QUEUE_ADDR_SHIFT + 32)) {
+			PMD_INIT_LOG(ERR, "vring address shouldn't be above 16TB!");
+			rte_free(vq);
+			return -ENOMEM;
+		}
+	}
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+	else { /* RTE_ETH_DEV_VIRTUAL */
+		/* Use virtual addr to fill!!! */
+		vq->vq_ring_mem = (phys_addr_t)mz->addr;
+
+		/* TODO: check last_physaddr */
+	}
+#endif
+
+	PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem:      0x%"PRIx64,
+			(uint64_t)vq->vq_ring_mem);
+	PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64,
+			(uint64_t)(uintptr_t)vq->vq_ring_virt_mem);
+
 	vq->virtio_net_hdr_mz  = NULL;
 	vq->virtio_net_hdr_mem = 0;
 
+	uint64_t hdr_size = 0;
 	if (queue_type == VTNET_TQ) {
 		/*
 		 * For each xmit packet, allocate a virtio_net_hdr
 		 */
 		snprintf(vq_name, sizeof(vq_name), "port%d_tvq%d_hdrzone",
 			dev->data->port_id, queue_idx);
-		vq->virtio_net_hdr_mz = rte_memzone_reserve_aligned(vq_name,
-			vq_size * hw->vtnet_hdr_size,
-			socket_id, 0, RTE_CACHE_LINE_SIZE);
-		if (vq->virtio_net_hdr_mz == NULL) {
-			if (rte_errno == EEXIST)
-				vq->virtio_net_hdr_mz =
-					rte_memzone_lookup(vq_name);
-			if (vq->virtio_net_hdr_mz == NULL) {
-				rte_free(vq);
-				return -ENOMEM;
-			}
-		}
-		vq->virtio_net_hdr_mem =
-			vq->virtio_net_hdr_mz->phys_addr;
-		memset(vq->virtio_net_hdr_mz->addr, 0,
-			vq_size * hw->vtnet_hdr_size);
+		hdr_size = vq_size * hw->vtnet_hdr_size;
 	} else if (queue_type == VTNET_CQ) {
-		/* Allocate a page for control vq command, data and status */
 		snprintf(vq_name, sizeof(vq_name), "port%d_cvq_hdrzone",
 			dev->data->port_id);
-		vq->virtio_net_hdr_mz = rte_memzone_reserve_aligned(vq_name,
-			PAGE_SIZE, socket_id, 0, RTE_CACHE_LINE_SIZE);
-		if (vq->virtio_net_hdr_mz == NULL) {
+		/* Allocate a page for control vq command, data and status */
+		hdr_size = PAGE_SIZE;
+	}
+
+	if (hdr_size) { /* queue_type is VTNET_TQ or VTNET_CQ */
+		mz = rte_memzone_reserve_aligned(vq_name,
+				hdr_size, socket_id, 0, RTE_CACHE_LINE_SIZE);
+		if (mz == NULL) {
 			if (rte_errno == EEXIST)
-				vq->virtio_net_hdr_mz =
-					rte_memzone_lookup(vq_name);
-			if (vq->virtio_net_hdr_mz == NULL) {
+				mz = rte_memzone_lookup(vq_name);
+			if (mz == NULL) {
 				rte_free(vq);
 				return -ENOMEM;
 			}
 		}
-		vq->virtio_net_hdr_mem =
-			vq->virtio_net_hdr_mz->phys_addr;
-		memset(vq->virtio_net_hdr_mz->addr, 0, PAGE_SIZE);
+		vq->virtio_net_hdr_mz = mz;
+		vq->virtio_net_hdr_vaddr = mz->addr;
+		memset(vq->virtio_net_hdr_vaddr, 0, hdr_size);
+
+		if (dev->dev_type == RTE_ETH_DEV_PCI) {
+			vq->virtio_net_hdr_mem = mz->phys_addr;
+		}
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+		else {
+			/* Use vaddr!!! */
+			vq->virtio_net_hdr_mem = (phys_addr_t)mz->addr;
+		}
+#endif
 	}
 
 	hw->vtpci_ops->setup_queue(hw, vq);
@@ -479,12 +504,18 @@  virtio_dev_close(struct rte_eth_dev *dev)
 	PMD_INIT_LOG(DEBUG, "virtio_dev_close");
 
 	/* reset the NIC */
-	if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+	if (((dev->dev_type == RTE_ETH_DEV_PCI) &&
+			(pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)) ||
+			((dev->dev_type == RTE_ETH_DEV_VIRTUAL) &&
+			(dev->data->dev_flags & RTE_ETH_DEV_INTR_LSC))) {
 		vtpci_irq_config(hw, VIRTIO_MSI_NO_VECTOR);
+	}
 	vtpci_reset(hw);
 	hw->started = 0;
-	virtio_dev_free_mbufs(dev);
-	virtio_free_queues(dev);
+	if ((dev->data->rx_queues != NULL) && (dev->data->tx_queues != NULL)) {
+		virtio_dev_free_mbufs(dev);
+		virtio_free_queues(dev);
+	}
 }
 
 static void
@@ -983,14 +1014,30 @@  virtio_interrupt_handler(__rte_unused struct rte_intr_handle *handle,
 	isr = vtpci_isr(hw);
 	PMD_DRV_LOG(INFO, "interrupt status = %#x", isr);
 
-	if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0)
-		PMD_DRV_LOG(ERR, "interrupt enable failed");
-
-	if (isr & VIRTIO_PCI_ISR_CONFIG) {
+	if (dev->dev_type == RTE_ETH_DEV_PCI) {
+		if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0)
+			PMD_DRV_LOG(ERR, "interrupt enable failed");
+		if (isr & VIRTIO_PCI_ISR_CONFIG) {
+			if (virtio_dev_link_update(dev, 0) == 0)
+				_rte_eth_dev_callback_process(dev,
+						RTE_ETH_EVENT_INTR_LSC);
+		}
+	}
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+	else if (dev->dev_type == RTE_ETH_DEV_VIRTUAL) {
+		if (qtest_intr_enable(dev->data) < 0)
+			PMD_DRV_LOG(ERR, "interrupt enable failed");
+		/*
+		 * If the last qtest message is an interrupt, 'isr' will be 0
+		 * because the socket has been closed already.
+		 * But we still want to notify EAL of this event,
+		 * so just ignore the isr value.
+		 */
 		if (virtio_dev_link_update(dev, 0) == 0)
 			_rte_eth_dev_callback_process(dev,
-						      RTE_ETH_EVENT_INTR_LSC);
+					RTE_ETH_EVENT_INTR_LSC);
 	}
+#endif
 
 }
 
@@ -1014,7 +1061,8 @@  eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 	struct virtio_hw *hw = eth_dev->data->dev_private;
 	struct virtio_net_config *config;
 	struct virtio_net_config local_config;
-	struct rte_pci_device *pci_dev;
+	struct rte_pci_device *pci_dev = eth_dev->pci_dev;
+	struct rte_pci_id id;
 
 	RTE_BUILD_BUG_ON(RTE_PKTMBUF_HEADROOM < sizeof(struct virtio_net_hdr));
 
@@ -1052,8 +1100,14 @@  eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 		return -1;
 
 	/* If host does not support status then disable LSC */
-	if (!vtpci_with_feature(hw, VIRTIO_NET_F_STATUS))
-		pci_dev->driver->drv_flags &= ~RTE_PCI_DRV_INTR_LSC;
+	if (!vtpci_with_feature(hw, VIRTIO_NET_F_STATUS)) {
+		if (eth_dev->dev_type == RTE_ETH_DEV_PCI)
+			pci_dev->driver->drv_flags &= ~RTE_PCI_DRV_INTR_LSC;
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+		else if (eth_dev->dev_type == RTE_ETH_DEV_VIRTUAL)
+			eth_dev->data->dev_flags &= ~RTE_ETH_DEV_INTR_LSC;
+#endif
+	}
 
 	rte_eth_copy_pci_info(eth_dev, pci_dev);
 
@@ -1132,14 +1186,30 @@  eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 
 	PMD_INIT_LOG(DEBUG, "hw->max_rx_queues=%d   hw->max_tx_queues=%d",
 			hw->max_rx_queues, hw->max_tx_queues);
+
+	memset(&id, 0, sizeof(id));
+	if (eth_dev->dev_type == RTE_ETH_DEV_PCI)
+		id = pci_dev->id;
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+	else if (eth_dev->dev_type == RTE_ETH_DEV_VIRTUAL)
+		id = qtest_get_pci_id_of_virtio_net();
+#endif
+
 	PMD_INIT_LOG(DEBUG, "port %d vendorID=0x%x deviceID=0x%x",
-			eth_dev->data->port_id, pci_dev->id.vendor_id,
-			pci_dev->id.device_id);
+			eth_dev->data->port_id,
+			id.vendor_id, id.device_id);
 
 	/* Setup interrupt callback  */
-	if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+	if ((eth_dev->dev_type == RTE_ETH_DEV_PCI) &&
+			(pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC))
 		rte_intr_callback_register(&pci_dev->intr_handle,
-				   virtio_interrupt_handler, eth_dev);
+				virtio_interrupt_handler, eth_dev);
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+	else if ((eth_dev->dev_type == RTE_ETH_DEV_VIRTUAL) &&
+			(eth_dev->data->dev_flags & RTE_ETH_DEV_INTR_LSC))
+		qtest_intr_callback_register(eth_dev->data,
+				virtio_interrupt_handler, eth_dev);
+#endif
 
 	virtio_dev_cq_start(eth_dev);
 
@@ -1173,10 +1243,18 @@  eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 	eth_dev->data->mac_addrs = NULL;
 
 	/* reset interrupt callback  */
-	if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+	if ((eth_dev->dev_type == RTE_ETH_DEV_PCI) &&
+			(pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC))
 		rte_intr_callback_unregister(&pci_dev->intr_handle,
 						virtio_interrupt_handler,
 						eth_dev);
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+	else if ((eth_dev->dev_type == RTE_ETH_DEV_VIRTUAL) &&
+			(eth_dev->data->dev_flags & RTE_ETH_DEV_INTR_LSC))
+		qtest_intr_callback_unregister(eth_dev->data,
+				virtio_interrupt_handler, eth_dev);
+#endif
+
 	vtpci_uninit(eth_dev, hw);
 
 	PMD_INIT_LOG(DEBUG, "dev_uninit completed");
@@ -1241,11 +1319,15 @@  virtio_dev_configure(struct rte_eth_dev *dev)
 		return -ENOTSUP;
 	}
 
-	if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+	if (((dev->dev_type == RTE_ETH_DEV_PCI) &&
+			(pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)) ||
+			((dev->dev_type == RTE_ETH_DEV_VIRTUAL) &&
+			(dev->data->dev_flags & RTE_ETH_DEV_INTR_LSC))) {
 		if (vtpci_irq_config(hw, 0) == VIRTIO_MSI_NO_VECTOR) {
 			PMD_DRV_LOG(ERR, "failed to set config vector");
 			return -EBUSY;
 		}
+	}
 
 	return 0;
 }
@@ -1260,15 +1342,31 @@  virtio_dev_start(struct rte_eth_dev *dev)
 
 	/* check if lsc interrupt feature is enabled */
 	if (dev->data->dev_conf.intr_conf.lsc) {
-		if (!(pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)) {
-			PMD_DRV_LOG(ERR, "link status not supported by host");
-			return -ENOTSUP;
-		}
+		if (dev->dev_type == RTE_ETH_DEV_PCI) {
+			if (!(pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)) {
+				PMD_DRV_LOG(ERR,
+					"link status not supported by host");
+				return -ENOTSUP;
+			}
 
-		if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0) {
-			PMD_DRV_LOG(ERR, "interrupt enable failed");
-			return -EIO;
+			if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0) {
+				PMD_DRV_LOG(ERR, "interrupt enable failed");
+				return -EIO;
+			}
 		}
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+		else if (dev->dev_type == RTE_ETH_DEV_VIRTUAL) {
+			if (!(dev->data->dev_flags & RTE_ETH_DEV_INTR_LSC)) {
+				PMD_DRV_LOG(ERR,
+					"link status not supported by host");
+				return -ENOTSUP;
+			}
+			if (qtest_intr_enable(dev->data) < 0) {
+				PMD_DRV_LOG(ERR, "interrupt enable failed");
+				return -EIO;
+			}
+		}
+#endif
 	}
 
 	/* Initialize Link state */
@@ -1365,8 +1463,15 @@  virtio_dev_stop(struct rte_eth_dev *dev)
 
 	PMD_INIT_LOG(DEBUG, "stop");
 
-	if (dev->data->dev_conf.intr_conf.lsc)
-		rte_intr_disable(&dev->pci_dev->intr_handle);
+	if (dev->data->dev_conf.intr_conf.lsc) {
+		if (dev->dev_type == RTE_ETH_DEV_PCI)
+			rte_intr_disable(&dev->pci_dev->intr_handle);
+
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+		if (dev->dev_type == RTE_ETH_DEV_VIRTUAL)
+			qtest_intr_disable(dev->data);
+#endif
+	}
 
 	memset(&link, 0, sizeof(link));
 	virtio_dev_atomic_write_link_status(dev, &link);
@@ -1411,7 +1516,13 @@  virtio_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 {
 	struct virtio_hw *hw = dev->data->dev_private;
 
-	dev_info->driver_name = dev->driver->pci_drv.name;
+	if (dev->dev_type == RTE_ETH_DEV_PCI)
+		dev_info->driver_name = dev->driver->pci_drv.name;
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+	else if (dev->dev_type == RTE_ETH_DEV_VIRTUAL)
+		dev_info->driver_name = dev->data->drv_name;
+#endif
+
 	dev_info->max_rx_queues = (uint16_t)hw->max_rx_queues;
 	dev_info->max_tx_queues = (uint16_t)hw->max_tx_queues;
 	dev_info->min_rx_bufsize = VIRTIO_MIN_RX_BUFSIZE;
@@ -1439,3 +1550,196 @@  static struct rte_driver rte_virtio_driver = {
 };
 
 PMD_REGISTER_DRIVER(rte_virtio_driver);
+
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+
+#define ETH_VIRTIO_NET_ARG_QTEST_PATH           "qtest"
+#define ETH_VIRTIO_NET_ARG_IVSHMEM_PATH         "ivshmem"
+
+static const char *valid_args[] = {
+	ETH_VIRTIO_NET_ARG_QTEST_PATH,
+	ETH_VIRTIO_NET_ARG_IVSHMEM_PATH,
+	NULL
+};
+
+static int
+get_string_arg(const char *key __rte_unused,
+		const char *value, void *extra_args)
+{
+	int ret, fd, loop = 3;
+	int *pfd = extra_args;
+	struct sockaddr_un sa = {0};
+
+	if ((value == NULL) || (extra_args == NULL))
+		return -EINVAL;
+
+	fd = socket(AF_UNIX, SOCK_STREAM, 0);
+	if (fd < 0)
+		return -1;
+
+	sa.sun_family = AF_UNIX;
+	strncpy(sa.sun_path, value, sizeof(sa.sun_path));
+
+	while (loop--) {
+		/*
+		 * may need to wait until the qtest and ivshmem
+		 * sockets are prepared by QEMU.
+		 */
+		ret = connect(fd, (struct sockaddr *)&sa,
+				sizeof(struct sockaddr_un));
+		if (ret == 0)
+			break;
+		else
+			usleep(100000);
+	}
+
+	if (ret != 0) {
+		close(fd);
+		return -1;
+	}
+
+	*pfd = fd;
+
+	return 0;
+}
+
+static struct rte_eth_dev *
+virtio_net_eth_dev_alloc(const char *name)
+{
+	struct rte_eth_dev *eth_dev;
+	struct rte_eth_dev_data *data;
+	struct virtio_hw *hw;
+
+	eth_dev = rte_eth_dev_allocate(name, RTE_ETH_DEV_VIRTUAL);
+	if (eth_dev == NULL)
+		rte_panic("cannot alloc rte_eth_dev\n");
+
+	data = eth_dev->data;
+
+	hw = rte_zmalloc(NULL, sizeof(*hw), 0);
+	if (!hw)
+		rte_panic("malloc virtio_hw failed\n");
+
+	data->dev_private = hw;
+	eth_dev->driver = &rte_virtio_pmd;
+	return eth_dev;
+}
+
+/*
+ * Initialization when "CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE" is enabled.
+ */
+static int
+rte_virtio_net_pmd_init(const char *name, const char *params)
+{
+	struct rte_kvargs *kvlist = NULL;
+	struct rte_eth_dev *eth_dev = NULL;
+	int ret, qtest_sock, ivshmem_sock;
+	struct rte_mem_config *mcfg;
+
+	if (params == NULL || params[0] == '\0')
+		goto error;
+
+	/* get pointer to global configuration */
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	/* Check if EAL memory consists of one memory segment */
+	if ((RTE_MAX_MEMSEG > 1) && (mcfg->memseg[1].addr != NULL)) {
+		PMD_INIT_LOG(ERR, "Non-contiguous memory");
+		goto error;
+	}
+
+	kvlist = rte_kvargs_parse(params, valid_args);
+	if (!kvlist) {
+		PMD_INIT_LOG(ERR, "error when parsing param");
+		goto error;
+	}
+
+	if (rte_kvargs_count(kvlist, ETH_VIRTIO_NET_ARG_IVSHMEM_PATH) == 1) {
+		ret = rte_kvargs_process(kvlist, ETH_VIRTIO_NET_ARG_IVSHMEM_PATH,
+				&get_string_arg, &ivshmem_sock);
+		if (ret != 0) {
+			PMD_INIT_LOG(ERR,
+				"Failed to connect to ivshmem socket");
+			goto error;
+		}
+	} else {
+		PMD_INIT_LOG(ERR, "No argument specified for %s",
+				ETH_VIRTIO_NET_ARG_IVSHMEM_PATH);
+		goto error;
+	}
+
+	if (rte_kvargs_count(kvlist, ETH_VIRTIO_NET_ARG_QTEST_PATH) == 1) {
+		ret = rte_kvargs_process(kvlist, ETH_VIRTIO_NET_ARG_QTEST_PATH,
+				&get_string_arg, &qtest_sock);
+		if (ret != 0) {
+			PMD_INIT_LOG(ERR,
+				"Failed to connect to qtest socket");
+			goto error;
+		}
+	} else {
+		PMD_INIT_LOG(ERR, "No argument specified for %s",
+				ETH_VIRTIO_NET_ARG_QTEST_PATH);
+		goto error;
+	}
+
+	eth_dev = virtio_net_eth_dev_alloc(name);
+
+	qtest_vdev_init(eth_dev->data, qtest_sock, ivshmem_sock);
+
+	/* for a PCI device, this would be called from rte_eal_pci_probe() */
+	eth_virtio_dev_init(eth_dev);
+
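+	/*
+	 * The PCI driver pointer set in virtio_net_eth_dev_alloc() is only
+	 * kept for eth_virtio_dev_init(); drop it here and mark the port as
+	 * a detachable vdev with no kernel driver.
+	 */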
+	eth_dev->driver = NULL;
+	eth_dev->data->dev_flags |= RTE_ETH_DEV_DETACHABLE;
+	eth_dev->data->kdrv = RTE_KDRV_NONE;
+	eth_dev->data->drv_name = "rte_virtio_pmd";
+
+	rte_kvargs_free(kvlist);
+	return 0;
+
+error:
+	rte_kvargs_free(kvlist);
+	return -EFAULT;
+}
+
+/*
+ * Finalization when "CONFIG_RTE_LIBRTE_VIRTIO_HOST_MODE" is enabled.
+ */
+static int
+rte_virtio_net_pmd_uninit(const char *name)
+{
+	struct rte_eth_dev *eth_dev = NULL;
+	int ret;
+
+	if (name == NULL)
+		return -EINVAL;
+
+	/* find the ethdev entry */
+	eth_dev = rte_eth_dev_allocated(name);
+	if (eth_dev == NULL)
+		return -ENODEV;
+
+	ret = eth_virtio_dev_uninit(eth_dev);
+	if (ret != 0)
+		return -EFAULT;
+
+	qtest_vdev_uninit(eth_dev->data);
+	rte_free(eth_dev->data->dev_private);
+
+	ret = rte_eth_dev_release_port(eth_dev);
+	if (ret != 0)
+		return -EFAULT;
+
+	return 0;
+}
+
+static struct rte_driver rte_virtio_net_driver = {
+	.name   = "eth_virtio_net",
+	.type   = PMD_VDEV,
+	.init   = rte_virtio_net_pmd_init,
+	.uninit = rte_virtio_net_pmd_uninit,
+};
+
+PMD_REGISTER_DRIVER(rte_virtio_net_driver);
+
+#endif /* RTE_LIBRTE_VIRTIO_HOST_MODE */
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index fed9571..81e6465 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -123,5 +123,17 @@  uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
 #define VTNET_LRO_FEATURES (VIRTIO_NET_F_GUEST_TSO4 | \
 			    VIRTIO_NET_F_GUEST_TSO6 | VIRTIO_NET_F_GUEST_ECN)
 
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
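+/*
+ * Host-mode hooks: device lifecycle and link-status interrupt handling for a
+ * virtio-net device driven over a QEMU qtest session instead of a real PCI
+ * device.
+ */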
+int qtest_vdev_init(struct rte_eth_dev_data *data,
+		int qtest_socket, int ivshmem_socket);
+void qtest_vdev_uninit(struct rte_eth_dev_data *data);
+void qtest_intr_callback_register(void *data,
+		rte_intr_callback_fn cb, void *cb_arg);
+void qtest_intr_callback_unregister(void *data,
+		rte_intr_callback_fn cb, void *cb_arg);
+int qtest_intr_enable(void *data);
+int qtest_intr_disable(void *data);
+struct rte_pci_id qtest_get_pci_id_of_virtio_net(void);
+#endif /* RTE_LIBRTE_VIRTIO_HOST_MODE */
 
 #endif /* _VIRTIO_ETHDEV_H_ */
diff --git a/drivers/net/virtio/virtio_pci.c b/drivers/net/virtio/virtio_pci.c
index 98eef85..2121234 100644
--- a/drivers/net/virtio/virtio_pci.c
+++ b/drivers/net/virtio/virtio_pci.c
@@ -145,6 +145,98 @@  static const struct virtio_pci_dev_ops phys_modern_dev_ops = {
 	.write32	= phys_modern_write32,
 };
 
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
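+/*
+ * Host-mode accessors for a legacy virtio device. "addr" carries a register
+ * offset cast to a pointer; it is added to the legacy I/O port base and the
+ * access is issued as port I/O over the qtest socket.
+ */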
+static uint8_t
+virt_legacy_read8(struct virtio_hw *hw, uint8_t *addr)
+{
+	return qtest_in(hw, (uint16_t)(hw->io_base + (uint64_t)addr), 'b');
+}
+
+static uint16_t
+virt_legacy_read16(struct virtio_hw *hw, uint16_t *addr)
+{
+	return qtest_in(hw, (uint16_t)(hw->io_base + (uint64_t)addr), 'w');
+}
+
+static uint32_t
+virt_legacy_read32(struct virtio_hw *hw, uint32_t *addr)
+{
+	return qtest_in(hw, (uint16_t)(hw->io_base + (uint64_t)addr), 'l');
+}
+
+static void
+virt_legacy_write8(struct virtio_hw *hw, uint8_t *addr, uint8_t val)
+{
+	qtest_out(hw, (uint16_t)(hw->io_base + (uint64_t)addr), val, 'b');
+}
+
+static void
+virt_legacy_write16(struct virtio_hw *hw, uint16_t *addr, uint16_t val)
+{
+	qtest_out(hw, (uint16_t)(hw->io_base + (uint64_t)addr), val, 'w');
+}
+
+static void
+virt_legacy_write32(struct virtio_hw *hw, uint32_t *addr, uint32_t val)
+{
+	qtest_out(hw, (uint16_t)(hw->io_base + (uint64_t)addr), val, 'l');
+}
+
+static const struct virtio_pci_dev_ops virt_legacy_dev_ops = {
+	.read8		= virt_legacy_read8,
+	.read16		= virt_legacy_read16,
+	.read32		= virt_legacy_read32,
+	.write8		= virt_legacy_write8,
+	.write16	= virt_legacy_write16,
+	.write32	= virt_legacy_write32,
+};
+
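+/*
+ * Host-mode accessors for a modern virtio device. "addr" is the guest address
+ * obtained via virt_get_mapped_addr(); the access is issued as a memory
+ * read/write over the qtest socket.
+ */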
+static uint8_t
+virt_modern_read8(struct virtio_hw *hw, uint8_t *addr)
+{
+	return qtest_read(hw, (uint64_t)addr, 'b');
+}
+
+static uint16_t
+virt_modern_read16(struct virtio_hw *hw, uint16_t *addr)
+{
+	return qtest_read(hw, (uint64_t)addr, 'w');
+}
+
+static uint32_t
+virt_modern_read32(struct virtio_hw *hw, uint32_t *addr)
+{
+	return qtest_read(hw, (uint64_t)addr, 'l');
+}
+
+static void
+virt_modern_write8(struct virtio_hw *hw, uint8_t *addr, uint8_t val)
+{
+	qtest_write(hw, (uint64_t)addr, val, 'b');
+}
+
+static void
+virt_modern_write16(struct virtio_hw *hw, uint16_t *addr, uint16_t val)
+{
+	qtest_write(hw, (uint64_t)addr, val, 'w');
+}
+
+static void
+virt_modern_write32(struct virtio_hw *hw, uint32_t *addr, uint32_t val)
+{
+	qtest_write(hw, (uint64_t)addr, val, 'l');
+}
+
+static const struct virtio_pci_dev_ops virt_modern_dev_ops = {
+	.read8		= virt_modern_read8,
+	.read16		= virt_modern_read16,
+	.read32		= virt_modern_read32,
+	.write8		= virt_modern_write8,
+	.write16	= virt_modern_write16,
+	.write32	= virt_modern_write32,
+};
+#endif /* RTE_LIBRTE_VIRTIO_HOST_MODE */
+
 static int
 vtpci_dev_init(struct rte_eth_dev *dev, struct virtio_hw *hw)
 {
@@ -154,6 +246,17 @@  vtpci_dev_init(struct rte_eth_dev *dev, struct virtio_hw *hw)
 		else
 			hw->vtpci_dev_ops = &phys_legacy_dev_ops;
 		return 0;
+	} else if (dev->dev_type == RTE_ETH_DEV_VIRTUAL) {
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
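+		/* Only vdevs created by this PMD use the qtest-backed ops. */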
+		if (strncmp(dev->data->name, "eth_virtio_net",
+				strlen("eth_virtio_net")) == 0) {
+			if (hw->modern == 1)
+				hw->vtpci_dev_ops = &virt_modern_dev_ops;
+			else
+				hw->vtpci_dev_ops = &virt_legacy_dev_ops;
+			return 0;
+		}
+#endif
 	}
 
 	PMD_DRV_LOG(ERR, "Unknown virtio-net device.");
@@ -224,12 +327,81 @@  static const struct virtio_pci_cfg_ops phys_cfg_ops = {
 	.read			= phys_read_pci_cfg,
 };
 
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+static int
+virt_map_pci_cfg(struct virtio_hw *hw __rte_unused)
+{
+	return 0;
+}
+
+static void
+virt_unmap_pci_cfg(struct virtio_hw *hw __rte_unused)
+{
+	return;
+}
+
+static int
+virt_read_pci_cfg(struct virtio_hw *hw, void *buf, size_t len, off_t offset)
+{
+	qtest_read_pci_cfg(hw, "virtio-net", buf, len, offset);
+	return 0;
+}
+
+static void *
+virt_get_mapped_addr(struct virtio_hw *hw, uint8_t bar,
+		     uint32_t offset, uint32_t length)
+{
+	uint64_t base;
+	uint64_t size;
+
+	if (qtest_get_bar_size(hw, "virtio-net", bar, &size) < 0) {
+		PMD_INIT_LOG(ERR, "invalid bar: %u", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		PMD_INIT_LOG(ERR, "offset(%u) + length(%u) overflows",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > size) {
+		PMD_INIT_LOG(ERR,
+			"invalid cap: overflows bar space: %u > %"PRIu64,
+			offset + length, size);
+		return NULL;
+	}
+
+	if (qtest_get_bar_addr(hw, "virtio-net", bar, &base) < 0) {
+		PMD_INIT_LOG(ERR, "invalid bar: %u", bar);
+		return NULL;
+	}
+
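+	/*
+	 * Nothing is mapped into this process; the value returned is the
+	 * guest address inside the BAR, which the qtest accessors dereference
+	 * remotely.
+	 */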
+	return (void *)(base + offset);
+}
+
+static const struct virtio_pci_cfg_ops virt_cfg_ops = {
+	.map			= virt_map_pci_cfg,
+	.unmap			= virt_unmap_pci_cfg,
+	.get_mapped_addr	= virt_get_mapped_addr,
+	.read			= virt_read_pci_cfg,
+};
+#endif /* RTE_LIBRTE_VIRTIO_HOST_MODE */
+
 static int
 vtpci_cfg_init(struct rte_eth_dev *dev, struct virtio_hw *hw)
 {
 	if (dev->dev_type == RTE_ETH_DEV_PCI) {
 		hw->vtpci_cfg_ops = &phys_cfg_ops;
 		return 0;
+	} else if (dev->dev_type == RTE_ETH_DEV_VIRTUAL) {
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+		if (strncmp(dev->data->name, "eth_virtio_net",
+				strlen("eth_virtio_net")) == 0) {
+			hw->vtpci_cfg_ops = &virt_cfg_ops;
+			return 0;
+		}
+#endif
 	}
 
 	PMD_DRV_LOG(ERR, "Unknown virtio-net device.");
@@ -785,7 +957,7 @@  modern_setup_queue(struct virtio_hw *hw, struct virtqueue *vq)
 	uint64_t desc_addr, avail_addr, used_addr;
 	uint16_t notify_off;
 
-	desc_addr = vq->mz->phys_addr;
+	desc_addr = vq->vq_ring_mem;
 	avail_addr = desc_addr + vq->vq_nentries * sizeof(struct vring_desc);
 	used_addr = RTE_ALIGN_CEIL(avail_addr + offsetof(struct vring_avail,
 							 ring[vq->vq_nentries]),
@@ -1019,6 +1191,14 @@  vtpci_modern_init(struct rte_eth_dev *dev, struct virtio_hw *hw)
 
 	if (dev->dev_type == RTE_ETH_DEV_PCI)
 		pci_dev->driver->drv_flags |= RTE_PCI_DRV_INTR_LSC;
+	else if (dev->dev_type == RTE_ETH_DEV_VIRTUAL) {
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+		if (strncmp(dev->data->name, "eth_virtio_net",
+				strlen("eth_virtio_net")) == 0) {
+			dev->data->dev_flags |= RTE_ETH_DEV_INTR_LSC;
+		}
+#endif
+	}
 
 	hw->vtpci_ops = &modern_ops;
 	hw->modern = 1;
@@ -1037,6 +1217,14 @@  vtpci_legacy_init(struct rte_eth_dev *dev, struct virtio_hw *hw)
 			return -1;
 
 		hw->use_msix = legacy_virtio_has_msix(&pci_dev->addr);
+	} else if (dev->dev_type == RTE_ETH_DEV_VIRTUAL) {
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+		if (strncmp(dev->data->name, "eth_virtio_net",
+					strlen("eth_virtio_net")) == 0) {
+			hw->use_msix = 0;
+			dev->data->dev_flags |= RTE_ETH_DEV_INTR_LSC;
+		}
+#endif
 	}
 
 	hw->io_base = (uint32_t)(uintptr_t)
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 7b5ad54..cdc23b5 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -267,6 +267,9 @@  struct virtio_net_config;
 
 struct virtio_hw {
 	struct virtqueue *cvq;
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+	void        *qsession;
+#endif
 	uint32_t    io_base;
 	uint64_t    guest_features;
 	uint32_t    max_tx_queues;
@@ -366,4 +369,17 @@  uint8_t vtpci_isr(struct virtio_hw *);
 
 uint16_t vtpci_irq_config(struct virtio_hw *, uint16_t);
 
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
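+/*
+ * qtest transport helpers used in host mode: port I/O, memory-mapped I/O and
+ * PCI configuration space accesses forwarded to the QEMU process.
+ */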
+uint32_t qtest_in(struct virtio_hw *, uint16_t, char type);
+void qtest_out(struct virtio_hw *, uint16_t, uint64_t, char type);
+uint32_t qtest_read(struct virtio_hw *, uint64_t, char type);
+void qtest_write(struct virtio_hw *, uint64_t, uint64_t, char type);
+int qtest_read_pci_cfg(struct virtio_hw *hw, const char *name,
+		void *buf, size_t len, off_t offset);
+int qtest_get_bar_addr(struct virtio_hw *hw, const char *name,
+		uint8_t bar, uint64_t *addr);
+int qtest_get_bar_size(struct virtio_hw *hw, const char *name,
+		uint8_t bar, uint64_t *size);
+#endif
+
 #endif /* _VIRTIO_PCI_H_ */
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 41a1366..f842c79 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -191,8 +191,7 @@  virtqueue_enqueue_recv_refill(struct virtqueue *vq, struct rte_mbuf *cookie)
 
 	start_dp = vq->vq_ring.desc;
 	start_dp[idx].addr =
-		(uint64_t)(cookie->buf_physaddr + RTE_PKTMBUF_HEADROOM
-		- hw->vtnet_hdr_size);
+		RTE_MBUF_DATA_DMA_ADDR(cookie) - hw->vtnet_hdr_size;
 	start_dp[idx].len =
 		cookie->buf_len - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
 	start_dp[idx].flags =  VRING_DESC_F_WRITE;
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 99d4fa9..b772e04 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -66,8 +66,13 @@  struct rte_mbuf;
 
 #define VIRTQUEUE_MAX_NAME_SZ 32
 
+#ifdef RTE_LIBRTE_VIRTIO_HOST_MODE
+#define RTE_MBUF_DATA_DMA_ADDR(mb) \
+	((uint64_t)(mb)->buf_addr + (mb)->data_off)
+#else
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
 	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
+#endif
 
 #define VTNET_SQ_RQ_QUEUE_IDX 0
 #define VTNET_SQ_TQ_QUEUE_IDX 1
@@ -167,7 +172,8 @@  struct virtqueue {
 
 	void        *vq_ring_virt_mem;    /**< linear address of vring*/
 	unsigned int vq_ring_size;
-	phys_addr_t vq_ring_mem;          /**< physical address of vring */
+	phys_addr_t vq_ring_mem;          /**< physical address of vring,
+						or virtual address for a vdev */
 
 	struct vring vq_ring;    /**< vring keeping desc, used and avail */
 	uint16_t    vq_free_cnt; /**< num of desc available */
@@ -188,6 +194,7 @@  struct virtqueue {
 	uint16_t vq_avail_idx;
 	uint64_t mbuf_initializer; /**< value to init mbufs. */
 	phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */
+	void	*virtio_net_hdr_vaddr;	/**< linear address of virtio net hdr */
 
 	struct rte_mbuf **sw_ring; /**< RX software ring. */
 	/* dummy mbuf, for wraparound when processing RX ring. */