[dpdk-dev,RFC,1/2] EAL: Add new EAL "--shm" option.

Message ID 1447930650-26023-2-git-send-email-mukawa@igel.co.jp (mailing list archive)
State RFC, archived

Commit Message

Tetsuya Mukawa Nov. 19, 2015, 10:57 a.m. UTC
  The patch adds a new EAL "--shm" option. If the option is specified,
EAL will allocate its memory from a single file on hugetlbfs. This memory
is used to share memory between the DPDK application and a QEMU ivshmem
device.

Signed-off-by: Tetsuya Mukawa <mukawa@igel.co.jp>
---
 lib/librte_eal/common/eal_common_options.c |  5 +++
 lib/librte_eal/common/eal_internal_cfg.h   |  1 +
 lib/librte_eal/common/eal_options.h        |  2 +
 lib/librte_eal/common/include/rte_memory.h |  5 +++
 lib/librte_eal/linuxapp/eal/eal_memory.c   | 71 ++++++++++++++++++++++++++++++
 5 files changed, 84 insertions(+)
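
For illustration, a minimal sketch of enabling the new option from an
application. This is not part of the patch; the core mask, channel count
and argument vector below are placeholders only.

#include <rte_eal.h>

/* Hedged example: pass "--shm" through to rte_eal_init() so that all DPDK
 * memory is backed by one hugetlbfs file that can be handed to ivshmem.
 */
int
main(void)
{
	char *eal_argv[] = { "app", "-c", "0x3", "-n", "4", "--shm" };

	if (rte_eal_init(6, eal_argv) < 0)
		return -1;
	/* EAL memory now consists of a single shared file. */
	return 0;
}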
  

Comments

Tetsuya Mukawa Dec. 16, 2015, 8:37 a.m. UTC | #1
[Change log]

PATCH v1:
(Only functionality changes and important bug fixes are listed.)
* Support virtio-net interrupt handling.
  (This means the virtio-net PMDs on host and guest have the same virtio-net features.)
* Fix the memory allocation method so that contiguous memory is allocated correctly.
* Support port hotplug.
* Rebase on DPDK-2.2.


[Abstraction]

Normally, the virtio-net PMD only works inside a VM, because there is no virtio-net device on the host.
This RFC patch extends the virtio-net PMD so that it can also work on the host as a virtual PMD.
However, we did not implement a virtio-net device as part of the virtio-net PMD.
To provide a virtio-net device for the PMD, a QEMU process is started in its special QTest mode, and the virtio-net PMD connects to it through a unix domain socket.

The virtio-net PMD on the host is fully compatible with the PMD on the guest.
It provides the same functionality and can connect to anything a QEMU virtio-net device can.
For example, the PMD can use the virtio-net multi-queue function, and it can connect to the vhost-net kernel module or a vhost-user backend application.
As with the virtio-net PMD on QEMU, the memory of the application that uses the virtio-net PMD is shared with the vhost backend application, but the vhost backend application's memory is not shared back.

The main targets of this PMD are containers such as Docker, rkt, and LXC.
The related processes (the virtio-net PMD process, QEMU, and the vhost-user backend process) can be isolated from each other by containers.
However, a shared directory is needed so that they can communicate through the unix domain socket.


[How to use]

So far, a QEMU patch is needed to connect to the vhost-user backend; see the patch below.
 - http://patchwork.ozlabs.org/patch/552549/
To learn how to use it, check the commit log.


[Detailed Description]

 - virtio-net device implementation
This host-mode PMD uses the QEMU virtio-net device. To do that, the QEMU QTest functionality is used.
QTest is a testing framework for QEMU devices. It allows us to implement a device driver outside of QEMU.
With QTest, we can implement a DPDK application and the virtio-net PMD as a standalone process on the host.
When QEMU is invoked in QTest mode, no guest code runs at all.
To learn more about QTest, see:
 - http://wiki.qemu.org/Features/QTest
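
For reference, a minimal sketch of connecting to that socket. The socket
path is an assumption for illustration; QEMU itself would be started with
something like "-machine pc,accel=qtest -qtest unix:/tmp/qtest0,server".
This is not the code from qtest.c in patch 2/2.

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Illustrative only: open the QTest unix domain socket of one QEMU process. */
static int
qtest_connect(const char *path)
{
	struct sockaddr_un sa;
	int fd;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sun_family = AF_UNIX;
	strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);

	if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}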

 - probing devices
QTest provides a unix domain socket. Through this socket, the driver process can access the I/O ports and memory of the QEMU virtual machine.
The PMD sends I/O port accesses over this socket to probe the PCI devices.
If a virtio-net device and an ivshmem device are found, they are initialized.
The I/O port accesses issued by the virtio-net PMD are also sent through the socket, so the virtio-net PMD can correctly initialize the virtio-net device on QEMU.
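
As a rough illustration of such accesses (again, not the actual qtest.c
code), PCI configuration space can be probed through the legacy 0xcf8/0xcfc
I/O ports using the ASCII QTest "outl"/"inl" commands. Response parsing is
simplified here; a real implementation must also handle partial reads and
interleaved QTest event messages.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Send one QTest command and read back one "OK ..." response line. */
static int
qtest_cmd(int fd, const char *cmd, char *resp, size_t len)
{
	ssize_t n;

	if (write(fd, cmd, strlen(cmd)) != (ssize_t)strlen(cmd))
		return -1;
	n = read(fd, resp, len - 1);
	if (n <= 0)
		return -1;
	resp[n] = '\0';
	return 0;
}

/* Read a 32-bit PCI configuration register of device bus:dev.func. */
static uint32_t
qtest_pci_conf_read(int fd, uint8_t bus, uint8_t dev, uint8_t func, uint8_t reg)
{
	char cmd[64], resp[64];
	unsigned int val = 0xffffffffu;
	uint32_t addr = 0x80000000u | ((uint32_t)bus << 16) |
			((uint32_t)dev << 11) | ((uint32_t)func << 8) |
			(reg & 0xfcu);

	snprintf(cmd, sizeof(cmd), "outl 0xcf8 0x%x\n", addr);
	if (qtest_cmd(fd, cmd, resp, sizeof(resp)) < 0)
		return 0xffffffffu;
	if (qtest_cmd(fd, "inl 0xcfc\n", resp, sizeof(resp)) < 0)
		return 0xffffffffu;
	sscanf(resp, "OK %x", &val);	/* e.g. "OK 0x12345678" */
	return val;
}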

 - ivshmem device to share memory
To share the memory that the virtio-net PMD process uses, an ivshmem device is used.
Because the ivshmem device can only handle one file descriptor, the shared memory must consist of a single file.
To allocate such memory, EAL gets a new option called "--contig-mem".
If the option is specified, EAL opens one file and allocates its memory from hugepages.
While initializing the ivshmem device, we can set its BAR (Base Address Register).
It specifies at which address the QEMU vCPUs access this shared memory.
We specify the host physical address of the shared memory as this address.
This is very useful because we don't need to patch QEMU to calculate an address offset.
(For example, if the virtio-net PMD process allocates memory from the shared memory and then writes its physical address to a virtio-net register, the QEMU virtio-net device can interpret it without calculating an address offset.)
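
A short sketch of how a consumer (for instance, the ivshmem setup code in
patch 2/2) might use the rte_memseg_info_get() helper added by this patch.
The wrapper below is illustrative only and not part of the series.

#include <rte_memory.h>

/* Fetch the fd, length and addresses of the single "--contig-mem" backed
 * segment (memseg[0]), so they can be handed to the ivshmem device; the
 * physical address is the value programmed into the BAR.
 */
static int
get_shared_segment(int *fd, uint64_t *len, phys_addr_t *phys)
{
	void *addr;

	if (rte_memseg_info_get(0, fd, len, &addr) != 0)
		return -1;
	*phys = rte_mem_virt2phy(addr);
	return 0;
}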


[Known issues]

 - vhost-user
So far, a patch must be applied to QEMU in order to use vhost-user.
This is because QEMU does not send the memory information and file descriptor of the ivshmem device to the vhost-user backend.
I have submitted the patch to QEMU.
See "http://patchwork.ozlabs.org/patch/552549/".
Also, we may have an issue in the DPDK vhost library with handling kickfd and callfd.
A patch for this issue is needed. I have a workaround patch, but let me check it more.
If someone wants to check the vhost-user behavior, I will describe it in more detail in a later email.




Tetsuya Mukawa (2):
  EAL: Add new EAL "--contig-mem" option
  virtio: Extend virtio-net PMD to support container environment

 config/common_linuxapp                     |    1 +
 drivers/net/virtio/Makefile                |    4 +
 drivers/net/virtio/qtest.c                 | 1107 ++++++++++++++++++++++++++++
 drivers/net/virtio/virtio_ethdev.c         |  341 ++++++++-
 drivers/net/virtio/virtio_ethdev.h         |   12 +
 drivers/net/virtio/virtio_pci.h            |   25 +
 lib/librte_eal/common/eal_common_options.c |    7 +
 lib/librte_eal/common/eal_internal_cfg.h   |    1 +
 lib/librte_eal/common/eal_options.h        |    2 +
 lib/librte_eal/linuxapp/eal/eal_memory.c   |   77 +-
 10 files changed, 1543 insertions(+), 34 deletions(-)
 create mode 100644 drivers/net/virtio/qtest.c
  
Jianfeng Tan Dec. 24, 2015, 2:05 p.m. UTC | #2
Hi Tetsuya,

After several days' studying your patch, I have some questions as follows:

1. Is physically-contiguous memory really necessary?
This is too strong a requirement, IMHO. IVSHMEM does not require it in its original meaning. So what do you think of
Huawei Xie's idea of using virtual addresses for address translation? (In addition, the virtual address of the mem_table could be
different in the application and in QTest, but this can be addressed because the SET_MEM_TABLE msg will be intercepted by
QTest.)

2. Is root privilege OK in the container case?
Another reason we would like to give up the physically-contiguous feature is that it needs root privilege to read the /proc/self/pagemap
file. Containers have already been widely criticized for poor security isolation, and requiring root privilege makes it worse.
On the other hand, it is not easy to remove root privilege either. If we use vhost-net as the backend, the kernel will definitely
require root privilege to create a tap device/raw socket. We tend to move such work, which requires root, into the runtime
preparation of a container. Do you agree?

3. Is one QTest process per virtio device too heavy?
Although we can foresee that each container will usually own only one virtio device, taking a possibly high density
into consideration, hundreds or even thousands of containers would require the same number of QTest processes. As
you mentioned that port hotplug is supported, is it possible to use just one QTest process for all virtio device
emulation?

As you know, we have another solution along these lines (which is under heavy internal review). But I think we have lots
of common problems to solve, right?

Thanks for your great work!

Thanks,
Jianfeng

> -----Original Message-----
> From: Tetsuya Mukawa [mailto:mukawa@igel.co.jp]
> Sent: Wednesday, December 16, 2015 4:37 PM
> To: dev@dpdk.org
> Cc: nakajima.yoshihiro@lab.ntt.co.jp; Tan, Jianfeng; Xie, Huawei;
> mst@redhat.com; marcandre.lureau@gmail.com; Tetsuya Mukawa
> Subject: [PATCH v1 0/2] Virtio-net PMD Extension to work on host
  
Tetsuya Mukawa Dec. 28, 2015, 11:06 a.m. UTC | #3
On 2015/12/24 23:05, Tan, Jianfeng wrote:
> Hi Tetsuya,
>
> After several days' studying your patch, I have some questions as follows:
>
> 1. Is physically-contig memory really necessary?
> This is a too strong requirement IMHO. IVSHMEM doesn't require this in its original meaning. So how do you think of
> Huawei Xie's idea of using virtual address for address translation? (In addition, virtual address of mem_table could be
> different in application and QTest, but this can be addressed because SET_MEM_TABLE msg will be intercepted by
> QTest)

Hi Jianfeng,

Thanks for your suggestion.
Huawei's idea may solve the contig-mem restriction.
Let me take some time to check it further.

> 2. Is root privilege OK in container's case?
> Another reason we'd like to give up physically-contig feature is that it needs root privilege to read /proc/self/pagemap
> file. Container has already been widely criticized for bad security isolation. Enabling root privilege will make it worse.

I haven't checked how to invoke a DPDK application in a non-privileged
container.
But if we can invoke it, that's great.

I guess that if we allocate memory the way you do, we probably won't need
to read "/proc/self/pagemap".
Then we will be able to invoke a DPDK application in a non-privileged container.
Is this correct?

> On the other hand, it's not easy to remove root privilege too. If we use vhost-net as the backend, kernel will definitely
> require root privilege to create a tap device/raw socket. We tend to pick such work, which requires root, into runtime
> preparation of a container. Do you agree?

Yes, I agree. It's not easy to remove root privilege in some cases.
I guess that if we can remove it in the vhost-user case, that will be enough for
DPDK users.
What do you think?

> 3.Is one Qtest process per virtio device too heavy?
> Although we can foresee that each container always owns only one virtio device, but take its possible high density
> into consideration, hundreds or even thousands of container requires the same number of QTest processes. As
> you mentioned that port hotplug is supported, is it possible to use just one QTest process for all virtio devices
> emulation?

Yes, we can use PCI hotplug for that purpose.
But it may depend on the security policy.
The shared QEMU process knows all the file descriptors of the DPDK
applications' memories.
Because of this, I guess some users won't want to share one QEMU process.

If vhost-user is used, the QEMU process doesn't use any CPU resources.
So I am not sure a sleeping QEMU process is really overhead.

BTW, if we use PCI hotplug, we need a (virtual) PCI bridge to
cascade PCI devices.
So the implementation will be more complex.
Honestly, I am not sure I will be able to finish it by the next DPDK release.
How about starting from this implementation?
If we really need this feature, we can add it then.

> As you know, we have another solution according to this (which under heavy internal review). But I think we have lots
> of common problems to be solved, right?

Yes, I think so. And thanks for the good suggestion.

Tetsuya,

> Thanks for your great work!
>
> Thanks,
> Jianfeng
>
  
Tetsuya Mukawa Jan. 6, 2016, 3:57 a.m. UTC | #4
On 2015/12/28 20:06, Tetsuya Mukawa wrote:
> On 2015/12/24 23:05, Tan, Jianfeng wrote:
>> Hi Tetsuya,
>>
>> After several days' studying your patch, I have some questions as follows:
>>
>> 1. Is physically-contig memory really necessary?
>> This is a too strong requirement IMHO. IVSHMEM doesn't require this in its original meaning. So how do you think of
>> Huawei Xie's idea of using virtual address for address translation? (In addition, virtual address of mem_table could be
>> different in application and QTest, but this can be addressed because SET_MEM_TABLE msg will be intercepted by
>> QTest)
> Hi Jianfeng,
>
> Thanks for your suggestion.
> Huawei's idea may solve contig-mem restriction.
> Let me have time to check it more.

Hi Jianfeng,

I made sure we can remove the restriction with Huawei's idea.
One thing that concerns me is below.
If we don't use contiguous memory, this PMD will not work together with other
'physical' PMDs such as the e1000 PMD, the (guest) virtio-net PMD, and so on.
(This is because the allocated memory may not be physically contiguous.)

One example: if we implement it like the above, in a QEMU guest we
can handle a host NIC directly, but in a container we will not be able to
handle such a device.
This will be a restriction introduced by the change to virtual addressing.

Do you know of a use case where the user wants to handle a 'physical' PMD and
the 'virtual' virtio-net PMD together?

Tetsuya,
  
Jianfeng Tan Jan. 6, 2016, 5:42 a.m. UTC | #5
On 1/6/2016 11:57 AM, Tetsuya Mukawa wrote:
> On 2015/12/28 20:06, Tetsuya Mukawa wrote:
>> On 2015/12/24 23:05, Tan, Jianfeng wrote:
>>> Hi Tetsuya,
>>>
>>> After several days' studying your patch, I have some questions as follows:
>>>
>>> 1. Is physically-contig memory really necessary?
>>> This is a too strong requirement IMHO. IVSHMEM doesn't require this in its original meaning. So how do you think of
>>> Huawei Xie's idea of using virtual address for address translation? (In addition, virtual address of mem_table could be
>>> different in application and QTest, but this can be addressed because SET_MEM_TABLE msg will be intercepted by
>>> QTest)
>> Hi Jianfeng,
>>
>> Thanks for your suggestion.
>> Huawei's idea may solve contig-mem restriction.
>> Let me have time to check it more.
> Hi Jianfeng,
>
> I made sure we can remove the restriction with Huawei's idea.
> One thing I concern is below.
> If we don't use contiguous memory, this PMD will not work with other
> 'physical' PMDs like e1000 PMD, virtio-net PMD, and etc.
> (This is because allocated memory may not  be physically contiguous.)
>
> One of examples is that if we implement like above, in QEMU guest, we
> can handle a host NIC directly, but in container, we will not be able to
> handle the device.
> This will be a restriction for this virtual addressing changing.
>
> Do you know an use case that the user wants to handle 'physical' PMD and
> 'virtual' virtio-net PMD together?
>
> Tetsuya,
Hi Tetsuya,

I have no use case in hand that handles 'physical' PMDs and the 'virtual'
virtio-net PMD together.
(Pavel Fedin once tried to run OVS in a container, but that case just uses
virtual virtio devices; I don't know if he plans to add 'physical' PMDs as well.)

Actually, it's not completely contradictory to make them work together.
Like this:
a. containers with root privilege
We can initialize memory in the legacy way. (TODO: besides allocating
physically-contiguous memory, we try to allocate one
virtually-contiguous big area for all memsegs as well.)
a.1 For vhost-net, before sending memory regions into the kernel, we can
merge those virtually-contiguous regions into one region (a sketch follows
after this list).
a.2 For vhost-user, we can merge the memory regions in the vhost backend. The
blocker is that, for now, the maximum number of fds is restricted
by VHOST_MEMORY_MAX_NREGIONS=8 (so in the 2M-hugepage case, 16M of shared
memory is not nearly enough).

b. containers without root privilege
No need to worry about this problem, because such a container lacks the
privilege to construct physically-contiguous memory.
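
A minimal sketch of that merging step, assuming a simple array of regions
already sorted by virtual address. The struct and function names are
illustrative only, not existing DPDK code.

#include <stdint.h>

struct va_region {
	uint64_t va;	/* start virtual address */
	uint64_t len;	/* length in bytes */
};

/* Collapse regions whose virtual ranges are adjacent, so fewer entries
 * (and fewer fds) are needed in the SET_MEM_TABLE message.
 */
static unsigned int
merge_contig_regions(struct va_region *r, unsigned int n)
{
	unsigned int i, out = 0;

	for (i = 0; i < n; i++) {
		if (out > 0 && r[out - 1].va + r[out - 1].len == r[i].va)
			r[out - 1].len += r[i].len;	/* extend previous */
		else
			r[out++] = r[i];		/* start a new region */
	}
	return out;	/* merged region count */
}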

Thanks,
Jianfeng
  
Tetsuya Mukawa Jan. 6, 2016, 7:35 a.m. UTC | #6
On 2016/01/06 14:42, Tan, Jianfeng wrote:
>
>
> On 1/6/2016 11:57 AM, Tetsuya Mukawa wrote:
>> On 2015/12/28 20:06, Tetsuya Mukawa wrote:
>>> On 2015/12/24 23:05, Tan, Jianfeng wrote:
>>>> Hi Tetsuya,
>>>>
>>>> After several days' studying your patch, I have some questions as
>>>> follows:
>>>>
>>>> 1. Is physically-contig memory really necessary?
>>>> This is a too strong requirement IMHO. IVSHMEM doesn't require this
>>>> in its original meaning. So how do you think of
>>>> Huawei Xie's idea of using virtual address for address translation?
>>>> (In addition, virtual address of mem_table could be
>>>> different in application and QTest, but this can be addressed
>>>> because SET_MEM_TABLE msg will be intercepted by
>>>> QTest)
>>> Hi Jianfeng,
>>>
>>> Thanks for your suggestion.
>>> Huawei's idea may solve contig-mem restriction.
>>> Let me have time to check it more.
>> Hi Jianfeng,
>>
>> I made sure we can remove the restriction with Huawei's idea.
>> One thing I concern is below.
>> If we don't use contiguous memory, this PMD will not work with other
>> 'physical' PMDs like e1000 PMD, virtio-net PMD, and etc.
>> (This is because allocated memory may not  be physically contiguous.)
>>
>> One of examples is that if we implement like above, in QEMU guest, we
>> can handle a host NIC directly, but in container, we will not be able to
>> handle the device.
>> This will be a restriction for this virtual addressing changing.
>>
>> Do you know an use case that the user wants to handle 'physical' PMD and
>> 'virtual' virtio-net PMD together?
>>
>> Tetsuya,
> Hi Tetsuya,
>
> I have no use case in hand, which handles 'physical' PMDs and
> 'virtual' virtio-net PMD together.
> (Pavel Fedin once tried to run ovs in container, but that case just
> uses virtual virtio devices, I
> don't know if he has plan to add 'physical' PMDs as well.)
>
> Actually, it's not completely contradictory to make them work
> together. Like this:
> a. containers with root privilege
> We can initialize memory as legacy way. (TODO: besides
> physical-contiguous, we try allocate
> virtual-contiguous big area for all memsegs as well.)

Hi Jianfeng,

Yes, I agree with you.
If the feature is really needed, we will be able to find a workaround.

>
> a.1 For vhost-net, before sending memory regions into kernel, we can
> merge those virtual-contiguous regions into one region.
> a.2 For vhost-user, we can merge memory regions in the vhost. The
> blocker is that for now, maximum fd num was restricted
> by VHOST_MEMORY_MAX_NREGIONS=8 (so in 2M-hugepage's case, 16M shared
> memory is not nearly enough).
>

With your current implementation, when the 'virtual' virtio-net PMD is used,
'phys_addr' will hold a virtual address at the EAL layer.

struct rte_memseg {
        phys_addr_t phys_addr;      /**< Start physical address. */
        union {
                void *addr;         /**< Start virtual address. */
                uint64_t addr_64;   /**< Makes sure addr is always 64 bits */
        };
        .......
};

How about choosing it in the virtio-net PMD instead?
(In the 'virtual' case, just use 'addr' instead of 'phys_addr'; see the
sketch below.)
For example, port0 may use physical addresses, while port1 uses virtual
addresses.

With this, of course, we don't have an issue with the 'physical' virtio-net PMD.
Also, with the 'virtual' virtio-net PMD, we can use virtual addresses and an fd
that represents the big virtual address space.
(TODO: change rte_memseg and EAL to keep the fd and offset?)
Then we don't need to worry about VHOST_MEMORY_MAX_NREGIONS, because we have
only one fd.
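
A rough sketch of that per-port choice, assuming a hypothetical per-device
flag; this is not code from the series.

#include <stdint.h>
#include <rte_mbuf.h>

/* Pick the buffer address written into the vring depending on whether this
 * port is a 'physical' virtio device or a QTest-based 'virtual' one.
 * 'use_virtual_addr' is a hypothetical per-device flag.
 */
static inline uint64_t
vring_buf_addr(struct rte_mbuf *m, int use_virtual_addr)
{
	if (use_virtual_addr)
		return (uint64_t)(uintptr_t)rte_pktmbuf_mtod(m, void *);
	return m->buf_physaddr + m->data_off;	/* physical address path */
}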

> b. containers without root privilege
> No need to worry about this problem, because it lacks of privilege to
> construct physical-contiguous memory.
>

Yes, we cannot run 'physical' PMDs in this type of container.
Anyway, I will look into it more if we really need it.

Thanks,
Tetsuya
  
Jianfeng Tan Jan. 11, 2016, 5:31 a.m. UTC | #7
Hi Tetsuya,

> With current your implementation, when 'virtual' virtio-net PMD is used,
> 'phys_addr' will be virtual address in EAL layer.
>
> struct rte_memseg {
>          phys_addr_t phys_addr;      /**< Start physical address. */
>          union {
>                  void *addr;         /**< Start virtual address. */
>                  uint64_t addr_64;   /**< Makes sure addr is always 64
> bits */
>          };
>          .......
> };

That's not true. It does not affect the EAL layer at all. We just fill in
virtual addresses in the virtio PMD when:
1) setting the base address (set_base_addr);
2) preparing RX descriptors;
3) transmitting packets, where the CVA is filled into the TX descriptors;
4) in the TX and CQ headers, where the CVA is used.

>
> How about choosing it in virtio-net PMD?

My current implementation works as you say.

> (In the case of 'virtual', just use 'addr' instead of using 'phys_addr'.)
> For example, port0 may use physical address, but port1 may use virtual
> address.
>
> With this, of course, we don't have an issue with 'physical' virtio-net PMD.
> Also, with 'virtual' virtio-net PMD, we can use virtual address and fd
> that represents the big virtual address space.
> (TODO: Need to change rte_memseg and EAL to keep fd and offset?)

I suppose you mean that, when initializing memory, we just maintain one fd
in the end and mmap all memsegs inside it (a sketch follows below). This
sounds like a good idea to solve the limitation of
VHOST_MEMORY_MAX_NREGIONS.
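
A minimal sketch of that idea, mapping each memseg at a different offset of
one hugetlbfs-backed fd. This is purely illustrative and not code from
either series.

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* With a single backing fd, each memseg is just an (offset, len) window
 * into that file, so SET_MEM_TABLE only ever needs to carry one fd.
 */
static void *
map_memseg(int fd, uint64_t offset, uint64_t len)
{
	void *va;

	va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		  (off_t)offset);
	return (va == MAP_FAILED) ? NULL : va;
}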

Besides, Sergio and I are discussing using VA instead of PA in
VFIO to avoid the physical-contiguity requirement for physical devices.


Thanks,
Jianfeng



> Then, you don't worry about VHOST_MEMORY_MAX_NREGIONS, because we have
> only one fd.
>
>> b. containers without root privilege
>> No need to worry about this problem, because it lacks of privilege to
>> construct physical-contiguous memory.
>>
> Yes, we cannot run 'physical' PMDs in this type of container.
> Anyway, I will check it more, if we really need it.
>
> Thanks,
> Tetsuya
  

Patch

diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c
index 79db608..67b4e52 100644
--- a/lib/librte_eal/common/eal_common_options.c
+++ b/lib/librte_eal/common/eal_common_options.c
@@ -82,6 +82,7 @@  eal_long_options[] = {
 	{OPT_NO_HUGE,           0, NULL, OPT_NO_HUGE_NUM          },
 	{OPT_NO_PCI,            0, NULL, OPT_NO_PCI_NUM           },
 	{OPT_NO_SHCONF,         0, NULL, OPT_NO_SHCONF_NUM        },
+	{OPT_SHM,               0, NULL, OPT_SHM_NUM              },
 	{OPT_PCI_BLACKLIST,     1, NULL, OPT_PCI_BLACKLIST_NUM    },
 	{OPT_PCI_WHITELIST,     1, NULL, OPT_PCI_WHITELIST_NUM    },
 	{OPT_PROC_TYPE,         1, NULL, OPT_PROC_TYPE_NUM        },
@@ -723,6 +724,10 @@  eal_parse_common_option(int opt, const char *optarg,
 		conf->no_hugetlbfs = 1;
 		break;
 
+	case OPT_SHM_NUM:
+		conf->shm = 1;
+		break;
+
 	case OPT_NO_PCI_NUM:
 		conf->no_pci = 1;
 		break;
diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h
index 5f1367e..362ce12 100644
--- a/lib/librte_eal/common/eal_internal_cfg.h
+++ b/lib/librte_eal/common/eal_internal_cfg.h
@@ -66,6 +66,7 @@  struct internal_config {
 	volatile unsigned no_hugetlbfs;   /**< true to disable hugetlbfs */
 	unsigned hugepage_unlink;         /**< true to unlink backing files */
 	volatile unsigned xen_dom0_support; /**< support app running on Xen Dom0*/
+	volatile unsigned shm;            /**< true to create shared memory for ivshmem */
 	volatile unsigned no_pci;         /**< true to disable PCI */
 	volatile unsigned no_hpet;        /**< true to disable HPET */
 	volatile unsigned vmware_tsc_map; /**< true to use VMware TSC mapping
diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h
index 4245fd5..263b4f8 100644
--- a/lib/librte_eal/common/eal_options.h
+++ b/lib/librte_eal/common/eal_options.h
@@ -55,6 +55,8 @@  enum {
 	OPT_HUGE_DIR_NUM,
 #define OPT_HUGE_UNLINK       "huge-unlink"
 	OPT_HUGE_UNLINK_NUM,
+#define OPT_SHM               "shm"
+	OPT_SHM_NUM,
 #define OPT_LCORES            "lcores"
 	OPT_LCORES_NUM,
 #define OPT_LOG_LEVEL         "log-level"
diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 1bed415..9c1effc 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -100,6 +100,7 @@  struct rte_memseg {
 	int32_t socket_id;          /**< NUMA socket ID. */
 	uint32_t nchannel;          /**< Number of channels. */
 	uint32_t nrank;             /**< Number of ranks. */
+	int fd;                     /**< fd used for share this memory */
 #ifdef RTE_LIBRTE_XEN_DOM0
 	 /**< store segment MFNs */
 	uint64_t mfn[DOM0_NUM_MEMBLOCK];
@@ -128,6 +129,10 @@  int rte_mem_lock_page(const void *virt);
  */
 phys_addr_t rte_mem_virt2phy(const void *virt);
 
+
+int
+rte_memseg_info_get(int index, int *pfd, uint64_t *psize, void **paddr);
+
 /**
  * Get the layout of the available physical memory.
  *
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 657d19f..c46c2cf 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -143,6 +143,21 @@  rte_mem_lock_page(const void *virt)
 	return mlock((void*)aligned, page_size);
 }
 
+int
+rte_memseg_info_get(int index, int *pfd, uint64_t *psize, void **paddr)
+{
+	struct rte_mem_config *mcfg;
+	mcfg = rte_eal_get_configuration()->mem_config;
+
+	if (pfd != NULL)
+		*pfd = mcfg->memseg[index].fd;
+	if (psize != NULL)
+		*psize = (uint64_t)mcfg->memseg[index].len;
+	if (paddr != NULL)
+		*paddr = (void *)(uint64_t)mcfg->memseg[index].addr;
+	return 0;
+}
+
 /*
  * Get physical address of any mapped virtual address in the current process.
  */
@@ -1068,6 +1083,41 @@  calc_num_pages_per_socket(uint64_t * memory,
 	return total_num_pages;
 }
 
+static void *
+rte_eal_shm_create(int *pfd, const char *hugedir)
+{
+	int ret, fd;
+	char filepath[256];
+	void *vaddr;
+	uint64_t size = internal_config.memory;
+
+	sprintf(filepath, "%s/%s_cvio", hugedir,
+			internal_config.hugefile_prefix);
+
+	fd = open(filepath, O_CREAT | O_RDWR, 0600);
+	if (fd < 0)
+		rte_panic("open %s failed: %s\n", filepath, strerror(errno));
+
+	ret = flock(fd, LOCK_EX);
+	if (ret < 0) {
+		close(fd);
+		rte_panic("flock %s failed: %s\n", filepath, strerror(errno));
+	}
+
+	ret = ftruncate(fd, size);
+	if (ret < 0)
+		rte_panic("ftruncate failed: %s\n", strerror(errno));
+
+	vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (vaddr != MAP_FAILED) {
+		memset(vaddr, 0, size);
+		*pfd = fd;
+	}
+	memset(vaddr, 0, size);
+
+	return vaddr;
+}
+
 /*
  * Prepare physical memory mapping: fill configuration structure with
  * these infos, return 0 on success.
@@ -1120,6 +1170,27 @@  rte_eal_hugepage_init(void)
 		return 0;
 	}
 
+	/* create shared memory consist of only one file */
+	if (internal_config.shm) {
+		int fd;
+		struct hugepage_info *hpi;
+
+		hpi = &internal_config.hugepage_info[0];
+		addr = rte_eal_shm_create(&fd, hpi->hugedir);
+		if (addr == MAP_FAILED) {
+			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
+					strerror(errno));
+			return -1;
+		}
+		mcfg->memseg[0].phys_addr = rte_mem_virt2phy(addr);
+		mcfg->memseg[0].addr = addr;
+		mcfg->memseg[0].hugepage_sz = hpi->hugepage_sz;
+		mcfg->memseg[0].len = internal_config.memory;
+		mcfg->memseg[0].socket_id = 0;
+		mcfg->memseg[0].fd = fd;
+		return 0;
+	}
+
 /* check if app runs on Xen Dom0 */
 	if (internal_config.xen_dom0_support) {
 #ifdef RTE_LIBRTE_XEN_DOM0