[dpdk-dev,v6,1/4] eal/vfio: add multiple container support

Message ID 20180412071956.66178-2-xiao.w.wang@intel.com (mailing list archive)
State Superseded, archived

Checks

Context               Check    Description
ci/checkpatch         warning  coding style issues
ci/Intel-compilation  fail     apply patch file failure

Commit Message

Xiao Wang April 12, 2018, 7:19 a.m. UTC
  Currently the EAL VFIO framework binds a VFIO group fd to the default
container fd during rte_vfio_setup_device, while in some cases,
e.g. vDPA (vhost data path acceleration), we want to put a VFIO group
into a separate container and program the IOMMU via that container.

This patch adds APIs to support creating a container and binding a
device to a container.

A driver can use the "rte_vfio_create_container" helper to create a
new container from EAL, and "rte_vfio_bind_group" to bind a device
to the newly created container.

During rte_vfio_setup_device, the container bound to the device
will be used for IOMMU setup.
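
A minimal sketch of the intended driver-side flow (hypothetical usage,
assuming the device's iommu_group_no is already known and eliding error
cleanup):

	int container_fd, group_fd;

	/* create a new, empty VFIO container inside EAL */
	container_fd = rte_vfio_create_container();
	if (container_fd < 0)
		return -1;

	/* bind the device's IOMMU group to it; returns the group fd */
	group_fd = rte_vfio_bind_group(container_fd, iommu_group_no);
	if (group_fd < 0)
		return -1;

	/* rte_vfio_setup_device() will now use this container, not the
	 * default one, when programming the IOMMU for the device. */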

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
 config/common_base                       |   1 +
 lib/librte_eal/bsdapp/eal/eal.c          |  50 +++
 lib/librte_eal/common/include/rte_vfio.h | 113 +++++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 522 +++++++++++++++++++++++++------
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   1 +
 lib/librte_eal/rte_eal_version.map       |   6 +
 6 files changed, 601 insertions(+), 92 deletions(-)
  

Comments

Anatoly Burakov April 12, 2018, 2:03 p.m. UTC | #1
On 12-Apr-18 8:19 AM, Xiao Wang wrote:
> Currently eal vfio framework binds vfio group fd to the default
> container fd during rte_vfio_setup_device, while in some cases,
> e.g. vDPA (vhost data path acceleration), we want to put vfio group
> to a separate container and program IOMMU via this container.
> 
> This patch adds some APIs to support container creating and device
> binding with a container.
> 
> A driver could use "rte_vfio_create_container" helper to create a
> new container from eal, use "rte_vfio_bind_group" to bind a device
> to the newly created container.
> 
> During rte_vfio_setup_device, the container bound with the device
> will be used for IOMMU setup.
> 
> Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> ---

Apologies for late review. Some comments below.

<...>

>   
> +struct rte_memseg;
> +
>   /**
>    * Setup vfio_cfg for the device identified by its address.
>    * It discovers the configured I/O MMU groups or sets a new one for the device.
> @@ -131,6 +133,117 @@ rte_vfio_clear_group(int vfio_group_fd);
>   }
>   #endif
>   

<...>

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Perform dma mapping for devices in a container.
> + *
> + * @param container_fd
> + *   the specified container fd
> + *
> + * @param dma_type
> + *   the dma map type
> + *
> + * @param ms
> + *   the dma address region to map
> + *
> + * @return
> + *    0 if successful
> + *   <0 if failed
> + */
> +int __rte_experimental
> +rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms);
> +

First of all, why memseg, instead of va/iova/len? This seems like 
unnecessary attachment to internals of DPDK memory representation. Not 
all memory comes in memsegs, this makes the API unnecessarily specific 
to DPDK memory.

Also, why providing DMA type? There's already a VFIO type pointer in 
vfio_config - you can set this pointer for every new created container, 
so the user wouldn't have to care about IOMMU type. Is it not possible 
to figure out DMA type from within EAL VFIO? If not, maybe provide an 
API to do so, e.g. rte_vfio_container_set_dma_type()?

This will also need to be rebased on top of latest HEAD because there 
already is a similar DMA map/unmap API added, only without the container 
parameter. Perhaps rename these new functions to 
rte_vfio_container_(create|destroy|dma_map|dma_unmap)?
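
To make the suggestion concrete, the API shape being proposed here would
look roughly like this (hypothetical signatures, not part of this v6
patch; the v7 cover letter below adopts the va/iova/len form):

	/* record the IOMMU type once per container, so map/unmap
	 * callers no longer pass a dma_type argument */
	int rte_vfio_container_set_dma_type(int container_fd, int dma_type);

	/* map by address range instead of by DPDK memseg */
	int rte_vfio_container_dma_map(int container_fd, uint64_t vaddr,
			uint64_t iova, uint64_t len);
	int rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
			uint64_t iova, uint64_t len);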

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Perform dma unmapping for devices in a container.
> + *
> + * @param container_fd
> + *   the specified container fd
> + *
> + * @param dma_type
> + *    the dma map type
> + *
> + * @param ms
> + *   the dma address region to unmap
> + *
> + * @return
> + *    0 if successful
> + *   <0 if failed
> + */
> +int __rte_experimental
> +rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms);
> +
>   #endif /* VFIO_PRESENT */
>   

<...>

> @@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
>   		if (vfio_group_fd < 0) {
>   			/* if file not found, it's not an error */
>   			if (errno != ENOENT) {
> -				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
> -						strerror(errno));
> +				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
> +					filename, strerror(errno));

This looks like unintended change.

>   				return -1;
>   			}
>   
> @@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
>   			vfio_group_fd = open(filename, O_RDWR);
>   			if (vfio_group_fd < 0) {
>   				if (errno != ENOENT) {
> -					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
> -							strerror(errno));
> +					RTE_LOG(ERR, EAL,
> +						"Cannot open %s: %s\n",
> +						filename,
> +						strerror(errno));

This looks like unintended change.

>   					return -1;
>   				}
>   				return 0;
> @@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
>   			/* noiommu group found */
>   		}
>   
> -		cur_grp->group_no = iommu_group_no;
> -		cur_grp->fd = vfio_group_fd;
> -		vfio_cfg.vfio_active_groups++;
>   		return vfio_group_fd;
>   	}
> -	/* if we're in a secondary process, request group fd from the primary
> +	/*
> +	 * if we're in a secondary process, request group fd from the primary
>   	 * process via our socket
>   	 */

This looks like unintended change.

>   	else {
> -		int socket_fd, ret;
> -
> -		socket_fd = vfio_mp_sync_connect_to_primary();
> +		int ret;
> +		int socket_fd = vfio_mp_sync_connect_to_primary();
>   
>   		if (socket_fd < 0) {
> -			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
> +			RTE_LOG(ERR, EAL,
> +				"  cannot connect to primary process!\n");

This looks like unintended change.

>   			return -1;
>   		}
>   		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
> @@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
>   			close(socket_fd);
>   			return -1;
>   		}
> +
>   		ret = vfio_mp_sync_receive_request(socket_fd);

This looks like unintended change.

(hint: "git revert -n HEAD && git add -p" is your friend :) )

>   		switch (ret) {
>   		case SOCKET_NO_FD:
> @@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
>   			/* if we got the fd, store it and return it */
>   			if (vfio_group_fd > 0) {
>   				close(socket_fd);
> -				cur_grp->group_no = iommu_group_no;
> -				cur_grp->fd = vfio_group_fd;
> -				vfio_cfg.vfio_active_groups++;
>   				return vfio_group_fd;
>   			}
>   			/* fall-through on error */
> @@ -147,70 +123,349 @@ vfio_get_group_fd(int iommu_group_no)
>   	return -1;

<...>

> +int __rte_experimental
> +rte_vfio_create_container(void)
> +{
> +	struct vfio_config *vfio_cfg;
> +	int i;
> +
> +	/* Find an empty slot to store new vfio config */
> +	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
> +		if (vfio_cfgs[i] == NULL)
> +			break;
> +	}
> +
> +	if (i == VFIO_MAX_CONTAINERS) {
> +		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
> +		return -1;
> +	}
> +
> +	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
> +		RTE_CACHE_LINE_SIZE);
> +	if (vfio_cfgs[i] == NULL)
> +		return -ENOMEM;

Is there a specific reason why 1) dynamic allocation is used (as opposed 
to just storing a static array), and 2) DPDK memory allocation is used? 
This seems like unnecessary complication.

Even if you were to decide to allocate memory instead of having a static 
array, you'll have to register for rte_eal_cleanup() to delete any 
allocated containers on DPDK exit. But, as i said, i think it would be 
better to keep it as static array.
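
A sketch of the static-array alternative being suggested (hypothetical
code; per the v7 change log below, this is roughly what the next
revision does):

	/* containers live in a static array; a free slot is marked by
	 * vfio_container_fd == -1 instead of a NULL pointer */
	static struct vfio_config vfio_cfgs[VFIO_MAX_CONTAINERS];

	static struct vfio_config *
	get_free_vfio_cfg(void)
	{
		int i;

		/* slot 0 is reserved for the default container */
		for (i = 1; i < VFIO_MAX_CONTAINERS; i++)
			if (vfio_cfgs[i].vfio_container_fd == -1)
				return &vfio_cfgs[i];
		return NULL;
	}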

> +
> +	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
> +	vfio_cfg = vfio_cfgs[i];
> +	vfio_cfg->vfio_active_groups = 0;
> +	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
> +
> +	if (vfio_cfg->vfio_container_fd < 0) {
> +		rte_free(vfio_cfgs[i]);
> +		vfio_cfgs[i] = NULL;
> +		return -1;
> +	}
> +
> +	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
> +		vfio_cfg->vfio_groups[i].group_no = -1;
> +		vfio_cfg->vfio_groups[i].fd = -1;
> +		vfio_cfg->vfio_groups[i].devices = 0;
> +	}

<...>

> @@ -665,41 +931,80 @@ vfio_get_group_no(const char *sysfs_base,
>   }
>   
>   static int
> -vfio_type1_dma_map(int vfio_container_fd)
> +do_vfio_type1_dma_map(int vfio_container_fd, const struct rte_memseg *ms)

<...>


> +static int
> +do_vfio_type1_dma_unmap(int vfio_container_fd, const struct rte_memseg *ms)

APIs such as these two were recently added to DPDK.
  
Xiao Wang April 12, 2018, 4:07 p.m. UTC | #2
Hi Anatoly,

> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Thursday, April 12, 2018 10:04 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org; maxime.coquelin@redhat.com; Wang, Zhihong
> <zhihong.wang@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Tan, Jianfeng
> <jianfeng.tan@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Daly,
> Dan <dan.daly@intel.com>; thomas@monjalon.net; gaetan.rivet@6wind.com;
> hemant.agrawal@nxp.com; Chen, Junjie J <junjie.j.chen@intel.com>
> Subject: Re: [PATCH v6 1/4] eal/vfio: add multiple container support
>
> On 12-Apr-18 8:19 AM, Xiao Wang wrote:
> > Currently eal vfio framework binds vfio group fd to the default
> > container fd during rte_vfio_setup_device, while in some cases,
> > e.g. vDPA (vhost data path acceleration), we want to put vfio group
> > to a separate container and program IOMMU via this container.
> >
> > This patch adds some APIs to support container creating and device
> > binding with a container.
> >
> > A driver could use "rte_vfio_create_container" helper to create a
> > new container from eal, use "rte_vfio_bind_group" to bind a device
> > to the newly created container.
> >
> > During rte_vfio_setup_device, the container bound with the device
> > will be used for IOMMU setup.
> >
> > Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> > ---
>
> Apologies for late review. Some comments below.
>
> <...>
>
> >
> > +struct rte_memseg;
> > +
> >   /**
> >    * Setup vfio_cfg for the device identified by its address.
> >    * It discovers the configured I/O MMU groups or sets a new one for the device.
> > @@ -131,6 +133,117 @@ rte_vfio_clear_group(int vfio_group_fd);
> >   }
> >   #endif
> >
>
> <...>
>
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Perform dma mapping for devices in a container.
> > + *
> > + * @param container_fd
> > + *   the specified container fd
> > + *
> > + * @param dma_type
> > + *   the dma map type
> > + *
> > + * @param ms
> > + *   the dma address region to map
> > + *
> > + * @return
> > + *    0 if successful
> > + *   <0 if failed
> > + */
> > +int __rte_experimental
> > +rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms);
> > +
>
> First of all, why memseg, instead of va/iova/len? This seems like
> unnecessary attachment to internals of DPDK memory representation. Not
> all memory comes in memsegs, this makes the API unnecessarily specific
> to DPDK memory.

Agree, will use va/iova/len.

>
> Also, why providing DMA type? There's already a VFIO type pointer in
> vfio_config - you can set this pointer for every new created container,
> so the user wouldn't have to care about IOMMU type. Is it not possible
> to figure out DMA type from within EAL VFIO? If not, maybe provide an
> API to do so, e.g. rte_vfio_container_set_dma_type()?

It's possible, EAL VFIO should be able to figure out a container's DMA type.

>
> This will also need to be rebased on top of latest HEAD because there
> already is a similar DMA map/unmap API added, only without the container
> parameter. Perhaps rename these new functions to
> rte_vfio_container_(create|destroy|dma_map|dma_unmap)?

OK, will check the latest HEAD and rebase on that.

>
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Perform dma unmapping for devices in a container.
> > + *
> > + * @param container_fd
> > + *   the specified container fd
> > + *
> > + * @param dma_type
> > + *    the dma map type
> > + *
> > + * @param ms
> > + *   the dma address region to unmap
> > + *
> > + * @return
> > + *    0 if successful
> > + *   <0 if failed
> > + */
> > +int __rte_experimental
> > +rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms);
> > +
> >   #endif /* VFIO_PRESENT */
> >
>
> <...>
>
> > @@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
> >   		if (vfio_group_fd < 0) {
> >   			/* if file not found, it's not an error */
> >   			if (errno != ENOENT) {
> > -				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
> > -						strerror(errno));
> > +				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
> > +					filename, strerror(errno));
>
> This looks like unintended change.
>
> >   				return -1;
> >   			}
> >
> > @@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
> >   			vfio_group_fd = open(filename, O_RDWR);
> >   			if (vfio_group_fd < 0) {
> >   				if (errno != ENOENT) {
> > -					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
> > -							strerror(errno));
> > +					RTE_LOG(ERR, EAL,
> > +						"Cannot open %s: %s\n",
> > +						filename,
> > +						strerror(errno));
>
> This looks like unintended change.
>
> >   					return -1;
> >   				}
> >   				return 0;
> > @@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
> >   			/* noiommu group found */
> >   		}
> >
> > -		cur_grp->group_no = iommu_group_no;
> > -		cur_grp->fd = vfio_group_fd;
> > -		vfio_cfg.vfio_active_groups++;
> >   		return vfio_group_fd;
> >   	}
> > -	/* if we're in a secondary process, request group fd from the primary
> > +	/*
> > +	 * if we're in a secondary process, request group fd from the primary
> >   	 * process via our socket
> >   	 */
>
> This looks like unintended change.
>
> >   	else {
> > -		int socket_fd, ret;
> > -
> > -		socket_fd = vfio_mp_sync_connect_to_primary();
> > +		int ret;
> > +		int socket_fd = vfio_mp_sync_connect_to_primary();
> >
> >   		if (socket_fd < 0) {
> > -			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
> > +			RTE_LOG(ERR, EAL,
> > +				"  cannot connect to primary process!\n");
>
> This looks like unintended change.
>
> >   			return -1;
> >   		}
> >   		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
> > @@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
> >   			close(socket_fd);
> >   			return -1;
> >   		}
> > +
> >   		ret = vfio_mp_sync_receive_request(socket_fd);
>
> This looks like unintended change.
>
> (hint: "git revert -n HEAD && git add -p" is your friend :) )

Thanks, will remove these diffs.

>
> >   		switch (ret) {
> >   		case SOCKET_NO_FD:
> > @@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
> >   			/* if we got the fd, store it and return it */
> >   			if (vfio_group_fd > 0) {
> >   				close(socket_fd);
> > -				cur_grp->group_no = iommu_group_no;
> > -				cur_grp->fd = vfio_group_fd;
> > -				vfio_cfg.vfio_active_groups++;
> >   				return vfio_group_fd;
> >   			}
> >   			/* fall-through on error */
> > @@ -147,70 +123,349 @@ vfio_get_group_fd(int iommu_group_no)
> >   	return -1;
>
> <...>
>
> > +int __rte_experimental
> > +rte_vfio_create_container(void)
> > +{
> > +	struct vfio_config *vfio_cfg;
> > +	int i;
> > +
> > +	/* Find an empty slot to store new vfio config */
> > +	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
> > +		if (vfio_cfgs[i] == NULL)
> > +			break;
> > +	}
> > +
> > +	if (i == VFIO_MAX_CONTAINERS) {
> > +		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
> > +		return -1;
> > +	}
> > +
> > +	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
> > +		RTE_CACHE_LINE_SIZE);
> > +	if (vfio_cfgs[i] == NULL)
> > +		return -ENOMEM;
>
> Is there a specific reason why 1) dynamic allocation is used (as opposed
> to just storing a static array), and 2) DPDK memory allocation is used?
> This seems like unnecessary complication.
>
> Even if you were to decide to allocate memory instead of having a static
> array, you'll have to register for rte_eal_cleanup() to delete any
> allocated containers on DPDK exit. But, as i said, i think it would be
> better to keep it as static array.
>

Thanks for the suggestion, static array looks simpler and cleaner.

> > +
> > +	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
> > +	vfio_cfg = vfio_cfgs[i];
> > +	vfio_cfg->vfio_active_groups = 0;
> > +	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
> > +
> > +	if (vfio_cfg->vfio_container_fd < 0) {
> > +		rte_free(vfio_cfgs[i]);
> > +		vfio_cfgs[i] = NULL;
> > +		return -1;
> > +	}
> > +
> > +	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
> > +		vfio_cfg->vfio_groups[i].group_no = -1;
> > +		vfio_cfg->vfio_groups[i].fd = -1;
> > +		vfio_cfg->vfio_groups[i].devices = 0;
> > +	}
>
> <...>
>
> > @@ -665,41 +931,80 @@ vfio_get_group_no(const char *sysfs_base,
> >   }
> >
> >   static int
> > -vfio_type1_dma_map(int vfio_container_fd)
> > +do_vfio_type1_dma_map(int vfio_container_fd, const struct rte_memseg *ms)
>
> <...>
>
> > +static int
> > +do_vfio_type1_dma_unmap(int vfio_container_fd, const struct rte_memseg *ms)
>
> APIs such as these two were recently added to DPDK.

Will check and rebase.

BRs,
Xiao

>
> --
> Thanks,
> Anatoly
  
Anatoly Burakov April 12, 2018, 4:24 p.m. UTC | #3
On 12-Apr-18 5:07 PM, Wang, Xiao W wrote:
> Hi Anatoly,
> 

<...>

>>
>> Also, why providing DMA type? There's already a VFIO type pointer in
>> vfio_config - you can set this pointer for every new created container,
>> so the user wouldn't have to care about IOMMU type. Is it not possible
>> to figure out DMA type from within EAL VFIO? If not, maybe provide an
>> API to do so, e.g. rte_vfio_container_set_dma_type()?
> 
> It's possible, EAL VFIO should be able to figure out a container's DMA type.

You probably won't be able to do it until you add a group into the 
container, so probably best place to do it would be on group_bind?
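
I.e., something along these lines at the end of the bind path (a
hypothetical sketch; the name of the cached type pointer in vfio_config
is assumed, and vfio_set_iommu_type() is the existing eal_vfio.c helper):

	/* once a group has been added to the container, the IOMMU type
	 * can be probed and cached in the per-container vfio_config */
	if (vfio_cfg->vfio_iommu_type == NULL)
		vfio_cfg->vfio_iommu_type =
			vfio_set_iommu_type(vfio_cfg->vfio_container_fd);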
  
Xiao Wang April 13, 2018, 9:18 a.m. UTC | #4
> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Friday, April 13, 2018 12:24 AM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org; maxime.coquelin@redhat.com; Wang, Zhihong
> <zhihong.wang@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Tan, Jianfeng
> <jianfeng.tan@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Daly,
> Dan <dan.daly@intel.com>; thomas@monjalon.net; gaetan.rivet@6wind.com;
> hemant.agrawal@nxp.com; Chen, Junjie J <junjie.j.chen@intel.com>
> Subject: Re: [PATCH v6 1/4] eal/vfio: add multiple container support
>
> On 12-Apr-18 5:07 PM, Wang, Xiao W wrote:
> > Hi Anatoly,
> >
>
> <...>
>
> >>
> >> Also, why providing DMA type? There's already a VFIO type pointer in
> >> vfio_config - you can set this pointer for every new created container,
> >> so the user wouldn't have to care about IOMMU type. Is it not possible
> >> to figure out DMA type from within EAL VFIO? If not, maybe provide an
> >> API to do so, e.g. rte_vfio_container_set_dma_type()?
> >
> > It's possible, EAL VFIO should be able to figure out a container's DMA type.
>
> You probably won't be able to do it until you add a group into the
> container, so probably best place to do it would be on group_bind?

Yes, the IOMMU type pointer could be set when group binding.

BRs,
Xiao

>
> --
> Thanks,
> Anatoly
  
Xiao Wang April 15, 2018, 3:33 p.m. UTC | #5
IFCVF driver
============
The IFCVF vDPA (vhost data path acceleration) driver provides support for the
Intel FPGA 100G VF (IFCVF). IFCVF's datapath is virtio-ring compatible; it
works as a HW vhost backend which can send/receive packets to/from virtio
directly by DMA. It also supports dirty page logging and device state
report/restore. This driver enables its vDPA functionality together with the
live migration feature.

vDPA mode
=========
IFCVF's vendor ID and device ID are the same as those of the virtio net PCI
device, but with its own subsystem vendor ID and device ID. To let the device
be probed by the IFCVF driver, the "vdpa=1" devarg specifies that this device
is to be used in vDPA mode rather than polling mode; the virtio PMD will skip
the device when it detects this devarg.
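
For example (hypothetical PCI address), the device would be whitelisted with
the devarg appended:

	-w 0000:06:00.3,vdpa=1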

Container per device
====================
vDPA needs to create different containers for different devices, so this
patch set adds APIs in eal/vfio to support multiple containers, e.g.
- rte_vfio_create_container
- rte_vfio_destroy_container
- rte_vfio_bind_group
- rte_vfio_unbind_group

With this extension, a device can be put into a specific, newly created
container, rather than the default container as before.
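
A minimal per-device sketch using these APIs (hypothetical flow; the
dma_map/unmap signatures follow the v7 change log below, and the VM memory
region variables are assumed):

	int container_fd = rte_vfio_create_container();
	int group_fd = rte_vfio_bind_group(container_fd, iommu_group_no);

	/* map each VM memory region for the device's DMA */
	rte_vfio_container_dma_map(container_fd, region_va, region_iova,
			region_len);

	/* ... on teardown ... */
	rte_vfio_container_dma_unmap(container_fd, region_va, region_iova,
			region_len);
	rte_vfio_unbind_group(container_fd, iommu_group_no);
	rte_vfio_destroy_container(container_fd);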

IFCVF vDPA details
==================
Key vDPA driver ops implemented:
- ifcvf_dev_config:
  Enable the VF data path with the virtio information provided by the vhost
  lib, including: IOMMU programming to enable VF DMA to the VM's memory, VFIO
  interrupt setup to route HW interrupts to the virtio driver, creation of a
  notify relay thread to translate the virtio driver's kicks into MMIO writes
  onto the HW, and HW queue configuration.

  This function gets called to set up the HW data path backend when the
  virtio driver in the VM gets ready.

- ifcvf_dev_close:
  Revoke all the setup in ifcvf_dev_config.

  This function gets called when the virtio driver stops the device in the VM.
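
Sketched as an op table (the ops-struct and field names are assumed from the
in-flight vDPA lib patches this series is rebased on):

	static struct rte_vdpa_dev_ops ifcvf_ops = {
		.dev_conf = ifcvf_dev_config,	/* set up HW datapath */
		.dev_close = ifcvf_dev_close,	/* revoke the setup */
		/* ... feature/queue-number queries, vring state, etc. */
	};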

Change log
==========
v7:
- Rebase on HEAD.
- Split the vfio patch into 2 parts, one for data structure extension, one for
  adding new API.
- Use a static vfio_config array instead of dynamically allocating them.
- Change rte_vfio_container_dma_map/unmap's parameters to use (va, iova, len).

v6:
- Rebase on master branch.
- Document "vdpa" devarg in virtio documentation.
- Rename ifcvf config option to CONFIG_RTE_LIBRTE_IFCVF_VDPA_PMD for
  consistency, and add it to the driver documentation.
- Add comments for ifcvf device ID.
- Minor code cleaning.

v5:
- Fix compilation on BSD; remove the rte_vfio.h include on BSD.

v4:
- Rebase on Zhihong's latest vDPA lib patch, with vDPA ops names change.
- Remove the API "rte_vfio_get_group_fd"; "rte_vfio_bind_group" will return the fd.
- Align the vfio_cfg search internal APIs naming.

v3:
- Add doc and release note for the new driver.
- Remove the vdev concept and make the driver a PCI driver; it will get
  probed by the PCI bus driver.
- Rebase on the v4 vDPA lib patch, register a vDPA device instead of an engine.
- Remove the PCI API exposure accordingly.
- Move the MAX_VFIO_CONTAINERS definition to config file.
- Let the virtio PMD skip probing when a virtio device needs to work in vDPA
  mode.

v2:
- Rename function pci_get_kernel_driver_by_path to rte_pci_device_kdriver_name
  to make the API generic across Linux and BSD, and mark it as EXPERIMENTAL.
- Rebase on Zhihong's vDPA v3 patch set.
- Minor code cleanup on vfio extension.


Xiao Wang (5):
  vfio: extend data structure for multi container
  vfio: add multi container support
  net/virtio: skip device probe in vdpa mode
  net/ifcvf: add ifcvf vdpa driver
  doc: add ifcvf driver document and release note

 config/common_base                       |   8 +
 config/common_linuxapp                   |   1 +
 doc/guides/nics/features/ifcvf.ini       |   8 +
 doc/guides/nics/ifcvf.rst                |  98 ++++
 doc/guides/nics/index.rst                |   1 +
 doc/guides/nics/virtio.rst               |  13 +
 doc/guides/rel_notes/release_18_05.rst   |   9 +
 drivers/net/Makefile                     |   3 +
 drivers/net/ifc/Makefile                 |  36 ++
 drivers/net/ifc/base/ifcvf.c             | 329 ++++++++++++
 drivers/net/ifc/base/ifcvf.h             | 160 ++++++
 drivers/net/ifc/base/ifcvf_osdep.h       |  52 ++
 drivers/net/ifc/ifcvf_vdpa.c             | 842 +++++++++++++++++++++++++++++++
 drivers/net/ifc/rte_ifcvf_version.map    |   4 +
 drivers/net/virtio/virtio_ethdev.c       |  43 ++
 lib/librte_eal/bsdapp/eal/eal.c          |  52 ++
 lib/librte_eal/common/include/rte_vfio.h | 119 +++++
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 723 +++++++++++++++++++++-----
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |  19 +-
 lib/librte_eal/rte_eal_version.map       |   6 +
 mk/rte.app.mk                            |   3 +
 21 files changed, 2377 insertions(+), 152 deletions(-)
 create mode 100644 doc/guides/nics/features/ifcvf.ini
 create mode 100644 doc/guides/nics/ifcvf.rst
 create mode 100644 drivers/net/ifc/Makefile
 create mode 100644 drivers/net/ifc/base/ifcvf.c
 create mode 100644 drivers/net/ifc/base/ifcvf.h
 create mode 100644 drivers/net/ifc/base/ifcvf_osdep.h
 create mode 100644 drivers/net/ifc/ifcvf_vdpa.c
 create mode 100644 drivers/net/ifc/rte_ifcvf_version.map
  

Patch

diff --git a/config/common_base b/config/common_base
index c09c7cf88..90c2821ae 100644
--- a/config/common_base
+++ b/config/common_base
@@ -74,6 +74,7 @@  CONFIG_RTE_EAL_ALWAYS_PANIC_ON_ERROR=n
 CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_EAL_VFIO=n
 CONFIG_RTE_MAX_VFIO_GROUPS=64
+CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index 4eafcb5ad..0a3d8783d 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -746,6 +746,14 @@  int rte_vfio_enable(const char *modname);
 int rte_vfio_is_enabled(const char *modname);
 int rte_vfio_noiommu_is_enabled(void);
 int rte_vfio_clear_group(int vfio_group_fd);
+int rte_vfio_create_container(void);
+int rte_vfio_destroy_container(int container_fd);
+int rte_vfio_bind_group(int container_fd, int iommu_group_no);
+int rte_vfio_unbind_group(int container_fd, int iommu_group_no);
+int rte_vfio_dma_map(int container_fd, int dma_type,
+		const struct rte_memseg *ms);
+int rte_vfio_dma_unmap(int container_fd, int dma_type,
+		const struct rte_memseg *ms);
 
 int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
 		      __rte_unused const char *dev_addr,
@@ -781,3 +789,45 @@  int rte_vfio_clear_group(__rte_unused int vfio_group_fd)
 {
 	return 0;
 }
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_map(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_unmap(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index 249095e46..9bb026703 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -32,6 +32,8 @@ 
 extern "C" {
 #endif
 
+struct rte_memseg;
+
 /**
  * Setup vfio_cfg for the device identified by its address.
  * It discovers the configured I/O MMU groups or sets a new one for the device.
@@ -131,6 +133,117 @@  rte_vfio_clear_group(int vfio_group_fd);
 }
 #endif
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container for device binding.
+ *
+ * @return
+ *   the container fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_create_container(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container, unbind all vfio groups within it.
+ *
+ * @param container_fd
+ *   the container fd to destroy
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind an IOMMU group to a container.
+ *
+ * @param container_fd
+ *   the container's fd
+ *
+ * @param iommu_group_no
+ *   the iommu_group_no to bind to container
+ *
+ * @return
+ *   group fd if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind an IOMMU group from a container.
+ *
+ * @param container_fd
+ *   the container's fd
+ *
+ * @param iommu_group_no
+ *   the iommu_group_no to delete from container
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform dma mapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param dma_type
+ *   the dma map type
+ *
+ * @param ms
+ *   the dma address region to map
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform dma unmapping for devices in a container.
+ *
+ * @param container_fd
+ *   the specified container fd
+ *
+ * @param dma_type
+ *    the dma map type
+ *
+ * @param ms
+ *   the dma address region to unmap
+ *
+ * @return
+ *    0 if successful
+ *   <0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms);
+
 #endif /* VFIO_PRESENT */
 
 #endif /* _RTE_VFIO_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index e44ae4d04..e474f6e9f 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -9,6 +9,7 @@ 
 
 #include <rte_log.h>
 #include <rte_memory.h>
+#include <rte_malloc.h>
 #include <rte_eal_memconfig.h>
 #include <rte_vfio.h>
 
@@ -19,7 +20,9 @@ 
 #ifdef VFIO_PRESENT
 
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config default_vfio_cfg;
+
+static struct vfio_config *vfio_cfgs[VFIO_MAX_CONTAINERS] = {&default_vfio_cfg};
 
 static int vfio_type1_dma_map(int);
 static int vfio_spapr_dma_map(int);
@@ -35,38 +38,13 @@  static const struct vfio_iommu_type iommu_types[] = {
 	{ RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map},
 };
 
-int
-vfio_get_group_fd(int iommu_group_no)
+static int
+vfio_open_group_fd(int iommu_group_no)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == iommu_group_no)
-			return vfio_cfg.vfio_groups[i].fd;
-
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
 
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
-	/* if primary, try to open the group */
+	/* if in primary process, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
 		snprintf(filename, sizeof(filename),
@@ -75,8 +53,8 @@  vfio_get_group_fd(int iommu_group_no)
 		if (vfio_group_fd < 0) {
 			/* if file not found, it's not an error */
 			if (errno != ENOENT) {
-				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-						strerror(errno));
+				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+					filename, strerror(errno));
 				return -1;
 			}
 
@@ -86,8 +64,10 @@  vfio_get_group_fd(int iommu_group_no)
 			vfio_group_fd = open(filename, O_RDWR);
 			if (vfio_group_fd < 0) {
 				if (errno != ENOENT) {
-					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-							strerror(errno));
+					RTE_LOG(ERR, EAL,
+						"Cannot open %s: %s\n",
+						filename,
+						strerror(errno));
 					return -1;
 				}
 				return 0;
@@ -95,21 +75,19 @@  vfio_get_group_fd(int iommu_group_no)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_no = iommu_group_no;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
-	/* if we're in a secondary process, request group fd from the primary
+	/*
+	 * if we're in a secondary process, request group fd from the primary
 	 * process via our socket
 	 */
 	else {
-		int socket_fd, ret;
-
-		socket_fd = vfio_mp_sync_connect_to_primary();
+		int ret;
+		int socket_fd = vfio_mp_sync_connect_to_primary();
 
 		if (socket_fd < 0) {
-			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
+			RTE_LOG(ERR, EAL,
+				"  cannot connect to primary process!\n");
 			return -1;
 		}
 		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
@@ -122,6 +100,7 @@  vfio_get_group_fd(int iommu_group_no)
 			close(socket_fd);
 			return -1;
 		}
+
 		ret = vfio_mp_sync_receive_request(socket_fd);
 		switch (ret) {
 		case SOCKET_NO_FD:
@@ -132,9 +111,6 @@  vfio_get_group_fd(int iommu_group_no)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_no = iommu_group_no;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -147,70 +123,349 @@  vfio_get_group_fd(int iommu_group_no)
 	return -1;
 }
 
+static struct vfio_config *
+get_vfio_cfg_by_group_fd(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
+
+	return &default_vfio_cfg;
+}
+
+static struct vfio_config *
+get_vfio_cfg_by_group_no(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return vfio_cfg;
+		}
+	}
+
+	return &default_vfio_cfg;
+}
 
 static int
-get_vfio_group_idx(int vfio_group_fd)
+get_container_idx(int container_fd)
 {
 	int i;
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		if (vfio_cfgs[i]->vfio_container_fd == container_fd)
 			return i;
+	}
+
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	struct vfio_config *vfio_cfg;
+	int i;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i] == NULL)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
+		RTE_CACHE_LINE_SIZE);
+	if (vfio_cfgs[i] == NULL)
+		return -ENOMEM;
+
+	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
+	vfio_cfg = vfio_cfgs[i];
+	vfio_cfg->vfio_active_groups = 0;
+	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
+
+	if (vfio_cfg->vfio_container_fd < 0) {
+		rte_free(vfio_cfgs[i]);
+		vfio_cfgs[i] = NULL;
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+	}
+
+	return vfio_cfg->vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, idx;
+
+	idx = get_container_idx(container_fd);
+	if (idx < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[idx];
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no != -1)
+			rte_vfio_unbind_group(container_fd,
+				vfio_cfg->vfio_groups[i].group_no);
+
+	rte_free(vfio_cfgs[idx]);
+	vfio_cfgs[idx] = NULL;
+	close(container_fd);
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_bind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	i = get_container_idx(container_fd);
+	if (i < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[i];
+	/* Check room for new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	i = get_container_idx(container_fd);
+	if (i < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	vfio_cfg = vfio_cfgs[i];
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (cur_grp->fd >= 0 && close(cur_grp->fd) < 0) {
+		RTE_LOG(ERR, EAL, "Error when closing vfio_group_fd for"
+				" iommu_group_no %d\n",
+			iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = -1;
+	cur_grp->fd = -1;
+	vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int
+vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+	int vfio_group_fd;
+	int i;
+
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+
+	/* check if we already have the group descriptor open */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+		}
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 int
 rte_vfio_clear_group(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 	int socket_fd, ret;
 
+	vfio_cfg = get_vfio_cfg_by_group_fd(vfio_group_fd);
+
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
-		if (i < 0)
+		if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
+			RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 			return -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		}
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -261,6 +516,8 @@  rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	struct vfio_config *vfio_cfg;
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_no;
 	int ret;
@@ -309,12 +566,14 @@  rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	vfio_cfg = get_vfio_cfg_by_group_no(iommu_group_no);
+	vfio_container_fd = vfio_cfg->vfio_container_fd;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
-
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -331,11 +590,12 @@  rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * Note this can happen several times with the hotplug
 		 * functionality.
 		 */
+
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1) {
+				vfio_cfg->vfio_active_groups == 1) {
 			/* select an IOMMU type which we will be using */
 			const struct vfio_iommu_type *t =
-				vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+				vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -344,7 +604,13 @@  rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				rte_vfio_clear_group(vfio_group_fd);
 				return -1;
 			}
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			/* DMA map for the default container only. */
+			if (default_vfio_cfg.vfio_container_fd ==
+				vfio_container_fd)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
+
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -388,7 +654,7 @@  rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 
 int
 rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
-		    int vfio_dev_fd)
+			int vfio_dev_fd)
 {
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
@@ -456,9 +722,9 @@  rte_vfio_enable(const char *modname)
 	int vfio_available;
 
 	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+		default_vfio_cfg.vfio_groups[i].fd = -1;
+		default_vfio_cfg.vfio_groups[i].group_no = -1;
+		default_vfio_cfg.vfio_groups[i].devices = 0;
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -480,12 +746,12 @@  rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = vfio_get_container_fd();
+	default_vfio_cfg.vfio_container_fd = vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg.vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg.vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -497,7 +763,7 @@  int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg.vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -665,41 +931,80 @@  vfio_get_group_no(const char *sysfs_base,
 }
 
 static int
-vfio_type1_dma_map(int vfio_container_fd)
+do_vfio_type1_dma_map(int vfio_container_fd, const struct rte_memseg *ms)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-	int i, ret;
+	int ret;
+	struct vfio_iommu_type1_dma_map dma_map;
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		struct vfio_iommu_type1_dma_map dma_map;
+	memset(&dma_map, 0, sizeof(dma_map));
+	dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+	dma_map.vaddr = ms->addr_64;
+	dma_map.size = ms->len;
 
-		if (ms[i].addr == NULL)
-			break;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_map.iova = dma_map.vaddr;
+	else
+		dma_map.iova = ms->iova;
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
 
-		memset(&dma_map, 0, sizeof(dma_map));
-		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-		dma_map.vaddr = ms[i].addr_64;
-		dma_map.size = ms[i].len;
-		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			dma_map.iova = dma_map.vaddr;
-		else
-			dma_map.iova = ms[i].iova;
-		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot set up DMA remapping, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
+	}
 
-		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	return 0;
+}
 
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-					  "error %i (%s)\n", errno,
-					  strerror(errno));
-			return -1;
-		}
+static int
+do_vfio_type1_dma_unmap(int vfio_container_fd, const struct rte_memseg *ms)
+{
+	int ret;
+	struct vfio_iommu_type1_dma_unmap dma_unmap;
+
+	memset(&dma_unmap, 0, sizeof(dma_unmap));
+	dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+	dma_unmap.size = ms->len;
+
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_unmap.iova = ms->addr_64;
+	else
+		dma_unmap.iova = ms->iova;
+	dma_unmap.flags = 0;
+
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot unmap DMA, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
 	}
 
 	return 0;
 }
 
+static int
+vfio_type1_dma_map(int vfio_container_fd)
+{
+	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
+		if (ms[i].addr == NULL)
+			break;
+		ret = do_vfio_type1_dma_map(vfio_container_fd, &ms[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	return ret;
+}
+
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
@@ -843,4 +1148,37 @@  rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int __rte_experimental
+rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms)
+{
+
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_map(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma map for SPAPR type not supported yet.\n");
+		return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms)
+{
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_unmap(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma unmap for SPAPR type not supported yet.\n");
+		return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index 80595773e..23a1e3608 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -86,6 +86,7 @@  struct vfio_iommu_spapr_tce_info {
 #endif
 
 #define VFIO_MAX_GROUPS RTE_MAX_VFIO_GROUPS
+#define VFIO_MAX_CONTAINERS RTE_MAX_VFIO_CONTAINERS
 
 /*
  * Function prototypes for VFIO multiprocess sync functions
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index dd38783a2..a62833ed1 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -258,5 +258,11 @@  EXPERIMENTAL {
 	rte_service_start_with_defaults;
 	rte_socket_count;
 	rte_socket_id_by_idx;
+	rte_vfio_bind_group;
+	rte_vfio_create_container;
+	rte_vfio_destroy_container;
+	rte_vfio_dma_map;
+	rte_vfio_dma_unmap;
+	rte_vfio_unbind_group;
 
 } DPDK_18.02;