[v3] vhost: enable IOMMU for async vhost

Message ID 20210603173023.10487-1-xuan.ding@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Maxime Coquelin
Series: [v3] vhost: enable IOMMU for async vhost

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-intel-Functional fail Functional Testing issues
ci/iol-abi-testing warning Testing issues
ci/github-robot success github build: passed
ci/iol-testing fail Testing issues
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS

Commit Message

Ding, Xuan June 3, 2021, 5:30 p.m. UTC
  From: Xuan Ding <xuan.ding@intel.com>

For async copy, it is unsafe to directly use the physical address.
In addition, the current address translation from GPA to HPA via SW
costs CPU cycles. Both issues can benefit from the IOMMU.

Since the existing DMA engine supports using the platform IOMMU,
this patch enables IOMMU for async vhost and defines IOAT
devices to use virtual addresses instead of physical addresses.

When the memory table is set, the frontend's memory is mapped into
the default DPDK container, to which the IOAT devices have been
added. When a DMA copy fails, the virtual address provided to the
IOAT devices also allows us to fall back to a SW copy or a PA copy.

With IOMMU enabled, to use IOAT devices:
1. IOAT devices must be bound to vfio-pci, rather than igb_uio.
2. DPDK must use "--iova-mode=va".

Signed-off-by: Xuan Ding <xuan.ding@intel.com>
---

v3:
* Fixed some typos.

v2:
* Fixed a format issue.
* Added the dma unmap logic when device is closed.
---
 doc/guides/prog_guide/vhost_lib.rst |  20 +++++
 lib/vhost/vhost_user.c              | 125 +++++++++-------------------
 lib/vhost/virtio_net.c              |  30 +++----
 3 files changed, 69 insertions(+), 106 deletions(-)
  

Comments

Maxime Coquelin June 18, 2021, 4:17 p.m. UTC | #1
Hi Xuan,

On 6/3/21 7:30 PM, xuan.ding@intel.com wrote:
> From: Xuan Ding <xuan.ding@intel.com>
> 
> For async copy, it is unsafe to directly use the physical address.
> And current address translation from GPA to HPA via SW also takes
> CPU cycles, these can all benefit from IOMMU.
> 
> Since the existing DMA engine supports to use platform IOMMU,
> this patch enables IOMMU for async vhost, which defines IOAT
> devices to use virtual address instead of physical address.

We have to keep in mind that a generic DMA API is coming, and maybe we want
a SW implementation of a dmadev based on memcpy, at least for
testing/debugging purposes.

> When set memory table, the frontend's memory will be mapped
> to the default container of DPDK where IOAT devices have been
> added into. When DMA copy fails, the virtual address provided
> to IOAT devices also allow us fallback to SW copy or PA copy.
> 
> With IOMMU enabled, to use IOAT devices:
> 1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
> 2. DPDK must use "--iova-mode=va".

I think this is problematic: at least we need to check that the right IOVA
mode has been selected, but even with that check it is limiting.

What prevents us from reusing the add_guest_pages() algorithm to implement
IOVA_AS_PA?

> 
> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
> ---
> 
> v3:
> * Fixed some typos.
> 
> v2:
> * Fixed a format issue.
> * Added the dma unmap logic when device is closed.
> ---
>  doc/guides/prog_guide/vhost_lib.rst |  20 +++++
>  lib/vhost/vhost_user.c              | 125 +++++++++-------------------
>  lib/vhost/virtio_net.c              |  30 +++----
>  3 files changed, 69 insertions(+), 106 deletions(-)
> 
> diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
> index d18fb98910..5777f0da96 100644
> --- a/doc/guides/prog_guide/vhost_lib.rst
> +++ b/doc/guides/prog_guide/vhost_lib.rst
> @@ -420,3 +420,23 @@ Finally, a set of device ops is defined for device specific operations:
>  * ``get_notify_area``
>  
>    Called to get the notify area info of the queue.
> +
> +Vhost async data path
> +---------------------
> +
> +* Address mode
> +
> +  Modern IOAT devices support using the IOMMU, which avoids using
> +  the unsafe HPA. Besides, the CPU cycles taken by SW to translate from
> +  GPA to HPA can also be saved. So IOAT devices are defined to use
> +  virtual addresses instead of physical addresses.
> +
> +  With IOMMU enabled, to use IOAT devices:
> +  1. IOAT devices must be bound to vfio-pci, rather than igb_uio.
> +  2. DPDK must use ``--iova-mode=va``.
> +
> +* Fallback
> +
> +  When the DMA copy fails, the user who implements the transfer_data
> +  callback can fall back to a SW copy, or fall back to a PA copy through
> +  rte_mem_virt2iova().
> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> index 8f0eba6412..c33fa784ff 100644
> --- a/lib/vhost/vhost_user.c
> +++ b/lib/vhost/vhost_user.c
> @@ -45,6 +45,7 @@
>  #include <rte_common.h>
>  #include <rte_malloc.h>
>  #include <rte_log.h>
> +#include <rte_vfio.h>
>  
>  #include "iotlb.h"
>  #include "vhost.h"
> @@ -141,6 +142,34 @@ get_blk_size(int fd)
>  	return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
>  }
>  
> +static int
> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> +{
> +	int ret = 0;
> +	if (do_map) {
> +		/* Add mapped region into the default container of DPDK. */
> +		ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> +						 region->host_user_addr,
> +						 region->host_user_addr,
> +						 region->size);
> +		if (ret) {
> +			VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> +			return ret;
> +		}
> +	} else {
> +		/* Remove mapped region from the default container of DPDK. */
> +		ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> +						   region->host_user_addr,
> +						   region->host_user_addr,
> +						   region->size);
> +		if (ret) {
> +			VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
> +			return ret;
> +		}
> +	}
> +	return ret;
> +}
> +
>  static void
>  free_mem_region(struct virtio_net *dev)
>  {
> @@ -155,6 +184,9 @@ free_mem_region(struct virtio_net *dev)
>  		if (reg->host_user_addr) {
>  			munmap(reg->mmap_addr, reg->mmap_size);
>  			close(reg->fd);
> +
> +			if (dev->async_copy)
> +				async_dma_map(reg, false);
>  		}
>  	}
>  }
> @@ -866,87 +898,6 @@ vhost_user_set_vring_base(struct virtio_net **pdev,
>  	return RTE_VHOST_MSG_RESULT_OK;
>  }
>  
> -static int
> -add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
> -		   uint64_t host_phys_addr, uint64_t size)
> -{
> -	struct guest_page *page, *last_page;
> -	struct guest_page *old_pages;
> -
> -	if (dev->nr_guest_pages == dev->max_guest_pages) {
> -		dev->max_guest_pages *= 2;
> -		old_pages = dev->guest_pages;
> -		dev->guest_pages = rte_realloc(dev->guest_pages,
> -					dev->max_guest_pages * sizeof(*page),
> -					RTE_CACHE_LINE_SIZE);
> -		if (dev->guest_pages == NULL) {
> -			VHOST_LOG_CONFIG(ERR, "cannot realloc guest_pages\n");
> -			rte_free(old_pages);
> -			return -1;
> -		}
> -	}
> -
> -	if (dev->nr_guest_pages > 0) {
> -		last_page = &dev->guest_pages[dev->nr_guest_pages - 1];
> -		/* merge if the two pages are continuous */
> -		if (host_phys_addr == last_page->host_phys_addr +
> -				      last_page->size) {
> -			last_page->size += size;
> -			return 0;
> -		}
> -	}
> -
> -	page = &dev->guest_pages[dev->nr_guest_pages++];
> -	page->guest_phys_addr = guest_phys_addr;
> -	page->host_phys_addr  = host_phys_addr;
> -	page->size = size;
> -
> -	return 0;
> -}
> -
> -static int
> -add_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region *reg,
> -		uint64_t page_size)
> -{
> -	uint64_t reg_size = reg->size;
> -	uint64_t host_user_addr  = reg->host_user_addr;
> -	uint64_t guest_phys_addr = reg->guest_phys_addr;
> -	uint64_t host_phys_addr;
> -	uint64_t size;
> -
> -	host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)host_user_addr);
> -	size = page_size - (guest_phys_addr & (page_size - 1));
> -	size = RTE_MIN(size, reg_size);
> -
> -	if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size) < 0)
> -		return -1;
> -
> -	host_user_addr  += size;
> -	guest_phys_addr += size;
> -	reg_size -= size;
> -
> -	while (reg_size > 0) {
> -		size = RTE_MIN(reg_size, page_size);
> -		host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)
> -						  host_user_addr);
> -		if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr,
> -				size) < 0)
> -			return -1;
> -
> -		host_user_addr  += size;
> -		guest_phys_addr += size;
> -		reg_size -= size;
> -	}
> -
> -	/* sort guest page array if over binary search threshold */
> -	if (dev->nr_guest_pages >= VHOST_BINARY_SEARCH_THRESH) {
> -		qsort((void *)dev->guest_pages, dev->nr_guest_pages,
> -			sizeof(struct guest_page), guest_page_addrcmp);
> -	}
> -
> -	return 0;
> -}
> -
>  #ifdef RTE_LIBRTE_VHOST_DEBUG
>  /* TODO: enable it only in debug mode? */
>  static void
> @@ -1105,6 +1056,7 @@ vhost_user_mmap_region(struct virtio_net *dev,
>  	uint64_t mmap_size;
>  	uint64_t alignment;
>  	int populate;
> +	int ret;
>  
>  	/* Check for memory_size + mmap_offset overflow */
>  	if (mmap_offset >= -region->size) {
> @@ -1158,12 +1110,13 @@ vhost_user_mmap_region(struct virtio_net *dev,
>  	region->mmap_size = mmap_size;
>  	region->host_user_addr = (uint64_t)(uintptr_t)mmap_addr + mmap_offset;
>  
> -	if (dev->async_copy)
> -		if (add_guest_pages(dev, region, alignment) < 0) {
> -			VHOST_LOG_CONFIG(ERR,
> -					"adding guest pages to region failed.\n");
> +	if (dev->async_copy) {
> +		ret = async_dma_map(region, true);
> +		if (ret) {
> +			VHOST_LOG_CONFIG(ERR, "Configure IOMMU for DMA engine failed\n");
>  			return -1;

Maybe we're too late in the init already, but I would think we may want
to fall back to a SW implementation instead.

> -		}
> +		}
> +	}
>  
>  	VHOST_LOG_CONFIG(INFO,
>  			"guest memory region size: 0x%" PRIx64 "\n"
> diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> index 8da8a86a10..88110d2cb3 100644
> --- a/lib/vhost/virtio_net.c
> +++ b/lib/vhost/virtio_net.c
> @@ -980,11 +980,9 @@ async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
>  	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
>  	int error = 0;
> -	uint64_t mapped_len;
>  
>  	uint32_t tlen = 0;
>  	int tvec_idx = 0;
> -	void *hpa;
>  
>  	if (unlikely(m == NULL)) {
>  		error = -1;
> @@ -1074,27 +1072,19 @@ async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  
>  		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
>  
> -		while (unlikely(cpy_len && cpy_len >= cpy_threshold)) {
> -			hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev,
> -					buf_iova + buf_offset,
> -					cpy_len, &mapped_len);
> -
> -			if (unlikely(!hpa || mapped_len < cpy_threshold))
> -				break;
> -
> +		if (unlikely(cpy_len >= cpy_threshold)) {
>  			async_fill_vec(src_iovec + tvec_idx,
> -				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
> -				mbuf_offset), (size_t)mapped_len);
> +				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset), (size_t)cpy_len);
>  
>  			async_fill_vec(dst_iovec + tvec_idx,
> -					hpa, (size_t)mapped_len);
> -
> -			tlen += (uint32_t)mapped_len;
> -			cpy_len -= (uint32_t)mapped_len;
> -			mbuf_avail  -= (uint32_t)mapped_len;
> -			mbuf_offset += (uint32_t)mapped_len;
> -			buf_avail  -= (uint32_t)mapped_len;
> -			buf_offset += (uint32_t)mapped_len;
> +				(void *)((uintptr_t)(buf_addr + buf_offset)), (size_t)cpy_len);
> +
> +			tlen += cpy_len;
> +			mbuf_avail  -= cpy_len;
> +			mbuf_offset += cpy_len;
> +			buf_avail  -= cpy_len;
> +			buf_offset += cpy_len;
> +			cpy_len = 0;
>  			tvec_idx++;
>  		}
>  
>
  
Hu, Jiayu June 21, 2021, 3:57 a.m. UTC | #2
Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Saturday, June 19, 2021 12:18 AM
> To: Ding, Xuan <xuan.ding@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>
> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
> <sunil.pai.g@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> Van Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong
> <yong.liu@intel.com>
> Subject: Re: [PATCH v3] vhost: enable IOMMU for async vhost
> 
> Hi Xuan,
> 
> On 6/3/21 7:30 PM, xuan.ding@intel.com wrote:
> > From: Xuan Ding <xuan.ding@intel.com>
> >
> > For async copy, it is unsafe to directly use the physical address.
> > And current address translation from GPA to HPA via SW also takes CPU
> > cycles, these can all benefit from IOMMU.
> >
> > Since the existing DMA engine supports to use platform IOMMU, this
> > patch enables IOMMU for async vhost, which defines IOAT devices to use
> > virtual address instead of physical address.
> 
> We have to keep in mind a generic DMA api is coming, and maybe we want a
> SW implementation of a dmadev based on memcpy at least for
> testing/debugging purpose.

Agree, we need to support SW fallback, and I think this is also what this
patch wants to do. Originally, vhost passes IOVA to the DMA callbacks; if
DPDK is in PA mode, we cannot fall back to a SW copy. In this patch, vhost
passes the VA of both the pktmbuf and the guest's buffer to the DMA
callbacks, which makes SW fallback possible.

In terms of the generic DMA API, whether it uses VA or IOVA as buffer
addresses, I think this design can work, as DMA callback implementations
can do address translation anyway.
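
To illustrate how passing VAs enables the SW fallback mentioned above, here
is a minimal sketch of a callback-side fallback. The descriptor type and
field names below are hypothetical and used only for illustration; they do
not claim to match the exact rte_vhost_async structures.

#include <string.h>
#include <sys/uio.h>

/* Hypothetical descriptor: iovec-style source/destination lists that
 * already hold the virtual addresses handed over by vhost. */
struct copy_desc {
	struct iovec *src;
	struct iovec *dst;
	unsigned int nr_segs;
};

/* Plain CPU copy: always valid once both sides are virtual addresses. */
static void
sw_copy_fallback(struct copy_desc *desc)
{
	unsigned int i;

	for (i = 0; i < desc->nr_segs; i++)
		memcpy(desc->dst[i].iov_base, desc->src[i].iov_base,
		       desc->src[i].iov_len);
}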

> 
> > When set memory table, the frontend's memory will be mapped to the
> > default container of DPDK where IOAT devices have been added into.
> > When DMA copy fails, the virtual address provided to IOAT devices also
> > allow us fallback to SW copy or PA copy.
> >
> > With IOMMU enabled, to use IOAT devices:
> > 1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
> > 2. DPDK must use "--iova-mode=va".
> 
> I think this is problematic, at least we need to check the right iova mode has
> been selected, but even with doing that it is limiting.
> 
> What prevent us to reuse add_guest_pages() alogrithm to implement
> IOVA_AS_PA?

In the original design, vfio doesn't work, as vhost doesn't program the
IOMMU table with the guest's memory. Specifically, if DPDK is in VA mode,
the IOVA passed to the DMA callback is a VA, but the IOMMU cannot find the
corresponding PA for guest buffers; if DPDK is in PA mode, the IOVA passed
to the DMA callback is a PA. In this case, there are random errors for
guest buffers when VT-d is enabled, as the IOMMU behavior is uncertain. I
think supporting vfio is one of the reasons for this patch.

One concern about this patch is how to handle the case where IOVA is PA. If
IOVA is PA, the IOMMU cannot find the correct PA for the pktmbuf via the VA
passed by vhost. But can the DMA callback translate VA to PA before calling
the ioat/dmadev API? IMHO, IOVA as PA with vfio is not a recommended
configuration. Do you think it's a must for vhost to support this case?
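
As a sketch of what such a translation in the callback could look like:
rte_mem_virt2iova() and RTE_BAD_IOVA are real DPDK definitions, but the
wrapper below and the abstract enqueue step are illustrative only, and the
translation resolves only for memory that DPDK knows about.

#include <rte_memory.h>

static int
enqueue_with_pa(void *src_va, void *dst_va, size_t len)
{
	rte_iova_t src_iova = rte_mem_virt2iova(src_va);
	rte_iova_t dst_iova = rte_mem_virt2iova(dst_va);

	if (src_iova == RTE_BAD_IOVA || dst_iova == RTE_BAD_IOVA)
		return -1; /* let the caller fall back to a SW copy */

	/* hand src_iova/dst_iova/len to the ioat/dmadev enqueue here */
	return 0;
}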

Thanks,
Jiayu

> 
> >
> > Signed-off-by: Xuan Ding <xuan.ding@intel.com>
> > ---
> >
> > v3:
> > * Fixed some typos.
> >
> > v2:
> > * Fixed a format issue.
> > * Added the dma unmap logic when device is closed.
> > ---
> >  doc/guides/prog_guide/vhost_lib.rst |  20 +++++
> >  lib/vhost/vhost_user.c              | 125 +++++++++-------------------
> >  lib/vhost/virtio_net.c              |  30 +++----
> >  3 files changed, 69 insertions(+), 106 deletions(-)
> >
> > diff --git a/doc/guides/prog_guide/vhost_lib.rst
> > b/doc/guides/prog_guide/vhost_lib.rst
> > index d18fb98910..5777f0da96 100644
> > --- a/doc/guides/prog_guide/vhost_lib.rst
> > +++ b/doc/guides/prog_guide/vhost_lib.rst
> > @@ -420,3 +420,23 @@ Finally, a set of device ops is defined for device
> specific operations:
> >  * ``get_notify_area``
> >
> >    Called to get the notify area info of the queue.
> > +
> > +Vhost async data path
> > +---------------------
> > +
> > +* Address mode
> > +
> > +  Modern IOAT devices support to use the IOMMU, which can avoid using
> > + the unsafe HPA. Besides, the CPU cycles took by SW to translate from
> > + GPA to HPA can also be saved. So IOAT devices are defined to use
> > + virtual address instead of physical address.
> > +
> > +  With IOMMU enabled, to use IOAT devices:
> > +  1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
> > +  2. DPDK must use ``--iova-mode=va``.
> > +
> > +* Fallback
> > +
> > +  When the DMA copy fails, the user who implements the transfer_data
> > + callback can fallback to SW copy or fallback to PA copy through
> > + rte_mem_virt2iova().
> > diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c index
> > 8f0eba6412..c33fa784ff 100644
> > --- a/lib/vhost/vhost_user.c
> > +++ b/lib/vhost/vhost_user.c
> > @@ -45,6 +45,7 @@
> >  #include <rte_common.h>
> >  #include <rte_malloc.h>
> >  #include <rte_log.h>
> > +#include <rte_vfio.h>
> >
> >  #include "iotlb.h"
> >  #include "vhost.h"
> > @@ -141,6 +142,34 @@ get_blk_size(int fd)
> >  	return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;  }
> >
> > +static int
> > +async_dma_map(struct rte_vhost_mem_region *region, bool do_map) {
> > +	int ret = 0;
> > +	if (do_map) {
> > +		/* Add mapped region into the default container of DPDK. */
> > +		ret =
> rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +						 region->host_user_addr,
> > +						 region->host_user_addr,
> > +						 region->size);
> > +		if (ret) {
> > +			VHOST_LOG_CONFIG(ERR, "DMA engine map
> failed\n");
> > +			return ret;
> > +		}
> > +	} else {
> > +		/* Remove mapped region from the default container of
> DPDK. */
> > +		ret =
> rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +						   region->host_user_addr,
> > +						   region->host_user_addr,
> > +						   region->size);
> > +		if (ret) {
> > +			VHOST_LOG_CONFIG(ERR, "DMA engine unmap
> failed\n");
> > +			return ret;
> > +		}
> > +	}
> > +	return ret;
> > +}
> > +
> >  static void
> >  free_mem_region(struct virtio_net *dev)  { @@ -155,6 +184,9 @@
> > free_mem_region(struct virtio_net *dev)
> >  		if (reg->host_user_addr) {
> >  			munmap(reg->mmap_addr, reg->mmap_size);
> >  			close(reg->fd);
> > +
> > +			if (dev->async_copy)
> > +				async_dma_map(reg, false);
> >  		}
> >  	}
> >  }
> > @@ -866,87 +898,6 @@ vhost_user_set_vring_base(struct virtio_net
> **pdev,
> >  	return RTE_VHOST_MSG_RESULT_OK;
> >  }
> >
> > -static int
> > -add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
> > -		   uint64_t host_phys_addr, uint64_t size)
> > -{
> > -	struct guest_page *page, *last_page;
> > -	struct guest_page *old_pages;
> > -
> > -	if (dev->nr_guest_pages == dev->max_guest_pages) {
> > -		dev->max_guest_pages *= 2;
> > -		old_pages = dev->guest_pages;
> > -		dev->guest_pages = rte_realloc(dev->guest_pages,
> > -					dev->max_guest_pages *
> sizeof(*page),
> > -					RTE_CACHE_LINE_SIZE);
> > -		if (dev->guest_pages == NULL) {
> > -			VHOST_LOG_CONFIG(ERR, "cannot realloc
> guest_pages\n");
> > -			rte_free(old_pages);
> > -			return -1;
> > -		}
> > -	}
> > -
> > -	if (dev->nr_guest_pages > 0) {
> > -		last_page = &dev->guest_pages[dev->nr_guest_pages - 1];
> > -		/* merge if the two pages are continuous */
> > -		if (host_phys_addr == last_page->host_phys_addr +
> > -				      last_page->size) {
> > -			last_page->size += size;
> > -			return 0;
> > -		}
> > -	}
> > -
> > -	page = &dev->guest_pages[dev->nr_guest_pages++];
> > -	page->guest_phys_addr = guest_phys_addr;
> > -	page->host_phys_addr  = host_phys_addr;
> > -	page->size = size;
> > -
> > -	return 0;
> > -}
> > -
> > -static int
> > -add_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region
> *reg,
> > -		uint64_t page_size)
> > -{
> > -	uint64_t reg_size = reg->size;
> > -	uint64_t host_user_addr  = reg->host_user_addr;
> > -	uint64_t guest_phys_addr = reg->guest_phys_addr;
> > -	uint64_t host_phys_addr;
> > -	uint64_t size;
> > -
> > -	host_phys_addr = rte_mem_virt2iova((void
> *)(uintptr_t)host_user_addr);
> > -	size = page_size - (guest_phys_addr & (page_size - 1));
> > -	size = RTE_MIN(size, reg_size);
> > -
> > -	if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size)
> < 0)
> > -		return -1;
> > -
> > -	host_user_addr  += size;
> > -	guest_phys_addr += size;
> > -	reg_size -= size;
> > -
> > -	while (reg_size > 0) {
> > -		size = RTE_MIN(reg_size, page_size);
> > -		host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)
> > -						  host_user_addr);
> > -		if (add_one_guest_page(dev, guest_phys_addr,
> host_phys_addr,
> > -				size) < 0)
> > -			return -1;
> > -
> > -		host_user_addr  += size;
> > -		guest_phys_addr += size;
> > -		reg_size -= size;
> > -	}
> > -
> > -	/* sort guest page array if over binary search threshold */
> > -	if (dev->nr_guest_pages >= VHOST_BINARY_SEARCH_THRESH) {
> > -		qsort((void *)dev->guest_pages, dev->nr_guest_pages,
> > -			sizeof(struct guest_page), guest_page_addrcmp);
> > -	}
> > -
> > -	return 0;
> > -}
> > -
> >  #ifdef RTE_LIBRTE_VHOST_DEBUG
> >  /* TODO: enable it only in debug mode? */  static void @@ -1105,6
> > +1056,7 @@ vhost_user_mmap_region(struct virtio_net *dev,
> >  	uint64_t mmap_size;
> >  	uint64_t alignment;
> >  	int populate;
> > +	int ret;
> >
> >  	/* Check for memory_size + mmap_offset overflow */
> >  	if (mmap_offset >= -region->size) {
> > @@ -1158,12 +1110,13 @@ vhost_user_mmap_region(struct virtio_net
> *dev,
> >  	region->mmap_size = mmap_size;
> >  	region->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
> > mmap_offset;
> >
> > -	if (dev->async_copy)
> > -		if (add_guest_pages(dev, region, alignment) < 0) {
> > -			VHOST_LOG_CONFIG(ERR,
> > -					"adding guest pages to region
> failed.\n");
> > +	if (dev->async_copy) {
> > +		ret = async_dma_map(region, true);
> > +		if (ret) {
> > +			VHOST_LOG_CONFIG(ERR, "Configure IOMMU for
> DMA engine failed\n");
> >  			return -1;
> 
> Maybe we're too late in the init already, but I would think we may want to
> fallback to SW implementation insea
> 
> > -		}
> > +			}
> > +	}
> >
> >  	VHOST_LOG_CONFIG(INFO,
> >  			"guest memory region size: 0x%" PRIx64 "\n"
> > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c index
> > 8da8a86a10..88110d2cb3 100644
> > --- a/lib/vhost/virtio_net.c
> > +++ b/lib/vhost/virtio_net.c
> > @@ -980,11 +980,9 @@ async_mbuf_to_desc(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
> >  	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
> >  	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
> >  	int error = 0;
> > -	uint64_t mapped_len;
> >
> >  	uint32_t tlen = 0;
> >  	int tvec_idx = 0;
> > -	void *hpa;
> >
> >  	if (unlikely(m == NULL)) {
> >  		error = -1;
> > @@ -1074,27 +1072,19 @@ async_mbuf_to_desc(struct virtio_net *dev,
> > struct vhost_virtqueue *vq,
> >
> >  		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
> >
> > -		while (unlikely(cpy_len && cpy_len >= cpy_threshold)) {
> > -			hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev,
> > -					buf_iova + buf_offset,
> > -					cpy_len, &mapped_len);
> > -
> > -			if (unlikely(!hpa || mapped_len < cpy_threshold))
> > -				break;
> > -
> > +		if (unlikely(cpy_len >= cpy_threshold)) {
> >  			async_fill_vec(src_iovec + tvec_idx,
> > -				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
> > -				mbuf_offset), (size_t)mapped_len);
> > +				rte_pktmbuf_mtod_offset(m, void *,
> mbuf_offset),
> > +(size_t)cpy_len);
> >
> >  			async_fill_vec(dst_iovec + tvec_idx,
> > -					hpa, (size_t)mapped_len);
> > -
> > -			tlen += (uint32_t)mapped_len;
> > -			cpy_len -= (uint32_t)mapped_len;
> > -			mbuf_avail  -= (uint32_t)mapped_len;
> > -			mbuf_offset += (uint32_t)mapped_len;
> > -			buf_avail  -= (uint32_t)mapped_len;
> > -			buf_offset += (uint32_t)mapped_len;
> > +				(void *)((uintptr_t)(buf_addr + buf_offset)),
> (size_t)cpy_len);
> > +
> > +			tlen += cpy_len;
> > +			mbuf_avail  -= cpy_len;
> > +			mbuf_offset += cpy_len;
> > +			buf_avail  -= cpy_len;
> > +			buf_offset += cpy_len;
> > +			cpy_len = 0;
> >  			tvec_idx++;
> >  		}
> >
> >
  
Ding, Xuan June 22, 2021, 6:18 a.m. UTC | #3
Hi Maxime,

Replies are inline.	

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Saturday, June 19, 2021 12:18 AM
> To: Ding, Xuan <xuan.ding@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>
> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
> <sunil.pai.g@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Van
> Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong <yong.liu@intel.com>
> Subject: Re: [PATCH v3] vhost: enable IOMMU for async vhost
> 
> Hi Xuan,
> 
> On 6/3/21 7:30 PM, xuan.ding@intel.com wrote:
> > From: Xuan Ding <xuan.ding@intel.com>
> >
> > For async copy, it is unsafe to directly use the physical address.
> > And current address translation from GPA to HPA via SW also takes
> > CPU cycles, these can all benefit from IOMMU.
> >
> > Since the existing DMA engine supports to use platform IOMMU,
> > this patch enables IOMMU for async vhost, which defines IOAT
> > devices to use virtual address instead of physical address.
> 
> We have to keep in mind a generic DMA api is coming, and maybe we want
> a SW implementation of a dmadev based on memcpy at least for
> testing/debugging purpose.

I noticed the generic dmadev model is under discussion. To support a SW
implementation, VA mode support is needed; this is also the problem that
this patch hopes to solve. Traditionally, the DMA engine can only use
physical addresses in PA mode.

> 
> > When set memory table, the frontend's memory will be mapped
> > to the default container of DPDK where IOAT devices have been
> > added into. When DMA copy fails, the virtual address provided
> > to IOAT devices also allow us fallback to SW copy or PA copy.
> >
> > With IOMMU enabled, to use IOAT devices:
> > 1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
> > 2. DPDK must use "--iova-mode=va".
> 
> I think this is problematic, at least we need to check the right iova
> mode has been selected, but even with doing that it is limiting.

As a library, vhost is not aware of the address type (PA or VA) selected by
the device, nor of the current DPDK IOVA mode. To some extent, this patch is
a proposal. With the device fed with VAs, SW fallback can be supported, and
a VA can also be translated to a PA through rte_mem_virt2iova(). Finally,
the address type used by the device is determined by the callback, not vice
versa.

If the DMA callback implementer follows this design, SW fallback can be
supported. I would be very grateful if you could provide some insights on
this design. :)

> 
> What prevent us to reuse add_guest_pages() alogrithm to implement
> IOVA_AS_PA?

If IOVA is PA, it's not easy to translate a PA back to a VA to support a SW
implementation. So far, I don't have any good idea for being compatible with
both IOVA_AS_PA and IOVA_AS_VA at the same time, because it requires the
vhost library to select PA or VA for the DMA device according to the DPDK
IOVA mode.

Thanks,
Xuan

> 
> >
> > Signed-off-by: Xuan Ding <xuan.ding@intel.com>
> > ---
> >
> > v3:
> > * Fixed some typos.
> >
> > v2:
> > * Fixed a format issue.
> > * Added the dma unmap logic when device is closed.
> > ---
> >  doc/guides/prog_guide/vhost_lib.rst |  20 +++++
> >  lib/vhost/vhost_user.c              | 125 +++++++++-------------------
> >  lib/vhost/virtio_net.c              |  30 +++----
> >  3 files changed, 69 insertions(+), 106 deletions(-)
> >
> > diff --git a/doc/guides/prog_guide/vhost_lib.rst
> b/doc/guides/prog_guide/vhost_lib.rst
> > index d18fb98910..5777f0da96 100644
> > --- a/doc/guides/prog_guide/vhost_lib.rst
> > +++ b/doc/guides/prog_guide/vhost_lib.rst
> > @@ -420,3 +420,23 @@ Finally, a set of device ops is defined for device
> specific operations:
> >  * ``get_notify_area``
> >
> >    Called to get the notify area info of the queue.
> > +
> > +Vhost async data path
> > +---------------------
> > +
> > +* Address mode
> > +
> > +  Modern IOAT devices support to use the IOMMU, which can avoid using
> > +  the unsafe HPA. Besides, the CPU cycles took by SW to translate from
> > +  GPA to HPA can also be saved. So IOAT devices are defined to use
> > +  virtual address instead of physical address.
> > +
> > +  With IOMMU enabled, to use IOAT devices:
> > +  1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
> > +  2. DPDK must use ``--iova-mode=va``.
> > +
> > +* Fallback
> > +
> > +  When the DMA copy fails, the user who implements the transfer_data
> > +  callback can fallback to SW copy or fallback to PA copy through
> > +  rte_mem_virt2iova().
> > diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> > index 8f0eba6412..c33fa784ff 100644
> > --- a/lib/vhost/vhost_user.c
> > +++ b/lib/vhost/vhost_user.c
> > @@ -45,6 +45,7 @@
> >  #include <rte_common.h>
> >  #include <rte_malloc.h>
> >  #include <rte_log.h>
> > +#include <rte_vfio.h>
> >
> >  #include "iotlb.h"
> >  #include "vhost.h"
> > @@ -141,6 +142,34 @@ get_blk_size(int fd)
> >  	return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
> >  }
> >
> > +static int
> > +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> > +{
> > +	int ret = 0;
> > +	if (do_map) {
> > +		/* Add mapped region into the default container of DPDK. */
> > +		ret =
> rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +						 region->host_user_addr,
> > +						 region->host_user_addr,
> > +						 region->size);
> > +		if (ret) {
> > +			VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> > +			return ret;
> > +		}
> > +	} else {
> > +		/* Remove mapped region from the default container of DPDK.
> */
> > +		ret =
> rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > +						   region->host_user_addr,
> > +						   region->host_user_addr,
> > +						   region->size);
> > +		if (ret) {
> > +			VHOST_LOG_CONFIG(ERR, "DMA engine unmap
> failed\n");
> > +			return ret;
> > +		}
> > +	}
> > +	return ret;
> > +}
> > +
> >  static void
> >  free_mem_region(struct virtio_net *dev)
> >  {
> > @@ -155,6 +184,9 @@ free_mem_region(struct virtio_net *dev)
> >  		if (reg->host_user_addr) {
> >  			munmap(reg->mmap_addr, reg->mmap_size);
> >  			close(reg->fd);
> > +
> > +			if (dev->async_copy)
> > +				async_dma_map(reg, false);
> >  		}
> >  	}
> >  }
> > @@ -866,87 +898,6 @@ vhost_user_set_vring_base(struct virtio_net **pdev,
> >  	return RTE_VHOST_MSG_RESULT_OK;
> >  }
> >
> > -static int
> > -add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
> > -		   uint64_t host_phys_addr, uint64_t size)
> > -{
> > -	struct guest_page *page, *last_page;
> > -	struct guest_page *old_pages;
> > -
> > -	if (dev->nr_guest_pages == dev->max_guest_pages) {
> > -		dev->max_guest_pages *= 2;
> > -		old_pages = dev->guest_pages;
> > -		dev->guest_pages = rte_realloc(dev->guest_pages,
> > -					dev->max_guest_pages * sizeof(*page),
> > -					RTE_CACHE_LINE_SIZE);
> > -		if (dev->guest_pages == NULL) {
> > -			VHOST_LOG_CONFIG(ERR, "cannot realloc
> guest_pages\n");
> > -			rte_free(old_pages);
> > -			return -1;
> > -		}
> > -	}
> > -
> > -	if (dev->nr_guest_pages > 0) {
> > -		last_page = &dev->guest_pages[dev->nr_guest_pages - 1];
> > -		/* merge if the two pages are continuous */
> > -		if (host_phys_addr == last_page->host_phys_addr +
> > -				      last_page->size) {
> > -			last_page->size += size;
> > -			return 0;
> > -		}
> > -	}
> > -
> > -	page = &dev->guest_pages[dev->nr_guest_pages++];
> > -	page->guest_phys_addr = guest_phys_addr;
> > -	page->host_phys_addr  = host_phys_addr;
> > -	page->size = size;
> > -
> > -	return 0;
> > -}
> > -
> > -static int
> > -add_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region *reg,
> > -		uint64_t page_size)
> > -{
> > -	uint64_t reg_size = reg->size;
> > -	uint64_t host_user_addr  = reg->host_user_addr;
> > -	uint64_t guest_phys_addr = reg->guest_phys_addr;
> > -	uint64_t host_phys_addr;
> > -	uint64_t size;
> > -
> > -	host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)host_user_addr);
> > -	size = page_size - (guest_phys_addr & (page_size - 1));
> > -	size = RTE_MIN(size, reg_size);
> > -
> > -	if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size) <
> 0)
> > -		return -1;
> > -
> > -	host_user_addr  += size;
> > -	guest_phys_addr += size;
> > -	reg_size -= size;
> > -
> > -	while (reg_size > 0) {
> > -		size = RTE_MIN(reg_size, page_size);
> > -		host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)
> > -						  host_user_addr);
> > -		if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr,
> > -				size) < 0)
> > -			return -1;
> > -
> > -		host_user_addr  += size;
> > -		guest_phys_addr += size;
> > -		reg_size -= size;
> > -	}
> > -
> > -	/* sort guest page array if over binary search threshold */
> > -	if (dev->nr_guest_pages >= VHOST_BINARY_SEARCH_THRESH) {
> > -		qsort((void *)dev->guest_pages, dev->nr_guest_pages,
> > -			sizeof(struct guest_page), guest_page_addrcmp);
> > -	}
> > -
> > -	return 0;
> > -}
> > -
> >  #ifdef RTE_LIBRTE_VHOST_DEBUG
> >  /* TODO: enable it only in debug mode? */
> >  static void
> > @@ -1105,6 +1056,7 @@ vhost_user_mmap_region(struct virtio_net *dev,
> >  	uint64_t mmap_size;
> >  	uint64_t alignment;
> >  	int populate;
> > +	int ret;
> >
> >  	/* Check for memory_size + mmap_offset overflow */
> >  	if (mmap_offset >= -region->size) {
> > @@ -1158,12 +1110,13 @@ vhost_user_mmap_region(struct virtio_net *dev,
> >  	region->mmap_size = mmap_size;
> >  	region->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
> mmap_offset;
> >
> > -	if (dev->async_copy)
> > -		if (add_guest_pages(dev, region, alignment) < 0) {
> > -			VHOST_LOG_CONFIG(ERR,
> > -					"adding guest pages to region
> failed.\n");
> > +	if (dev->async_copy) {
> > +		ret = async_dma_map(region, true);
> > +		if (ret) {
> > +			VHOST_LOG_CONFIG(ERR, "Configure IOMMU for DMA
> engine failed\n");
> >  			return -1;
> 
> Maybe we're too late in the init already, but I would think we may want
> to fallback to SW implementation insea
> 
> > -		}
> > +			}
> > +	}
> >
> >  	VHOST_LOG_CONFIG(INFO,
> >  			"guest memory region size: 0x%" PRIx64 "\n"
> > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> > index 8da8a86a10..88110d2cb3 100644
> > --- a/lib/vhost/virtio_net.c
> > +++ b/lib/vhost/virtio_net.c
> > @@ -980,11 +980,9 @@ async_mbuf_to_desc(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
> >  	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
> >  	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
> >  	int error = 0;
> > -	uint64_t mapped_len;
> >
> >  	uint32_t tlen = 0;
> >  	int tvec_idx = 0;
> > -	void *hpa;
> >
> >  	if (unlikely(m == NULL)) {
> >  		error = -1;
> > @@ -1074,27 +1072,19 @@ async_mbuf_to_desc(struct virtio_net *dev,
> struct vhost_virtqueue *vq,
> >
> >  		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
> >
> > -		while (unlikely(cpy_len && cpy_len >= cpy_threshold)) {
> > -			hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev,
> > -					buf_iova + buf_offset,
> > -					cpy_len, &mapped_len);
> > -
> > -			if (unlikely(!hpa || mapped_len < cpy_threshold))
> > -				break;
> > -
> > +		if (unlikely(cpy_len >= cpy_threshold)) {
> >  			async_fill_vec(src_iovec + tvec_idx,
> > -				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
> > -				mbuf_offset), (size_t)mapped_len);
> > +				rte_pktmbuf_mtod_offset(m, void *,
> mbuf_offset), (size_t)cpy_len);
> >
> >  			async_fill_vec(dst_iovec + tvec_idx,
> > -					hpa, (size_t)mapped_len);
> > -
> > -			tlen += (uint32_t)mapped_len;
> > -			cpy_len -= (uint32_t)mapped_len;
> > -			mbuf_avail  -= (uint32_t)mapped_len;
> > -			mbuf_offset += (uint32_t)mapped_len;
> > -			buf_avail  -= (uint32_t)mapped_len;
> > -			buf_offset += (uint32_t)mapped_len;
> > +				(void *)((uintptr_t)(buf_addr + buf_offset)),
> (size_t)cpy_len);
> > +
> > +			tlen += cpy_len;
> > +			mbuf_avail  -= cpy_len;
> > +			mbuf_offset += cpy_len;
> > +			buf_avail  -= cpy_len;
> > +			buf_offset += cpy_len;
> > +			cpy_len = 0;
> >  			tvec_idx++;
> >  		}
> >
> >
  
Maxime Coquelin June 29, 2021, 9:23 a.m. UTC | #4
Hi Xuan,

On 6/22/21 8:18 AM, Ding, Xuan wrote:
> Hi Maxime,
> 
> Replies are inline.	
> 
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Saturday, June 19, 2021 12:18 AM
>> To: Ding, Xuan <xuan.ding@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>
>> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
>> <sunil.pai.g@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Van
>> Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong <yong.liu@intel.com>
>> Subject: Re: [PATCH v3] vhost: enable IOMMU for async vhost
>>
>> Hi Xuan,
>>
>> On 6/3/21 7:30 PM, xuan.ding@intel.com wrote:
>>> From: Xuan Ding <xuan.ding@intel.com>
>>>
>>> For async copy, it is unsafe to directly use the physical address.
>>> And current address translation from GPA to HPA via SW also takes
>>> CPU cycles, these can all benefit from IOMMU.
>>>
>>> Since the existing DMA engine supports to use platform IOMMU,
>>> this patch enables IOMMU for async vhost, which defines IOAT
>>> devices to use virtual address instead of physical address.
>>
>> We have to keep in mind a generic DMA api is coming, and maybe we want
>> a SW implementation of a dmadev based on memcpy at least for
>> testing/debugging purpose.
> 
> I noticed the generic dmadev model is under discussion. To support a SW
> implementation, the VA mode support is needed, this is also the problem
> that this patch hopes to solve. Traditionally, DMA engine can only use
> physical address in PA mode.
> 
>>
>>> When set memory table, the frontend's memory will be mapped
>>> to the default container of DPDK where IOAT devices have been
>>> added into. When DMA copy fails, the virtual address provided
>>> to IOAT devices also allow us fallback to SW copy or PA copy.
>>>
>>> With IOMMU enabled, to use IOAT devices:
>>> 1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
>>> 2. DPDK must use "--iova-mode=va".
>>
>> I think this is problematic, at least we need to check the right iova
>> mode has been selected, but even with doing that it is limiting.
> 
> As a library, vhost is not aware of the device selected address(PA or VA)
> and current DPDK iova mode. To some extent, this patch is a proposal.

If I'm not mistaken, the DMA device driver init should fail if it does
not support the DPDK IOVA mode.

Then, on Vhost lib side, you should be able to get the IOVA mode by
using the rte_eal_iova_mode() API.
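
For illustration, a minimal sketch of such a check, reusing the
async_dma_map() path from the patch (simplified here, not a final
implementation):

#include <rte_eal.h>
#include <rte_vfio.h>

static int
async_dma_map_checked(struct rte_vhost_mem_region *region)
{
	/* Refuse (or fall back) when EAL is not running in IOVA-as-VA mode,
	 * instead of only documenting the requirement. */
	if (rte_eal_iova_mode() != RTE_IOVA_VA) {
		VHOST_LOG_CONFIG(ERR,
			"async vhost VA mapping requires --iova-mode=va\n");
		return -1;
	}

	return rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
					  region->host_user_addr,
					  region->host_user_addr,
					  region->size);
}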

> With device fed with VA, SW fallback can be supported.
> And VA can also be translated to PA through rte_mem_virt2iova().
> Finally, the address selected by the device is determined by callback.
> Not vice versa.
> 
> If the DMA callback implementer follows this design, SW fallback can be supported.
> I would be very grateful if you could provide some insights for this design. :)

TBH, I find the async design too complicated.
Having some descriptors handled by the DMA engine and others by the CPU
makes it extremely hard to debug. Also, it makes Vhost library usage less
deterministic.

>>
>> What prevent us to reuse add_guest_pages() alogrithm to implement
>> IOVA_AS_PA?
> 
> If IOVA is PA, it's not easy to translate PA to VA to support SW implementation.

What prevents you from using dev->guest_pages[] in that case to do the
translations?
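
For reference, a sketch of that translation: walk dev->guest_pages[] (the
array that add_guest_pages(), removed by this patch, used to build) to turn
a guest physical address into a host physical address. Field names follow
the struct guest_page entries visible in the removed code; this is an
illustration, not the existing vhost helper.

static uint64_t
gpa_to_hpa_linear(struct virtio_net *dev, uint64_t gpa, uint64_t size)
{
	uint32_t i;

	for (i = 0; i < dev->nr_guest_pages; i++) {
		struct guest_page *page = &dev->guest_pages[i];

		/* the requested range must sit entirely inside one entry */
		if (gpa >= page->guest_phys_addr &&
		    gpa + size <= page->guest_phys_addr + page->size)
			return gpa - page->guest_phys_addr +
			       page->host_phys_addr;
	}

	return 0; /* no mapping found */
}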

> Until now, I don't have any good ideas to be compatible with IOVA_AS_PA 
> and IOVA_AS_VA at the same time, because it requires vhost library to
> select PA/VA for DMA device according to different DPDK iova mode.

If the DMA device claims to support IOVA_AS_PA at probe time, it should
be usable by the vhost library. It might not be the most efficient
mode, but we cannot just have a comment in the documentation saying that
IOVA_AS_VA is the only supported mode, without any safety check in the
code itself.

Regards,
Maxime

> Thanks,
> Xuan
> 
>>
>>>
>>> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
>>> ---
>>>
>>> v3:
>>> * Fixed some typos.
>>>
>>> v2:
>>> * Fixed a format issue.
>>> * Added the dma unmap logic when device is closed.
>>> ---
>>>  doc/guides/prog_guide/vhost_lib.rst |  20 +++++
>>>  lib/vhost/vhost_user.c              | 125 +++++++++-------------------
>>>  lib/vhost/virtio_net.c              |  30 +++----
>>>  3 files changed, 69 insertions(+), 106 deletions(-)
>>>
>>> diff --git a/doc/guides/prog_guide/vhost_lib.rst
>> b/doc/guides/prog_guide/vhost_lib.rst
>>> index d18fb98910..5777f0da96 100644
>>> --- a/doc/guides/prog_guide/vhost_lib.rst
>>> +++ b/doc/guides/prog_guide/vhost_lib.rst
>>> @@ -420,3 +420,23 @@ Finally, a set of device ops is defined for device
>> specific operations:
>>>  * ``get_notify_area``
>>>
>>>    Called to get the notify area info of the queue.
>>> +
>>> +Vhost async data path
>>> +---------------------
>>> +
>>> +* Address mode
>>> +
>>> +  Modern IOAT devices support to use the IOMMU, which can avoid using
>>> +  the unsafe HPA. Besides, the CPU cycles took by SW to translate from
>>> +  GPA to HPA can also be saved. So IOAT devices are defined to use
>>> +  virtual address instead of physical address.
>>> +
>>> +  With IOMMU enabled, to use IOAT devices:
>>> +  1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
>>> +  2. DPDK must use ``--iova-mode=va``.
>>> +
>>> +* Fallback
>>> +
>>> +  When the DMA copy fails, the user who implements the transfer_data
>>> +  callback can fallback to SW copy or fallback to PA copy through
>>> +  rte_mem_virt2iova().
>>> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
>>> index 8f0eba6412..c33fa784ff 100644
>>> --- a/lib/vhost/vhost_user.c
>>> +++ b/lib/vhost/vhost_user.c
>>> @@ -45,6 +45,7 @@
>>>  #include <rte_common.h>
>>>  #include <rte_malloc.h>
>>>  #include <rte_log.h>
>>> +#include <rte_vfio.h>
>>>
>>>  #include "iotlb.h"
>>>  #include "vhost.h"
>>> @@ -141,6 +142,34 @@ get_blk_size(int fd)
>>>  	return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
>>>  }
>>>
>>> +static int
>>> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
>>> +{
>>> +	int ret = 0;
>>> +	if (do_map) {
>>> +		/* Add mapped region into the default container of DPDK. */
>>> +		ret =
>> rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>> +						 region->host_user_addr,
>>> +						 region->host_user_addr,
>>> +						 region->size);
>>> +		if (ret) {
>>> +			VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
>>> +			return ret;
>>> +		}
>>> +	} else {
>>> +		/* Remove mapped region from the default container of DPDK.
>> */
>>> +		ret =
>> rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>> +						   region->host_user_addr,
>>> +						   region->host_user_addr,
>>> +						   region->size);
>>> +		if (ret) {
>>> +			VHOST_LOG_CONFIG(ERR, "DMA engine unmap
>> failed\n");
>>> +			return ret;
>>> +		}
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>>  static void
>>>  free_mem_region(struct virtio_net *dev)
>>>  {
>>> @@ -155,6 +184,9 @@ free_mem_region(struct virtio_net *dev)
>>>  		if (reg->host_user_addr) {
>>>  			munmap(reg->mmap_addr, reg->mmap_size);
>>>  			close(reg->fd);
>>> +
>>> +			if (dev->async_copy)
>>> +				async_dma_map(reg, false);
>>>  		}
>>>  	}
>>>  }
>>> @@ -866,87 +898,6 @@ vhost_user_set_vring_base(struct virtio_net **pdev,
>>>  	return RTE_VHOST_MSG_RESULT_OK;
>>>  }
>>>
>>> -static int
>>> -add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
>>> -		   uint64_t host_phys_addr, uint64_t size)
>>> -{
>>> -	struct guest_page *page, *last_page;
>>> -	struct guest_page *old_pages;
>>> -
>>> -	if (dev->nr_guest_pages == dev->max_guest_pages) {
>>> -		dev->max_guest_pages *= 2;
>>> -		old_pages = dev->guest_pages;
>>> -		dev->guest_pages = rte_realloc(dev->guest_pages,
>>> -					dev->max_guest_pages * sizeof(*page),
>>> -					RTE_CACHE_LINE_SIZE);
>>> -		if (dev->guest_pages == NULL) {
>>> -			VHOST_LOG_CONFIG(ERR, "cannot realloc
>> guest_pages\n");
>>> -			rte_free(old_pages);
>>> -			return -1;
>>> -		}
>>> -	}
>>> -
>>> -	if (dev->nr_guest_pages > 0) {
>>> -		last_page = &dev->guest_pages[dev->nr_guest_pages - 1];
>>> -		/* merge if the two pages are continuous */
>>> -		if (host_phys_addr == last_page->host_phys_addr +
>>> -				      last_page->size) {
>>> -			last_page->size += size;
>>> -			return 0;
>>> -		}
>>> -	}
>>> -
>>> -	page = &dev->guest_pages[dev->nr_guest_pages++];
>>> -	page->guest_phys_addr = guest_phys_addr;
>>> -	page->host_phys_addr  = host_phys_addr;
>>> -	page->size = size;
>>> -
>>> -	return 0;
>>> -}
>>> -
>>> -static int
>>> -add_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region *reg,
>>> -		uint64_t page_size)
>>> -{
>>> -	uint64_t reg_size = reg->size;
>>> -	uint64_t host_user_addr  = reg->host_user_addr;
>>> -	uint64_t guest_phys_addr = reg->guest_phys_addr;
>>> -	uint64_t host_phys_addr;
>>> -	uint64_t size;
>>> -
>>> -	host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)host_user_addr);
>>> -	size = page_size - (guest_phys_addr & (page_size - 1));
>>> -	size = RTE_MIN(size, reg_size);
>>> -
>>> -	if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size) <
>> 0)
>>> -		return -1;
>>> -
>>> -	host_user_addr  += size;
>>> -	guest_phys_addr += size;
>>> -	reg_size -= size;
>>> -
>>> -	while (reg_size > 0) {
>>> -		size = RTE_MIN(reg_size, page_size);
>>> -		host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)
>>> -						  host_user_addr);
>>> -		if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr,
>>> -				size) < 0)
>>> -			return -1;
>>> -
>>> -		host_user_addr  += size;
>>> -		guest_phys_addr += size;
>>> -		reg_size -= size;
>>> -	}
>>> -
>>> -	/* sort guest page array if over binary search threshold */
>>> -	if (dev->nr_guest_pages >= VHOST_BINARY_SEARCH_THRESH) {
>>> -		qsort((void *)dev->guest_pages, dev->nr_guest_pages,
>>> -			sizeof(struct guest_page), guest_page_addrcmp);
>>> -	}
>>> -
>>> -	return 0;
>>> -}
>>> -
>>>  #ifdef RTE_LIBRTE_VHOST_DEBUG
>>>  /* TODO: enable it only in debug mode? */
>>>  static void
>>> @@ -1105,6 +1056,7 @@ vhost_user_mmap_region(struct virtio_net *dev,
>>>  	uint64_t mmap_size;
>>>  	uint64_t alignment;
>>>  	int populate;
>>> +	int ret;
>>>
>>>  	/* Check for memory_size + mmap_offset overflow */
>>>  	if (mmap_offset >= -region->size) {
>>> @@ -1158,12 +1110,13 @@ vhost_user_mmap_region(struct virtio_net *dev,
>>>  	region->mmap_size = mmap_size;
>>>  	region->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
>> mmap_offset;
>>>
>>> -	if (dev->async_copy)
>>> -		if (add_guest_pages(dev, region, alignment) < 0) {
>>> -			VHOST_LOG_CONFIG(ERR,
>>> -					"adding guest pages to region
>> failed.\n");
>>> +	if (dev->async_copy) {
>>> +		ret = async_dma_map(region, true);
>>> +		if (ret) {
>>> +			VHOST_LOG_CONFIG(ERR, "Configure IOMMU for DMA
>> engine failed\n");
>>>  			return -1;
>>
>> Maybe we're too late in the init already, but I would think we may want
>> to fallback to SW implementation insea
>>
>>> -		}
>>> +			}
>>> +	}
>>>
>>>  	VHOST_LOG_CONFIG(INFO,
>>>  			"guest memory region size: 0x%" PRIx64 "\n"
>>> diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
>>> index 8da8a86a10..88110d2cb3 100644
>>> --- a/lib/vhost/virtio_net.c
>>> +++ b/lib/vhost/virtio_net.c
>>> @@ -980,11 +980,9 @@ async_mbuf_to_desc(struct virtio_net *dev, struct
>> vhost_virtqueue *vq,
>>>  	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
>>>  	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
>>>  	int error = 0;
>>> -	uint64_t mapped_len;
>>>
>>>  	uint32_t tlen = 0;
>>>  	int tvec_idx = 0;
>>> -	void *hpa;
>>>
>>>  	if (unlikely(m == NULL)) {
>>>  		error = -1;
>>> @@ -1074,27 +1072,19 @@ async_mbuf_to_desc(struct virtio_net *dev,
>> struct vhost_virtqueue *vq,
>>>
>>>  		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
>>>
>>> -		while (unlikely(cpy_len && cpy_len >= cpy_threshold)) {
>>> -			hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev,
>>> -					buf_iova + buf_offset,
>>> -					cpy_len, &mapped_len);
>>> -
>>> -			if (unlikely(!hpa || mapped_len < cpy_threshold))
>>> -				break;
>>> -
>>> +		if (unlikely(cpy_len >= cpy_threshold)) {
>>>  			async_fill_vec(src_iovec + tvec_idx,
>>> -				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
>>> -				mbuf_offset), (size_t)mapped_len);
>>> +				rte_pktmbuf_mtod_offset(m, void *,
>> mbuf_offset), (size_t)cpy_len);
>>>
>>>  			async_fill_vec(dst_iovec + tvec_idx,
>>> -					hpa, (size_t)mapped_len);
>>> -
>>> -			tlen += (uint32_t)mapped_len;
>>> -			cpy_len -= (uint32_t)mapped_len;
>>> -			mbuf_avail  -= (uint32_t)mapped_len;
>>> -			mbuf_offset += (uint32_t)mapped_len;
>>> -			buf_avail  -= (uint32_t)mapped_len;
>>> -			buf_offset += (uint32_t)mapped_len;
>>> +				(void *)((uintptr_t)(buf_addr + buf_offset)),
>> (size_t)cpy_len);
>>> +
>>> +			tlen += cpy_len;
>>> +			mbuf_avail  -= cpy_len;
>>> +			mbuf_offset += cpy_len;
>>> +			buf_avail  -= cpy_len;
>>> +			buf_offset += cpy_len;
>>> +			cpy_len = 0;
>>>  			tvec_idx++;
>>>  		}
>>>
>>>
>
  
Ding, Xuan July 1, 2021, 5:12 a.m. UTC | #5
Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, June 29, 2021 5:23 PM
> To: Ding, Xuan <xuan.ding@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>
> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
> <sunil.pai.g@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Van
> Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong <yong.liu@intel.com>;
> Jiang, Cheng1 <cheng1.jiang@intel.com>
> Subject: Re: [PATCH v3] vhost: enable IOMMU for async vhost
> 
> Hi Xuan,
> 
> On 6/22/21 8:18 AM, Ding, Xuan wrote:
> > Hi Maxime,
> >
> > Replies are inline.
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> >> Sent: Saturday, June 19, 2021 12:18 AM
> >> To: Ding, Xuan <xuan.ding@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>
> >> Cc: dev@dpdk.org; Hu, Jiayu <jiayu.hu@intel.com>; Pai G, Sunil
> >> <sunil.pai.g@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> Van
> >> Haaren, Harry <harry.van.haaren@intel.com>; Liu, Yong <yong.liu@intel.com>
> >> Subject: Re: [PATCH v3] vhost: enable IOMMU for async vhost
> >>
> >> Hi Xuan,
> >>
> >> On 6/3/21 7:30 PM, xuan.ding@intel.com wrote:
> >>> From: Xuan Ding <xuan.ding@intel.com>
> >>>
> >>> For async copy, it is unsafe to directly use the physical address.
> >>> And current address translation from GPA to HPA via SW also takes
> >>> CPU cycles, these can all benefit from IOMMU.
> >>>
> >>> Since the existing DMA engine supports to use platform IOMMU,
> >>> this patch enables IOMMU for async vhost, which defines IOAT
> >>> devices to use virtual address instead of physical address.
> >>
> >> We have to keep in mind a generic DMA api is coming, and maybe we want
> >> a SW implementation of a dmadev based on memcpy at least for
> >> testing/debugging purpose.
> >
> > I noticed the generic dmadev model is under discussion. To support a SW
> > implementation, the VA mode support is needed, this is also the problem
> > that this patch hopes to solve. Traditionally, DMA engine can only use
> > physical address in PA mode.
> >
> >>
> >>> When set memory table, the frontend's memory will be mapped
> >>> to the default container of DPDK where IOAT devices have been
> >>> added into. When DMA copy fails, the virtual address provided
> >>> to IOAT devices also allow us fallback to SW copy or PA copy.
> >>>
> >>> With IOMMU enabled, to use IOAT devices:
> >>> 1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
> >>> 2. DPDK must use "--iova-mode=va".
> >>
> >> I think this is problematic, at least we need to check the right iova
> >> mode has been selected, but even with doing that it is limiting.
> >
> > As a library, vhost is not aware of the device selected address(PA or VA)
> > and current DPDK iova mode. To some extent, this patch is a proposal.
> 
> If I'm not mistaken, the DMA device driver init should fail if it does
> not support the DPDK IOVA mode.
> 
> Then, on Vhost lib side, you should be able to get the IOVA mode by
> using the rte_eal_iova_mode() API.

Got your point. If the Vhost lib is able to get the IOVA mode, I think it is
possible to be compatible with the different DPDK IOVA modes. I will work
out a new patch to pass the IOVA to the callback instead of only the virtual
address.

> 
> > With device fed with VA, SW fallback can be supported.
> > And VA can also be translated to PA through rte_mem_virt2iova().
> > Finally, the address selected by the device is determined by callback.
> > Not vice versa.
> >
> > If the DMA callback implementer follows this design, SW fallback can be
> supported.
> > I would be very grateful if you could provide some insights for this design. :)
> 
> TBH, I find the async design too much complicated.
> Having some descriptors handled by the DMA engine, others by the CPU
> makes it extremly hard to debug. Also, it makes Vhost library use less
> deterministic.

This is to account for the difference in copy efficiency. For small packets
(below the threshold), the CPU copy performs better. When the packet length
is above the threshold, the DMA engine copy brings a big performance
improvement.

> 
> >>
> >> What prevent us to reuse add_guest_pages() alogrithm to implement
> >> IOVA_AS_PA?
> >
> > If IOVA is PA, it's not easy to translate PA to VA to support SW implementation.
> 
> What prevent you to use dev->guest_pages[] in that case to do the
> translations?

Yes, you are right, using rte_mem_iova2virt() can help to do so.
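
A small sketch of that direction for the IOVA-as-PA case: recover a virtual
address from an IOVA with rte_mem_iova2virt() and do a CPU copy. The EAL
call is real; the wrapper is illustrative and only works for addresses that
DPDK has registered.

#include <string.h>
#include <rte_memory.h>

static int
sw_copy_from_iova(rte_iova_t src_iova, void *dst_va, size_t len)
{
	void *src_va = rte_mem_iova2virt(src_iova);

	if (src_va == NULL)
		return -1; /* translation unknown to DPDK */

	memcpy(dst_va, src_va, len);
	return 0;
}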

> 
> > Until now, I don't have any good ideas to be compatible with IOVA_AS_PA
> > and IOVA_AS_VA at the same time, because it requires vhost library to
> > select PA/VA for DMA device according to different DPDK iova mode.
> 
> If the DMA device claims to support IOVA_AS_PA at probe time, it should
> be useable by the vhost library. It might not be the more efficient
> mode, but we cannot just have a comment in the documenation saying that
> IOVA_AS_VA is the only supported mode, without any safety check in the
> code itself.

Being compatible with IOVA_AS_PA is possible; the reason for the IOVA_AS_VA
design is that VAs are easier to operate on, so some compatibility was
sacrificed.

I will adopt your suggestion, thanks very much!

Regards,
Xuan

> 
> Regards,
> Maxime
> 
> > Thanks,
> > Xuan
> >
> >>
> >>>
> >>> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
> >>> ---
> >>>
> >>> v3:
> >>> * Fixed some typos.
> >>>
> >>> v2:
> >>> * Fixed a format issue.
> >>> * Added the dma unmap logic when device is closed.
> >>> ---
> >>>  doc/guides/prog_guide/vhost_lib.rst |  20 +++++
> >>>  lib/vhost/vhost_user.c              | 125 +++++++++-------------------
> >>>  lib/vhost/virtio_net.c              |  30 +++----
> >>>  3 files changed, 69 insertions(+), 106 deletions(-)
> >>>
> >>> diff --git a/doc/guides/prog_guide/vhost_lib.rst
> >> b/doc/guides/prog_guide/vhost_lib.rst
> >>> index d18fb98910..5777f0da96 100644
> >>> --- a/doc/guides/prog_guide/vhost_lib.rst
> >>> +++ b/doc/guides/prog_guide/vhost_lib.rst
> >>> @@ -420,3 +420,23 @@ Finally, a set of device ops is defined for device
> >> specific operations:
> >>>  * ``get_notify_area``
> >>>
> >>>    Called to get the notify area info of the queue.
> >>> +
> >>> +Vhost async data path
> >>> +---------------------
> >>> +
> >>> +* Address mode
> >>> +
> >>> +  Modern IOAT devices support to use the IOMMU, which can avoid using
> >>> +  the unsafe HPA. Besides, the CPU cycles took by SW to translate from
> >>> +  GPA to HPA can also be saved. So IOAT devices are defined to use
> >>> +  virtual address instead of physical address.
> >>> +
> >>> +  With IOMMU enabled, to use IOAT devices:
> >>> +  1. IOAT devices must be binded to vfio-pci, rather than igb_uio.
> >>> +  2. DPDK must use ``--iova-mode=va``.
> >>> +
> >>> +* Fallback
> >>> +
> >>> +  When the DMA copy fails, the user who implements the transfer_data
> >>> +  callback can fallback to SW copy or fallback to PA copy through
> >>> +  rte_mem_virt2iova().
> >>> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> >>> index 8f0eba6412..c33fa784ff 100644
> >>> --- a/lib/vhost/vhost_user.c
> >>> +++ b/lib/vhost/vhost_user.c
> >>> @@ -45,6 +45,7 @@
> >>>  #include <rte_common.h>
> >>>  #include <rte_malloc.h>
> >>>  #include <rte_log.h>
> >>> +#include <rte_vfio.h>
> >>>
> >>>  #include "iotlb.h"
> >>>  #include "vhost.h"
> >>> @@ -141,6 +142,34 @@ get_blk_size(int fd)
> >>>  	return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
> >>>  }
> >>>
> >>> +static int
> >>> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> >>> +{
> >>> +	int ret = 0;
> >>> +	if (do_map) {
> >>> +		/* Add mapped region into the default container of DPDK. */
> >>> +		ret =
> >> rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> >>> +						 region->host_user_addr,
> >>> +						 region->host_user_addr,
> >>> +						 region->size);
> >>> +		if (ret) {
> >>> +			VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> >>> +			return ret;
> >>> +		}
> >>> +	} else {
> >>> +		/* Remove mapped region from the default container of DPDK.
> >> */
> >>> +		ret =
> >> rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> >>> +						   region->host_user_addr,
> >>> +						   region->host_user_addr,
> >>> +						   region->size);
> >>> +		if (ret) {
> >>> +			VHOST_LOG_CONFIG(ERR, "DMA engine unmap
> >> failed\n");
> >>> +			return ret;
> >>> +		}
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>> +
> >>>  static void
> >>>  free_mem_region(struct virtio_net *dev)
> >>>  {
> >>> @@ -155,6 +184,9 @@ free_mem_region(struct virtio_net *dev)
> >>>  		if (reg->host_user_addr) {
> >>>  			munmap(reg->mmap_addr, reg->mmap_size);
> >>>  			close(reg->fd);
> >>> +
> >>> +			if (dev->async_copy)
> >>> +				async_dma_map(reg, false);
> >>>  		}
> >>>  	}
> >>>  }
> >>> @@ -866,87 +898,6 @@ vhost_user_set_vring_base(struct virtio_net
> **pdev,
> >>>  	return RTE_VHOST_MSG_RESULT_OK;
> >>>  }
> >>>
> >>> -static int
> >>> -add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
> >>> -		   uint64_t host_phys_addr, uint64_t size)
> >>> -{
> >>> -	struct guest_page *page, *last_page;
> >>> -	struct guest_page *old_pages;
> >>> -
> >>> -	if (dev->nr_guest_pages == dev->max_guest_pages) {
> >>> -		dev->max_guest_pages *= 2;
> >>> -		old_pages = dev->guest_pages;
> >>> -		dev->guest_pages = rte_realloc(dev->guest_pages,
> >>> -					dev->max_guest_pages * sizeof(*page),
> >>> -					RTE_CACHE_LINE_SIZE);
> >>> -		if (dev->guest_pages == NULL) {
> >>> -			VHOST_LOG_CONFIG(ERR, "cannot realloc
> >> guest_pages\n");
> >>> -			rte_free(old_pages);
> >>> -			return -1;
> >>> -		}
> >>> -	}
> >>> -
> >>> -	if (dev->nr_guest_pages > 0) {
> >>> -		last_page = &dev->guest_pages[dev->nr_guest_pages - 1];
> >>> -		/* merge if the two pages are continuous */
> >>> -		if (host_phys_addr == last_page->host_phys_addr +
> >>> -				      last_page->size) {
> >>> -			last_page->size += size;
> >>> -			return 0;
> >>> -		}
> >>> -	}
> >>> -
> >>> -	page = &dev->guest_pages[dev->nr_guest_pages++];
> >>> -	page->guest_phys_addr = guest_phys_addr;
> >>> -	page->host_phys_addr  = host_phys_addr;
> >>> -	page->size = size;
> >>> -
> >>> -	return 0;
> >>> -}
> >>> -
> >>> -static int
> >>> -add_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region
> *reg,
> >>> -		uint64_t page_size)
> >>> -{
> >>> -	uint64_t reg_size = reg->size;
> >>> -	uint64_t host_user_addr  = reg->host_user_addr;
> >>> -	uint64_t guest_phys_addr = reg->guest_phys_addr;
> >>> -	uint64_t host_phys_addr;
> >>> -	uint64_t size;
> >>> -
> >>> -	host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)host_user_addr);
> >>> -	size = page_size - (guest_phys_addr & (page_size - 1));
> >>> -	size = RTE_MIN(size, reg_size);
> >>> -
> >>> -	if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size) <
> >> 0)
> >>> -		return -1;
> >>> -
> >>> -	host_user_addr  += size;
> >>> -	guest_phys_addr += size;
> >>> -	reg_size -= size;
> >>> -
> >>> -	while (reg_size > 0) {
> >>> -		size = RTE_MIN(reg_size, page_size);
> >>> -		host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)
> >>> -						  host_user_addr);
> >>> -		if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr,
> >>> -				size) < 0)
> >>> -			return -1;
> >>> -
> >>> -		host_user_addr  += size;
> >>> -		guest_phys_addr += size;
> >>> -		reg_size -= size;
> >>> -	}
> >>> -
> >>> -	/* sort guest page array if over binary search threshold */
> >>> -	if (dev->nr_guest_pages >= VHOST_BINARY_SEARCH_THRESH) {
> >>> -		qsort((void *)dev->guest_pages, dev->nr_guest_pages,
> >>> -			sizeof(struct guest_page), guest_page_addrcmp);
> >>> -	}
> >>> -
> >>> -	return 0;
> >>> -}
> >>> -
> >>>  #ifdef RTE_LIBRTE_VHOST_DEBUG
> >>>  /* TODO: enable it only in debug mode? */
> >>>  static void
> >>> @@ -1105,6 +1056,7 @@ vhost_user_mmap_region(struct virtio_net *dev,
> >>>  	uint64_t mmap_size;
> >>>  	uint64_t alignment;
> >>>  	int populate;
> >>> +	int ret;
> >>>
> >>>  	/* Check for memory_size + mmap_offset overflow */
> >>>  	if (mmap_offset >= -region->size) {
> >>> @@ -1158,12 +1110,13 @@ vhost_user_mmap_region(struct virtio_net
> *dev,
> >>>  	region->mmap_size = mmap_size;
> >>>  	region->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
> >> mmap_offset;
> >>>
> >>> -	if (dev->async_copy)
> >>> -		if (add_guest_pages(dev, region, alignment) < 0) {
> >>> -			VHOST_LOG_CONFIG(ERR,
> >>> -					"adding guest pages to region
> >> failed.\n");
> >>> +	if (dev->async_copy) {
> >>> +		ret = async_dma_map(region, true);
> >>> +		if (ret) {
> >>> +			VHOST_LOG_CONFIG(ERR, "Configure IOMMU for DMA
> >> engine failed\n");
> >>>  			return -1;
> >>
> >> Maybe we're too late in the init already, but I would think we may want
> >> to fall back to the SW implementation instead.
> >>
> >>> -		}
> >>> +		}
> >>> +	}
> >>>
> >>>  	VHOST_LOG_CONFIG(INFO,
> >>>  			"guest memory region size: 0x%" PRIx64 "\n"
> >>> diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> >>> index 8da8a86a10..88110d2cb3 100644
> >>> --- a/lib/vhost/virtio_net.c
> >>> +++ b/lib/vhost/virtio_net.c
> >>> @@ -980,11 +980,9 @@ async_mbuf_to_desc(struct virtio_net *dev, struct
> >> vhost_virtqueue *vq,
> >>>  	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
> >>>  	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
> >>>  	int error = 0;
> >>> -	uint64_t mapped_len;
> >>>
> >>>  	uint32_t tlen = 0;
> >>>  	int tvec_idx = 0;
> >>> -	void *hpa;
> >>>
> >>>  	if (unlikely(m == NULL)) {
> >>>  		error = -1;
> >>> @@ -1074,27 +1072,19 @@ async_mbuf_to_desc(struct virtio_net *dev,
> >> struct vhost_virtqueue *vq,
> >>>
> >>>  		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
> >>>
> >>> -		while (unlikely(cpy_len && cpy_len >= cpy_threshold)) {
> >>> -			hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev,
> >>> -					buf_iova + buf_offset,
> >>> -					cpy_len, &mapped_len);
> >>> -
> >>> -			if (unlikely(!hpa || mapped_len < cpy_threshold))
> >>> -				break;
> >>> -
> >>> +		if (unlikely(cpy_len >= cpy_threshold)) {
> >>>  			async_fill_vec(src_iovec + tvec_idx,
> >>> -				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
> >>> -				mbuf_offset), (size_t)mapped_len);
> >>> +				rte_pktmbuf_mtod_offset(m, void *,
> >> mbuf_offset), (size_t)cpy_len);
> >>>
> >>>  			async_fill_vec(dst_iovec + tvec_idx,
> >>> -					hpa, (size_t)mapped_len);
> >>> -
> >>> -			tlen += (uint32_t)mapped_len;
> >>> -			cpy_len -= (uint32_t)mapped_len;
> >>> -			mbuf_avail  -= (uint32_t)mapped_len;
> >>> -			mbuf_offset += (uint32_t)mapped_len;
> >>> -			buf_avail  -= (uint32_t)mapped_len;
> >>> -			buf_offset += (uint32_t)mapped_len;
> >>> +				(void *)((uintptr_t)(buf_addr + buf_offset)),
> >> (size_t)cpy_len);
> >>> +
> >>> +			tlen += cpy_len;
> >>> +			mbuf_avail  -= cpy_len;
> >>> +			mbuf_offset += cpy_len;
> >>> +			buf_avail  -= cpy_len;
> >>> +			buf_offset += cpy_len;
> >>> +			cpy_len = 0;
> >>>  			tvec_idx++;
> >>>  		}
> >>>
> >>>
> >
  

Patch

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index d18fb98910..5777f0da96 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -420,3 +420,23 @@  Finally, a set of device ops is defined for device specific operations:
 * ``get_notify_area``
 
   Called to get the notify area info of the queue.
+
+Vhost async data path
+---------------------
+
+* Address mode
+
+  Modern IOAT devices can work behind the IOMMU, which avoids using
+  the unsafe HPA. Besides, the CPU cycles taken by SW to translate from
+  GPA to HPA can also be saved. So IOAT devices are defined to use
+  virtual addresses instead of physical addresses.
+
+  With IOMMU enabled, to use IOAT devices:
+  1. IOAT devices must be bound to vfio-pci, rather than igb_uio.
+  2. DPDK must use ``--iova-mode=va``.
+
+* Fallback
+
+  When the DMA copy fails, the user who implements the transfer_data
+  callback can fall back to SW copy, or to PA copy through
+  rte_mem_virt2iova().
diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
index 8f0eba6412..c33fa784ff 100644
--- a/lib/vhost/vhost_user.c
+++ b/lib/vhost/vhost_user.c
@@ -45,6 +45,7 @@ 
 #include <rte_common.h>
 #include <rte_malloc.h>
 #include <rte_log.h>
+#include <rte_vfio.h>
 
 #include "iotlb.h"
 #include "vhost.h"
@@ -141,6 +142,34 @@  get_blk_size(int fd)
 	return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
 }
 
+static int
+async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
+{
+	int ret = 0;
+	if (do_map) {
+		/* Add mapped region into the default container of DPDK. */
+		ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
+						 region->host_user_addr,
+						 region->host_user_addr,
+						 region->size);
+		if (ret) {
+			VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
+			return ret;
+		}
+	} else {
+		/* Remove mapped region from the default container of DPDK. */
+		ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+						   region->host_user_addr,
+						   region->host_user_addr,
+						   region->size);
+		if (ret) {
+			VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
+			return ret;
+		}
+	}
+	return ret;
+}
+
 static void
 free_mem_region(struct virtio_net *dev)
 {
@@ -155,6 +184,9 @@  free_mem_region(struct virtio_net *dev)
 		if (reg->host_user_addr) {
 			munmap(reg->mmap_addr, reg->mmap_size);
 			close(reg->fd);
+
+			if (dev->async_copy)
+				async_dma_map(reg, false);
 		}
 	}
 }
@@ -866,87 +898,6 @@  vhost_user_set_vring_base(struct virtio_net **pdev,
 	return RTE_VHOST_MSG_RESULT_OK;
 }
 
-static int
-add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
-		   uint64_t host_phys_addr, uint64_t size)
-{
-	struct guest_page *page, *last_page;
-	struct guest_page *old_pages;
-
-	if (dev->nr_guest_pages == dev->max_guest_pages) {
-		dev->max_guest_pages *= 2;
-		old_pages = dev->guest_pages;
-		dev->guest_pages = rte_realloc(dev->guest_pages,
-					dev->max_guest_pages * sizeof(*page),
-					RTE_CACHE_LINE_SIZE);
-		if (dev->guest_pages == NULL) {
-			VHOST_LOG_CONFIG(ERR, "cannot realloc guest_pages\n");
-			rte_free(old_pages);
-			return -1;
-		}
-	}
-
-	if (dev->nr_guest_pages > 0) {
-		last_page = &dev->guest_pages[dev->nr_guest_pages - 1];
-		/* merge if the two pages are continuous */
-		if (host_phys_addr == last_page->host_phys_addr +
-				      last_page->size) {
-			last_page->size += size;
-			return 0;
-		}
-	}
-
-	page = &dev->guest_pages[dev->nr_guest_pages++];
-	page->guest_phys_addr = guest_phys_addr;
-	page->host_phys_addr  = host_phys_addr;
-	page->size = size;
-
-	return 0;
-}
-
-static int
-add_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region *reg,
-		uint64_t page_size)
-{
-	uint64_t reg_size = reg->size;
-	uint64_t host_user_addr  = reg->host_user_addr;
-	uint64_t guest_phys_addr = reg->guest_phys_addr;
-	uint64_t host_phys_addr;
-	uint64_t size;
-
-	host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)host_user_addr);
-	size = page_size - (guest_phys_addr & (page_size - 1));
-	size = RTE_MIN(size, reg_size);
-
-	if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size) < 0)
-		return -1;
-
-	host_user_addr  += size;
-	guest_phys_addr += size;
-	reg_size -= size;
-
-	while (reg_size > 0) {
-		size = RTE_MIN(reg_size, page_size);
-		host_phys_addr = rte_mem_virt2iova((void *)(uintptr_t)
-						  host_user_addr);
-		if (add_one_guest_page(dev, guest_phys_addr, host_phys_addr,
-				size) < 0)
-			return -1;
-
-		host_user_addr  += size;
-		guest_phys_addr += size;
-		reg_size -= size;
-	}
-
-	/* sort guest page array if over binary search threshold */
-	if (dev->nr_guest_pages >= VHOST_BINARY_SEARCH_THRESH) {
-		qsort((void *)dev->guest_pages, dev->nr_guest_pages,
-			sizeof(struct guest_page), guest_page_addrcmp);
-	}
-
-	return 0;
-}
-
 #ifdef RTE_LIBRTE_VHOST_DEBUG
 /* TODO: enable it only in debug mode? */
 static void
@@ -1105,6 +1056,7 @@  vhost_user_mmap_region(struct virtio_net *dev,
 	uint64_t mmap_size;
 	uint64_t alignment;
 	int populate;
+	int ret;
 
 	/* Check for memory_size + mmap_offset overflow */
 	if (mmap_offset >= -region->size) {
@@ -1158,12 +1110,13 @@  vhost_user_mmap_region(struct virtio_net *dev,
 	region->mmap_size = mmap_size;
 	region->host_user_addr = (uint64_t)(uintptr_t)mmap_addr + mmap_offset;
 
-	if (dev->async_copy)
-		if (add_guest_pages(dev, region, alignment) < 0) {
-			VHOST_LOG_CONFIG(ERR,
-					"adding guest pages to region failed.\n");
+	if (dev->async_copy) {
+		ret = async_dma_map(region, true);
+		if (ret) {
+			VHOST_LOG_CONFIG(ERR, "Configure IOMMU for DMA engine failed\n");
 			return -1;
-		}
+		}
+	}
 
 	VHOST_LOG_CONFIG(INFO,
 			"guest memory region size: 0x%" PRIx64 "\n"
diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
index 8da8a86a10..88110d2cb3 100644
--- a/lib/vhost/virtio_net.c
+++ b/lib/vhost/virtio_net.c
@@ -980,11 +980,9 @@  async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
 	struct virtio_net_hdr_mrg_rxbuf tmp_hdr, *hdr = NULL;
 	int error = 0;
-	uint64_t mapped_len;
 
 	uint32_t tlen = 0;
 	int tvec_idx = 0;
-	void *hpa;
 
 	if (unlikely(m == NULL)) {
 		error = -1;
@@ -1074,27 +1072,19 @@  async_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
 		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
 
-		while (unlikely(cpy_len && cpy_len >= cpy_threshold)) {
-			hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev,
-					buf_iova + buf_offset,
-					cpy_len, &mapped_len);
-
-			if (unlikely(!hpa || mapped_len < cpy_threshold))
-				break;
-
+		if (unlikely(cpy_len >= cpy_threshold)) {
 			async_fill_vec(src_iovec + tvec_idx,
-				(void *)(uintptr_t)rte_pktmbuf_iova_offset(m,
-				mbuf_offset), (size_t)mapped_len);
+				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset), (size_t)cpy_len);
 
 			async_fill_vec(dst_iovec + tvec_idx,
-					hpa, (size_t)mapped_len);
-
-			tlen += (uint32_t)mapped_len;
-			cpy_len -= (uint32_t)mapped_len;
-			mbuf_avail  -= (uint32_t)mapped_len;
-			mbuf_offset += (uint32_t)mapped_len;
-			buf_avail  -= (uint32_t)mapped_len;
-			buf_offset += (uint32_t)mapped_len;
+				(void *)((uintptr_t)(buf_addr + buf_offset)), (size_t)cpy_len);
+
+			tlen += cpy_len;
+			mbuf_avail  -= cpy_len;
+			mbuf_offset += cpy_len;
+			buf_avail  -= cpy_len;
+			buf_offset += cpy_len;
+			cpy_len = 0;
 			tvec_idx++;
 		}
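
A minimal, untested sketch of the fallback described in the "Fallback"
documentation hunk above; dma_enqueue_va() is only a stand-in for whatever
enqueue API the application's transfer_data callback drives (e.g. an IOAT
rawdev), not a real DPDK function:

#include <stddef.h>
#include <rte_memcpy.h>
#include <rte_memory.h>

/* Stand-in for the application's DMA enqueue; returns 0 when the engine
 * accepted the VA pair, a negative value otherwise. */
static int
dma_enqueue_va(void *dst, const void *src, size_t len)
{
	(void)dst; (void)src; (void)len;
	return -1; /* pretend the engine refused, to exercise the fallback */
}

static void
copy_one_segment(void *dst, const void *src, size_t len)
{
	if (dma_enqueue_va(dst, src, len) == 0)
		return;

	/* Fallback 1: plain SW copy of the segment. */
	rte_memcpy(dst, src, len);

	/*
	 * Fallback 2 (alternative, for a PA-only engine): retranslate the
	 * virtual addresses with rte_mem_virt2iova() and enqueue the
	 * resulting IOVA/PA pair instead.
	 */
}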