[4/4] vhost: remove dequeue zero-copy support

Message ID 20200915133558.210639-5-maxime.coquelin@redhat.com
State Superseded, archived
Delegated to: Maxime Coquelin

Checks

Context               Check    Description
ci/Intel-compilation  fail     Compilation issues
ci/travis-robot       warning  Travis build: failed
ci/iol-testing        fail     Testing issues
ci/checkpatch         warning  coding style issues

Commit Message

Maxime Coquelin Sept. 15, 2020, 1:35 p.m. UTC
Dequeue zero-copy removal was announced in DPDK v20.08.
This feature brings constraints which make the maintenance
of the Vhost library difficult. Its limitations also make it
difficult for applications to use (Tx vring starvation).

Removing it makes it easier to add new features, and also removes
some code from the hot path, which should bring a performance
improvement for the standard path.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 doc/guides/prog_guide/vhost_lib.rst |  52 +----
 lib/librte_vhost/rte_vhost.h        |   2 +-
 lib/librte_vhost/socket.c           |  47 ----
 lib/librte_vhost/vhost.c            |  14 --
 lib/librte_vhost/vhost.h            |   7 -
 lib/librte_vhost/vhost_user.c       |  79 +------
 lib/librte_vhost/virtio_net.c       | 326 +++-------------------------
 7 files changed, 33 insertions(+), 494 deletions(-)
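
The Tx vring starvation mentioned in the commit message is described, in the
documentation this patch removes, by the condition
(num_i40e_tx_desc - num_virtio_tx_desc > tx_free_threshold). Below is a minimal
sketch of that check in C. The function name and the sample ring sizes in the
note that follows are illustrative assumptions, not part of DPDK.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a DPDK API): evaluates the starvation condition
 * quoted in the removed vhost_lib.rst text,
 *   num_i40e_tx_desc - num_virtio_tx_desc > tx_free_threshold.
 * Signed 64-bit arithmetic keeps the subtraction from wrapping when the
 * virtio ring is larger than the NIC ring.
 */
static bool
zcopy_tx_may_starve(uint32_t nic_tx_desc, uint32_t virtio_tx_desc,
		    uint32_t tx_free_threshold)
{
	int64_t spare = (int64_t)nic_tx_desc - (int64_t)virtio_tx_desc;

	return spare > (int64_t)tx_free_threshold;
}
```

For instance, with 1024 NIC descriptors, 256 virtio descriptors and a
threshold of 32 (sample values, not taken from the patch), the difference is
768, which exceeds 32, so the condition holds and the guest Tx used ring could
starve. With equal ring sizes the difference is 0 and the condition is false.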

Comments

Xia, Chenbo Sept. 23, 2020, 5:44 a.m. UTC | #1
Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, September 15, 2020 9:36 PM
> To: dev@dpdk.org; Xia, Chenbo <chenbo.xia@intel.com>; Stokes, Ian
> <ian.stokes@intel.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Subject: [PATCH 4/4] vhost: remove dequeue zero-copy support
> 
> Dequeue zero-copy removal was announced in DPDK v20.08.
> This feature brings constraints which makes the maintenance
> of the Vhost library difficult. Its limitations makes it also
> diffcult to use by the applications (Tx vring starvation).

s/diffcult/difficult

Besides, there are still some zcopy-related structs and functions in vhost.h.
They should be cleaned up too 😊.

Also, in the function 'free_mem_region', you can delete
'struct vhost_virtqueue *vq;' because it is no longer used.

Thanks!
Chenbo
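
A rough sketch of how free_mem_region might look once the zero-copy drain loop
and the now-unused vq local are both gone. To keep it compilable outside DPDK,
the structures are simplified stand-ins with a "_sketch" suffix, and
release_region() is a hypothetical placeholder for the real per-region
cleanup, which the hunk in this patch does not show.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the DPDK structures; only the fields used here. */
struct mem_region_sketch {
	uint64_t host_user_addr;
};

struct mem_table_sketch {
	uint32_t nregions;
	struct mem_region_sketch regions[8];
};

struct virtio_net_sketch {
	struct mem_table_sketch *mem;
};

/* Counts released regions so the sketch can be checked without mmap(). */
static unsigned int released_regions;

/* Hypothetical placeholder for the real per-region cleanup. */
static void
release_region(struct mem_region_sketch *reg)
{
	reg->host_user_addr = 0;
	released_regions++;
}

static void
free_mem_region_sketch(struct virtio_net_sketch *dev)
{
	uint32_t i;
	struct mem_region_sketch *reg;
	/* No 'struct vhost_virtqueue *vq;' here: nothing uses it any more. */

	if (!dev || !dev->mem)
		return;

	for (i = 0; i < dev->mem->nregions; i++) {
		reg = &dev->mem->regions[i];
		if (reg->host_user_addr)
			release_region(reg);
	}
}
```

With the drain loop removed, the only remaining locals are the loop index and
the region pointer, so leaving the vq declaration behind would likely trigger
an unused-variable warning on builds that enable it.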

> 
> Removing it makes it easier to add new features, and also remove
> some code in the hot path, which should bring a performance
> improvement for the standard path.
> 
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>  doc/guides/prog_guide/vhost_lib.rst |  52 +----
>  lib/librte_vhost/rte_vhost.h        |   2 +-
>  lib/librte_vhost/socket.c           |  47 ----
>  lib/librte_vhost/vhost.c            |  14 --
>  lib/librte_vhost/vhost.h            |   7 -
>  lib/librte_vhost/vhost_user.c       |  79 +------
>  lib/librte_vhost/virtio_net.c       | 326 +++-------------------------
>  7 files changed, 33 insertions(+), 494 deletions(-)
> 
> diff --git a/doc/guides/prog_guide/vhost_lib.rst
> b/doc/guides/prog_guide/vhost_lib.rst
> index b892eec67a..ba4c62aeb8 100644
> --- a/doc/guides/prog_guide/vhost_lib.rst
> +++ b/doc/guides/prog_guide/vhost_lib.rst
> @@ -51,50 +51,6 @@ The following is an overview of some key Vhost API
> functions:
>      This reconnect option is enabled by default. However, it can be
> turned off
>      by setting this flag.
> 
> -  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``
> -
> -    Dequeue zero copy will be enabled when this flag is set. It is
> disabled by
> -    default.
> -
> -    There are some truths (including limitations) you might want to know
> while
> -    setting this flag:
> -
> -    * zero copy is not good for small packets (typically for packet size
> below
> -      512).
> -
> -    * zero copy is really good for VM2VM case. For iperf between two VMs,
> the
> -      boost could be above 70% (when TSO is enabled).
> -
> -    * For zero copy in VM2NIC case, guest Tx used vring may be starved if
> the
> -      PMD driver consume the mbuf but not release them timely.
> -
> -      For example, i40e driver has an optimization to maximum NIC
> pipeline which
> -      postpones returning transmitted mbuf until only tx_free_threshold
> free
> -      descs left. The virtio TX used ring will be starved if the formula
> -      (num_i40e_tx_desc - num_virtio_tx_desc > tx_free_threshold) is true,
> since
> -      i40e will not return back mbuf.
> -
> -      A performance tip for tuning zero copy in VM2NIC case is to adjust
> the
> -      frequency of mbuf free (i.e. adjust tx_free_threshold of i40e
> driver) to
> -      balance consumer and producer.
> -
> -    * Guest memory should be backended with huge pages to achieve better
> -      performance. Using 1G page size is the best.
> -
> -      When dequeue zero copy is enabled, the guest phys address and host
> phys
> -      address mapping has to be established. Using non-huge pages means
> far
> -      more page segments. To make it simple, DPDK vhost does a linear
> search
> -      of those segments, thus the fewer the segments, the quicker we will
> get
> -      the mapping. NOTE: we may speed it by using tree searching in
> future.
> -
> -    * zero copy can not work when using vfio-pci with iommu mode
> currently, this
> -      is because we don't setup iommu dma mapping for guest memory. If
> you have
> -      to use vfio-pci driver, please insert vfio-pci kernel module in
> noiommu
> -      mode.
> -
> -    * The consumer of zero copy mbufs should consume these mbufs as soon
> as
> -      possible, otherwise it may block the operations in vhost.
> -
>    - ``RTE_VHOST_USER_IOMMU_SUPPORT``
> 
>      IOMMU support will be enabled when this flag is set. It is disabled
> by
> @@ -362,16 +318,16 @@ Guest memory requirement
> 
>  * Memory pre-allocation
> 
> -  For non-zerocopy non-async data path, guest memory pre-allocation is
> not a
> +  For non-async data path, guest memory pre-allocation is not a
>    must. This can help save of memory. If users really want the guest
> memory
>    to be pre-allocated (e.g., for performance reason), we can add option
>    ``-mem-prealloc`` when starting QEMU. Or, we can lock all memory at
> vhost
>    side which will force memory to be allocated when mmap at vhost side;
>    option --mlockall in ovs-dpdk is an example in hand.
> 
> -  For async and zerocopy data path, we force the VM memory to be
> -  pre-allocated at vhost lib when mapping the guest memory; and also we
> need
> -  to lock the memory to prevent pages being swapped out to disk.
> +  For async data path, we force the VM memory to be pre-allocated at
> vhost
> +  lib when mapping the guest memory; and also we need to lock the memory
> to
> +  prevent pages being swapped out to disk.
> 
>  * Memory sharing
> 
> diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> index a94c84134d..46019df6fe 100644
> --- a/lib/librte_vhost/rte_vhost.h
> +++ b/lib/librte_vhost/rte_vhost.h
> @@ -28,7 +28,7 @@ extern "C" {
> 
>  #define RTE_VHOST_USER_CLIENT		(1ULL << 0)
>  #define RTE_VHOST_USER_NO_RECONNECT	(1ULL << 1)
> -#define RTE_VHOST_USER_DEQUEUE_ZERO_COPY	(1ULL << 2)
> +#define RTE_VHOST_USER_RESERVED_1	(1ULL << 2)
>  #define RTE_VHOST_USER_IOMMU_SUPPORT	(1ULL << 3)
>  #define RTE_VHOST_USER_POSTCOPY_SUPPORT		(1ULL << 4)
>  /* support mbuf with external buffer attached */
> diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
> index 73e1dca95e..0169d36481 100644
> --- a/lib/librte_vhost/socket.c
> +++ b/lib/librte_vhost/socket.c
> @@ -37,7 +37,6 @@ struct vhost_user_socket {
>  	struct sockaddr_un un;
>  	bool is_server;
>  	bool reconnect;
> -	bool dequeue_zero_copy;
>  	bool iommu_support;
>  	bool use_builtin_virtio_net;
>  	bool extbuf;
> @@ -229,9 +228,6 @@ vhost_user_add_connection(int fd, struct
> vhost_user_socket *vsocket)
> 
>  	vhost_attach_vdpa_device(vid, vsocket->vdpa_dev);
> 
> -	if (vsocket->dequeue_zero_copy)
> -		vhost_enable_dequeue_zero_copy(vid);
> -
>  	if (vsocket->extbuf)
>  		vhost_enable_extbuf(vid);
> 
> @@ -878,18 +874,8 @@ rte_vhost_driver_register(const char *path, uint64_t
> flags)
>  		goto out_free;
>  	}
>  	vsocket->vdpa_dev = NULL;
> -	vsocket->dequeue_zero_copy = flags &
> RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>  	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
>  	vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
> -
> -	if (vsocket->dequeue_zero_copy &&
> -	    (flags & RTE_VHOST_USER_IOMMU_SUPPORT)) {
> -		VHOST_LOG_CONFIG(ERR,
> -			"error: enabling dequeue zero copy and IOMMU features "
> -			"simultaneously is not supported\n");
> -		goto out_mutex;
> -	}
> -
>  	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> 
>  	if (vsocket->async_copy &&
> @@ -918,39 +904,6 @@ rte_vhost_driver_register(const char *path, uint64_t
> flags)
>  	vsocket->features           = VIRTIO_NET_SUPPORTED_FEATURES;
>  	vsocket->protocol_features  = VHOST_USER_PROTOCOL_FEATURES;
> 
> -	/*
> -	 * Dequeue zero copy can't assure descriptors returned in order.
> -	 * Also, it requires that the guest memory is populated, which is
> -	 * not compatible with postcopy.
> -	 */
> -	if (vsocket->dequeue_zero_copy) {
> -		if (vsocket->extbuf) {
> -			VHOST_LOG_CONFIG(ERR,
> -			"error: zero copy is incompatible with external
> buffers\n");
> -			ret = -1;
> -			goto out_mutex;
> -		}
> -		if (vsocket->linearbuf) {
> -			VHOST_LOG_CONFIG(ERR,
> -			"error: zero copy is incompatible with linear
> buffers\n");
> -			ret = -1;
> -			goto out_mutex;
> -		}
> -		if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
> -			VHOST_LOG_CONFIG(ERR,
> -			"error: zero copy is incompatible with vhost client
> mode\n");
> -			ret = -1;
> -			goto out_mutex;
> -		}
> -		vsocket->supported_features &= ~(1ULL << VIRTIO_F_IN_ORDER);
> -		vsocket->features &= ~(1ULL << VIRTIO_F_IN_ORDER);
> -
> -		VHOST_LOG_CONFIG(INFO,
> -			"Dequeue zero copy requested, disabling postcopy
> support\n");
> -		vsocket->protocol_features &=
> -			~(1ULL << VHOST_USER_PROTOCOL_F_PAGEFAULT);
> -	}
> -
>  	if (vsocket->async_copy) {
>  		vsocket->supported_features &= ~(1ULL << VHOST_F_LOG_ALL);
>  		vsocket->features &= ~(1ULL << VHOST_F_LOG_ALL);
> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> index 8f20a0818f..c7cd34e42b 100644
> --- a/lib/librte_vhost/vhost.c
> +++ b/lib/librte_vhost/vhost.c
> @@ -539,8 +539,6 @@ init_vring_queue(struct virtio_net *dev, uint32_t
> vring_idx)
>  	vhost_user_iotlb_init(dev, vring_idx);
>  	/* Backends are set to -1 indicating an inactive device. */
>  	vq->backend = -1;
> -
> -	TAILQ_INIT(&vq->zmbuf_list);
>  }
> 
>  static void
> @@ -704,18 +702,6 @@ vhost_set_ifname(int vid, const char *if_name,
> unsigned int if_len)
>  	dev->ifname[sizeof(dev->ifname) - 1] = '\0';
>  }
> 
> -void
> -vhost_enable_dequeue_zero_copy(int vid)
> -{
> -	struct virtio_net *dev = get_device(vid);
> -
> -	if (dev == NULL)
> -		return;
> -
> -	dev->dequeue_zero_copy = 1;
> -	VHOST_LOG_CONFIG(INFO, "dequeue zero copy is enabled\n");
> -}
> -
>  void
>  vhost_set_builtin_virtio_net(int vid, bool enable)
>  {
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 632f66d532..321cf6e99c 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -379,7 +379,6 @@ struct virtio_net {
>  	/* to tell if we need broadcast rarp packet */
>  	int16_t			broadcast_rarp;
>  	uint32_t		nr_vring;
> -	int			dequeue_zero_copy;
>  	int			async_copy;
>  	int			extbuf;
>  	int			linearbuf;
> @@ -897,10 +896,4 @@ mbuf_is_consumed(struct rte_mbuf *m)
>  	return true;
>  }
> 
> -static __rte_always_inline void
> -put_zmbuf(struct zcopy_mbuf *zmbuf)
> -{
> -	zmbuf->in_use = 0;
> -}
> -
>  #endif /* _VHOST_NET_CDEV_H_ */
> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> index c3c924faec..8af5ea0a98 100644
> --- a/lib/librte_vhost/vhost_user.c
> +++ b/lib/librte_vhost/vhost_user.c
> @@ -134,29 +134,6 @@ get_blk_size(int fd)
>  	return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
>  }
> 
> -/*
> - * Reclaim all the outstanding zmbufs for a virtqueue.
> - */
> -static void
> -drain_zmbuf_list(struct vhost_virtqueue *vq)
> -{
> -	struct zcopy_mbuf *zmbuf, *next;
> -
> -	for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
> -	     zmbuf != NULL; zmbuf = next) {
> -		next = TAILQ_NEXT(zmbuf, next);
> -
> -		while (!mbuf_is_consumed(zmbuf->mbuf))
> -			usleep(1000);
> -
> -		TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
> -		restore_mbuf(zmbuf->mbuf);
> -		rte_pktmbuf_free(zmbuf->mbuf);
> -		put_zmbuf(zmbuf);
> -		vq->nr_zmbuf -= 1;
> -	}
> -}
> -
>  static void
>  free_mem_region(struct virtio_net *dev)
>  {
> @@ -167,14 +144,6 @@ free_mem_region(struct virtio_net *dev)
>  	if (!dev || !dev->mem)
>  		return;
> 
> -	if (dev->dequeue_zero_copy) {
> -		for (i = 0; i < dev->nr_vring; i++) {
> -			vq = dev->virtqueue[i];
> -			if (vq)
> -				drain_zmbuf_list(vq);
> -		}
> -	}
> -
>  	for (i = 0; i < dev->mem->nregions; i++) {
>  		reg = &dev->mem->regions[i];
>  		if (reg->host_user_addr) {
> @@ -454,23 +423,6 @@ vhost_user_set_vring_num(struct virtio_net **pdev,
>  		return RTE_VHOST_MSG_RESULT_ERR;
>  	}
> 
> -	if (dev->dequeue_zero_copy) {
> -		vq->nr_zmbuf = 0;
> -		vq->last_zmbuf_idx = 0;
> -		vq->zmbuf_size = vq->size;
> -		if (vq->zmbufs)
> -			rte_free(vq->zmbufs);
> -		vq->zmbufs = rte_zmalloc(NULL, vq->zmbuf_size *
> -					 sizeof(struct zcopy_mbuf), 0);
> -		if (vq->zmbufs == NULL) {
> -			VHOST_LOG_CONFIG(WARNING,
> -				"failed to allocate mem for zero copy; "
> -				"zero copy is force disabled\n");
> -			dev->dequeue_zero_copy = 0;
> -		}
> -		TAILQ_INIT(&vq->zmbuf_list);
> -	}
> -
>  	if (vq_is_packed(dev)) {
>  		if (vq->shadow_used_packed)
>  			rte_free(vq->shadow_used_packed);
> @@ -524,7 +476,6 @@ numa_realloc(struct virtio_net *dev, int index)
>  	int oldnode, newnode;
>  	struct virtio_net *old_dev;
>  	struct vhost_virtqueue *old_vq, *vq;
> -	struct zcopy_mbuf *new_zmbuf;
>  	struct vring_used_elem *new_shadow_used_split;
>  	struct vring_used_elem_packed *new_shadow_used_packed;
>  	struct batch_copy_elem *new_batch_copy_elems;
> @@ -555,16 +506,6 @@ numa_realloc(struct virtio_net *dev, int index)
>  			return dev;
> 
>  		memcpy(vq, old_vq, sizeof(*vq));
> -		TAILQ_INIT(&vq->zmbuf_list);
> -
> -		if (dev->dequeue_zero_copy) {
> -			new_zmbuf = rte_malloc_socket(NULL, vq->zmbuf_size *
> -					sizeof(struct zcopy_mbuf), 0, newnode);
> -			if (new_zmbuf) {
> -				rte_free(vq->zmbufs);
> -				vq->zmbufs = new_zmbuf;
> -			}
> -		}
> 
>  		if (vq_is_packed(dev)) {
>  			new_shadow_used_packed = rte_malloc_socket(NULL,
> @@ -1179,8 +1120,7 @@ vhost_user_set_mem_table(struct virtio_net **pdev,
> struct VhostUserMsg *msg,
>  			goto err_mmap;
>  		}
> 
> -		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
> -			MAP_POPULATE : 0;
> +		populate = dev->async_copy ? MAP_POPULATE : 0;
>  		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
>  				 MAP_SHARED | populate, fd, 0);
> 
> @@ -1195,7 +1135,7 @@ vhost_user_set_mem_table(struct virtio_net **pdev,
> struct VhostUserMsg *msg,
>  		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
>  				      mmap_offset;
> 
> -		if (dev->dequeue_zero_copy || dev->async_copy)
> +		if (dev->async_copy)
>  			if (add_guest_pages(dev, reg, alignment) < 0) {
>  				VHOST_LOG_CONFIG(ERR,
>  					"adding guest pages to region %u failed.\n",
> @@ -1933,15 +1873,6 @@ vhost_user_set_vring_kick(struct virtio_net **pdev,
> struct VhostUserMsg *msg,
>  	return RTE_VHOST_MSG_RESULT_OK;
>  }
> 
> -static void
> -free_zmbufs(struct vhost_virtqueue *vq)
> -{
> -	drain_zmbuf_list(vq);
> -
> -	rte_free(vq->zmbufs);
> -	vq->zmbufs = NULL;
> -}
> -
>  /*
>   * when virtio is stopped, qemu will send us the GET_VRING_BASE message.
>   */
> @@ -1996,8 +1927,6 @@ vhost_user_get_vring_base(struct virtio_net **pdev,
> 
>  	vq->signalled_used_valid = false;
> 
> -	if (dev->dequeue_zero_copy)
> -		free_zmbufs(vq);
>  	if (vq_is_packed(dev)) {
>  		rte_free(vq->shadow_used_packed);
>  		vq->shadow_used_packed = NULL;
> @@ -2051,10 +1980,6 @@ vhost_user_set_vring_enable(struct virtio_net
> **pdev,
>  		}
>  	}
> 
> -	/* On disable, rings have to be stopped being processed. */
> -	if (!enable && dev->dequeue_zero_copy)
> -		drain_zmbuf_list(dev->virtqueue[index]);
> -
>  	dev->virtqueue[index]->enabled = enable;
> 
>  	return RTE_VHOST_MSG_RESULT_OK;
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index bd9303c8a9..0a0bea1a5a 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -1946,7 +1946,7 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>  		  struct rte_mbuf *m, struct rte_mempool *mbuf_pool)
>  {
>  	uint32_t buf_avail, buf_offset;
> -	uint64_t buf_addr, buf_iova, buf_len;
> +	uint64_t buf_addr, buf_len;
>  	uint32_t mbuf_avail, mbuf_offset;
>  	uint32_t cpy_len;
>  	struct rte_mbuf *cur = m, *prev = m;
> @@ -1958,7 +1958,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>  	int error = 0;
> 
>  	buf_addr = buf_vec[vec_idx].buf_addr;
> -	buf_iova = buf_vec[vec_idx].buf_iova;
>  	buf_len = buf_vec[vec_idx].buf_len;
> 
>  	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
> @@ -1988,14 +1987,12 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>  		buf_offset = dev->vhost_hlen - buf_len;
>  		vec_idx++;
>  		buf_addr = buf_vec[vec_idx].buf_addr;
> -		buf_iova = buf_vec[vec_idx].buf_iova;
>  		buf_len = buf_vec[vec_idx].buf_len;
>  		buf_avail  = buf_len - buf_offset;
>  	} else if (buf_len == dev->vhost_hlen) {
>  		if (unlikely(++vec_idx >= nr_vec))
>  			goto out;
>  		buf_addr = buf_vec[vec_idx].buf_addr;
> -		buf_iova = buf_vec[vec_idx].buf_iova;
>  		buf_len = buf_vec[vec_idx].buf_len;
> 
>  		buf_offset = 0;
> @@ -2012,48 +2009,23 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>  	mbuf_offset = 0;
>  	mbuf_avail  = m->buf_len - RTE_PKTMBUF_HEADROOM;
>  	while (1) {
> -		uint64_t hpa;
> -
>  		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
> 
> -		/*
> -		 * A desc buf might across two host physical pages that are
> -		 * not continuous. In such case (gpa_to_hpa returns 0), data
> -		 * will be copied even though zero copy is enabled.
> -		 */
> -		if (unlikely(dev->dequeue_zero_copy && (hpa = gpa_to_hpa(dev,
> -					buf_iova + buf_offset, cpy_len)))) {
> -			cur->data_len = cpy_len;
> -			cur->data_off = 0;
> -			cur->buf_addr =
> -				(void *)(uintptr_t)(buf_addr + buf_offset);
> -			cur->buf_iova = hpa;
> -
> -			/*
> -			 * In zero copy mode, one mbuf can only reference data
> -			 * for one or partial of one desc buff.
> -			 */
> -			mbuf_avail = cpy_len;
> -		} else {
> -			if (likely(cpy_len > MAX_BATCH_LEN ||
> -				   vq->batch_copy_nb_elems >= vq->size ||
> -				   (hdr && cur == m))) {
> -				rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
> -								   mbuf_offset),
> -					   (void *)((uintptr_t)(buf_addr +
> -							   buf_offset)),
> -					   cpy_len);
> -			} else {
> -				batch_copy[vq->batch_copy_nb_elems].dst =
> -					rte_pktmbuf_mtod_offset(cur, void *,
> -								mbuf_offset);
> -				batch_copy[vq->batch_copy_nb_elems].src =
> +		if (likely(cpy_len > MAX_BATCH_LEN ||
> +					vq->batch_copy_nb_elems >= vq->size ||
> +					(hdr && cur == m))) {
> +			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
> +						mbuf_offset),
>  					(void *)((uintptr_t)(buf_addr +
> -								buf_offset));
> -				batch_copy[vq->batch_copy_nb_elems].len =
> -					cpy_len;
> -				vq->batch_copy_nb_elems++;
> -			}
> +							buf_offset)), cpy_len);
> +		} else {
> +			batch_copy[vq->batch_copy_nb_elems].dst =
> +				rte_pktmbuf_mtod_offset(cur, void *,
> +						mbuf_offset);
> +			batch_copy[vq->batch_copy_nb_elems].src =
> +				(void *)((uintptr_t)(buf_addr + buf_offset));
> +			batch_copy[vq->batch_copy_nb_elems].len = cpy_len;
> +			vq->batch_copy_nb_elems++;
>  		}
> 
>  		mbuf_avail  -= cpy_len;
> @@ -2067,7 +2039,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>  				break;
> 
>  			buf_addr = buf_vec[vec_idx].buf_addr;
> -			buf_iova = buf_vec[vec_idx].buf_iova;
>  			buf_len = buf_vec[vec_idx].buf_len;
> 
>  			buf_offset = 0;
> @@ -2089,8 +2060,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>  				error = -1;
>  				goto out;
>  			}
> -			if (unlikely(dev->dequeue_zero_copy))
> -				rte_mbuf_refcnt_update(cur, 1);
> 
>  			prev->next = cur;
>  			prev->data_len = mbuf_offset;
> @@ -2114,37 +2083,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>  	return error;
>  }
> 
> -static __rte_always_inline struct zcopy_mbuf *
> -get_zmbuf(struct vhost_virtqueue *vq)
> -{
> -	uint16_t i;
> -	uint16_t last;
> -	int tries = 0;
> -
> -	/* search [last_zmbuf_idx, zmbuf_size) */
> -	i = vq->last_zmbuf_idx;
> -	last = vq->zmbuf_size;
> -
> -again:
> -	for (; i < last; i++) {
> -		if (vq->zmbufs[i].in_use == 0) {
> -			vq->last_zmbuf_idx = i + 1;
> -			vq->zmbufs[i].in_use = 1;
> -			return &vq->zmbufs[i];
> -		}
> -	}
> -
> -	tries++;
> -	if (tries == 1) {
> -		/* search [0, last_zmbuf_idx) */
> -		i = 0;
> -		last = vq->last_zmbuf_idx;
> -		goto again;
> -	}
> -
> -	return NULL;
> -}
> -
>  static void
>  virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
>  {
> @@ -2244,30 +2182,6 @@ virtio_dev_tx_split(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>  	uint16_t dropped = 0;
>  	static bool allocerr_warned;
> 
> -	if (unlikely(dev->dequeue_zero_copy)) {
> -		struct zcopy_mbuf *zmbuf, *next;
> -
> -		for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
> -		     zmbuf != NULL; zmbuf = next) {
> -			next = TAILQ_NEXT(zmbuf, next);
> -
> -			if (mbuf_is_consumed(zmbuf->mbuf)) {
> -				update_shadow_used_ring_split(vq,
> -						zmbuf->desc_idx, 0);
> -				TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
> -				restore_mbuf(zmbuf->mbuf);
> -				rte_pktmbuf_free(zmbuf->mbuf);
> -				put_zmbuf(zmbuf);
> -				vq->nr_zmbuf -= 1;
> -			}
> -		}
> -
> -		if (likely(vq->shadow_used_idx)) {
> -			flush_shadow_used_ring_split(dev, vq);
> -			vhost_vring_call_split(dev, vq);
> -		}
> -	}
> -
>  	/*
>  	 * The ordering between avail index and
>  	 * desc reads needs to be enforced.
> @@ -2300,8 +2214,7 @@ virtio_dev_tx_split(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>  						VHOST_ACCESS_RO) < 0))
>  			break;
> 
> -		if (likely(dev->dequeue_zero_copy == 0))
> -			update_shadow_used_ring_split(vq, head_idx, 0);
> +		update_shadow_used_ring_split(vq, head_idx, 0);
> 
>  		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, buf_len);
>  		if (unlikely(pkts[i] == NULL)) {
> @@ -2335,42 +2248,16 @@ virtio_dev_tx_split(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
>  			i++;
>  			break;
>  		}
> -
> -		if (unlikely(dev->dequeue_zero_copy)) {
> -			struct zcopy_mbuf *zmbuf;
> -
> -			zmbuf = get_zmbuf(vq);
> -			if (!zmbuf) {
> -				rte_pktmbuf_free(pkts[i]);
> -				dropped += 1;
> -				i++;
> -				break;
> -			}
> -			zmbuf->mbuf = pkts[i];
> -			zmbuf->desc_idx = head_idx;
> -
> -			/*
> -			 * Pin lock the mbuf; we will check later to see
> -			 * whether the mbuf is freed (when we are the last
> -			 * user) or not. If that's the case, we then could
> -			 * update the used ring safely.
> -			 */
> -			rte_mbuf_refcnt_update(pkts[i], 1);
> -
> -			vq->nr_zmbuf += 1;
> -			TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbuf, next);
> -		}
>  	}
> +
>  	vq->last_avail_idx += i;
> 
> -	if (likely(dev->dequeue_zero_copy == 0)) {
> -		do_data_copy_dequeue(vq);
> -		if (unlikely(i < count))
> -			vq->shadow_used_idx = i;
> -		if (likely(vq->shadow_used_idx)) {
> -			flush_shadow_used_ring_split(dev, vq);
> -			vhost_vring_call_split(dev, vq);
> -		}
> +	do_data_copy_dequeue(vq);
> +	if (unlikely(i < count))
> +		vq->shadow_used_idx = i;
> +	if (likely(vq->shadow_used_idx)) {
> +		flush_shadow_used_ring_split(dev, vq);
> +		vhost_vring_call_split(dev, vq);
>  	}
> 
>  	return (i - dropped);
> @@ -2570,162 +2457,6 @@ virtio_dev_tx_single_packed(struct virtio_net *dev,
>  	return ret;
>  }
> 
> -static __rte_always_inline int
> -virtio_dev_tx_batch_packed_zmbuf(struct virtio_net *dev,
> -				 struct vhost_virtqueue *vq,
> -				 struct rte_mempool *mbuf_pool,
> -				 struct rte_mbuf **pkts)
> -{
> -	struct zcopy_mbuf *zmbufs[PACKED_BATCH_SIZE];
> -	uintptr_t desc_addrs[PACKED_BATCH_SIZE];
> -	uint16_t ids[PACKED_BATCH_SIZE];
> -	uint16_t i;
> -
> -	uint16_t avail_idx = vq->last_avail_idx;
> -
> -	if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> -					     avail_idx, desc_addrs, ids))
> -		return -1;
> -
> -	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> -		zmbufs[i] = get_zmbuf(vq);
> -
> -	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> -		if (!zmbufs[i])
> -			goto free_pkt;
> -	}
> -
> -	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> -		zmbufs[i]->mbuf = pkts[i];
> -		zmbufs[i]->desc_idx = ids[i];
> -		zmbufs[i]->desc_count = 1;
> -	}
> -
> -	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> -		rte_mbuf_refcnt_update(pkts[i], 1);
> -
> -	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> -		TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbufs[i], next);
> -
> -	vq->nr_zmbuf += PACKED_BATCH_SIZE;
> -	vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
> -
> -	return 0;
> -
> -free_pkt:
> -	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> -		rte_pktmbuf_free(pkts[i]);
> -
> -	return -1;
> -}
> -
> -static __rte_always_inline int
> -virtio_dev_tx_single_packed_zmbuf(struct virtio_net *dev,
> -				  struct vhost_virtqueue *vq,
> -				  struct rte_mempool *mbuf_pool,
> -				  struct rte_mbuf **pkts)
> -{
> -	uint16_t buf_id, desc_count;
> -	struct zcopy_mbuf *zmbuf;
> -
> -	if (vhost_dequeue_single_packed(dev, vq, mbuf_pool, pkts, &buf_id,
> -					&desc_count))
> -		return -1;
> -
> -	zmbuf = get_zmbuf(vq);
> -	if (!zmbuf) {
> -		rte_pktmbuf_free(*pkts);
> -		return -1;
> -	}
> -	zmbuf->mbuf = *pkts;
> -	zmbuf->desc_idx = buf_id;
> -	zmbuf->desc_count = desc_count;
> -
> -	rte_mbuf_refcnt_update(*pkts, 1);
> -
> -	vq->nr_zmbuf += 1;
> -	TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbuf, next);
> -
> -	vq_inc_last_avail_packed(vq, desc_count);
> -	return 0;
> -}
> -
> -static __rte_always_inline void
> -free_zmbuf(struct vhost_virtqueue *vq)
> -{
> -	struct zcopy_mbuf *next = NULL;
> -	struct zcopy_mbuf *zmbuf;
> -
> -	for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
> -	     zmbuf != NULL; zmbuf = next) {
> -		next = TAILQ_NEXT(zmbuf, next);
> -
> -		uint16_t last_used_idx = vq->last_used_idx;
> -
> -		if (mbuf_is_consumed(zmbuf->mbuf)) {
> -			uint16_t flags;
> -			flags = vq->desc_packed[last_used_idx].flags;
> -			if (vq->used_wrap_counter) {
> -				flags |= VRING_DESC_F_USED;
> -				flags |= VRING_DESC_F_AVAIL;
> -			} else {
> -				flags &= ~VRING_DESC_F_USED;
> -				flags &= ~VRING_DESC_F_AVAIL;
> -			}
> -
> -			vq->desc_packed[last_used_idx].id = zmbuf->desc_idx;
> -			vq->desc_packed[last_used_idx].len = 0;
> -
> -			rte_smp_wmb();
> -			vq->desc_packed[last_used_idx].flags = flags;
> -
> -			vq_inc_last_used_packed(vq, zmbuf->desc_count);
> -
> -			TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
> -			restore_mbuf(zmbuf->mbuf);
> -			rte_pktmbuf_free(zmbuf->mbuf);
> -			put_zmbuf(zmbuf);
> -			vq->nr_zmbuf -= 1;
> -		}
> -	}
> -}
> -
> -static __rte_noinline uint16_t
> -virtio_dev_tx_packed_zmbuf(struct virtio_net *dev,
> -			   struct vhost_virtqueue *__rte_restrict vq,
> -			   struct rte_mempool *mbuf_pool,
> -			   struct rte_mbuf **__rte_restrict pkts,
> -			   uint32_t count)
> -{
> -	uint32_t pkt_idx = 0;
> -	uint32_t remained = count;
> -
> -	free_zmbuf(vq);
> -
> -	do {
> -		if (remained >= PACKED_BATCH_SIZE) {
> -			if (!virtio_dev_tx_batch_packed_zmbuf(dev, vq,
> -				mbuf_pool, &pkts[pkt_idx])) {
> -				pkt_idx += PACKED_BATCH_SIZE;
> -				remained -= PACKED_BATCH_SIZE;
> -				continue;
> -			}
> -		}
> -
> -		if (virtio_dev_tx_single_packed_zmbuf(dev, vq, mbuf_pool,
> -						      &pkts[pkt_idx]))
> -			break;
> -		pkt_idx++;
> -		remained--;
> -
> -	} while (remained);
> -
> -	if (pkt_idx)
> -		vhost_vring_call_packed(dev, vq);
> -
> -	return pkt_idx;
> -}
> -
>  static __rte_noinline uint16_t
>  virtio_dev_tx_packed(struct virtio_net *dev,
>  		     struct vhost_virtqueue *__rte_restrict vq,
> @@ -2841,14 +2572,9 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>  		count -= 1;
>  	}
> 
> -	if (vq_is_packed(dev)) {
> -		if (unlikely(dev->dequeue_zero_copy))
> -			count = virtio_dev_tx_packed_zmbuf(dev, vq, mbuf_pool,
> -							   pkts, count);
> -		else
> -			count = virtio_dev_tx_packed(dev, vq, mbuf_pool, pkts,
> -						     count);
> -	} else
> +	if (vq_is_packed(dev))
> +		count = virtio_dev_tx_packed(dev, vq, mbuf_pool, pkts, count);
> +	else
>  		count = virtio_dev_tx_split(dev, vq, mbuf_pool, pkts, count);
> 
>  out:
> --
> 2.26.2
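
Since the patch renames bit 2 from RTE_VHOST_USER_DEQUEUE_ZERO_COPY to
RTE_VHOST_USER_RESERVED_1, an application that still sets that bit would need
adjusting. The sketch below clears the retired bit before the flags are handed
to rte_vhost_driver_register(). The helper name and the SKETCH_ prefix are
assumptions made for illustration; the bit values are copied from the
rte_vhost.h hunk. Whether the library rejects or ignores the reserved bit is
not stated in this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Values copied from the rte_vhost.h hunk, redefined locally with a SKETCH_
 * prefix so the example builds without DPDK headers.
 */
#define SKETCH_VHOST_USER_CLIENT        (1ULL << 0)
#define SKETCH_VHOST_USER_NO_RECONNECT  (1ULL << 1)
#define SKETCH_VHOST_USER_RESERVED_1    (1ULL << 2) /* was DEQUEUE_ZERO_COPY */
#define SKETCH_VHOST_USER_IOMMU_SUPPORT (1ULL << 3)

/*
 * Hypothetical application-side helper: drops the retired zero-copy bit and,
 * if had_zcopy is non-NULL, reports whether the caller had requested it.
 */
static uint64_t
strip_legacy_zcopy_flag(uint64_t flags, bool *had_zcopy)
{
	if (had_zcopy)
		*had_zcopy = (flags & SKETCH_VHOST_USER_RESERVED_1) != 0;

	return flags & ~SKETCH_VHOST_USER_RESERVED_1;
}
```

An application could log a one-time notice when had_zcopy comes back true, so
that users relying on the old flag learn that packets are now copied.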

Patch

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index b892eec67a..ba4c62aeb8 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -51,50 +51,6 @@  The following is an overview of some key Vhost API functions:
     This reconnect option is enabled by default. However, it can be turned off
     by setting this flag.
 
-  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``
-
-    Dequeue zero copy will be enabled when this flag is set. It is disabled by
-    default.
-
-    There are some truths (including limitations) you might want to know while
-    setting this flag:
-
-    * zero copy is not good for small packets (typically for packet size below
-      512).
-
-    * zero copy is really good for VM2VM case. For iperf between two VMs, the
-      boost could be above 70% (when TSO is enabled).
-
-    * For zero copy in VM2NIC case, guest Tx used vring may be starved if the
-      PMD driver consume the mbuf but not release them timely.
-
-      For example, i40e driver has an optimization to maximum NIC pipeline which
-      postpones returning transmitted mbuf until only tx_free_threshold free
-      descs left. The virtio TX used ring will be starved if the formula
-      (num_i40e_tx_desc - num_virtio_tx_desc > tx_free_threshold) is true, since
-      i40e will not return back mbuf.
-
-      A performance tip for tuning zero copy in VM2NIC case is to adjust the
-      frequency of mbuf free (i.e. adjust tx_free_threshold of i40e driver) to
-      balance consumer and producer.
-
-    * Guest memory should be backended with huge pages to achieve better
-      performance. Using 1G page size is the best.
-
-      When dequeue zero copy is enabled, the guest phys address and host phys
-      address mapping has to be established. Using non-huge pages means far
-      more page segments. To make it simple, DPDK vhost does a linear search
-      of those segments, thus the fewer the segments, the quicker we will get
-      the mapping. NOTE: we may speed it by using tree searching in future.
-
-    * zero copy can not work when using vfio-pci with iommu mode currently, this
-      is because we don't setup iommu dma mapping for guest memory. If you have
-      to use vfio-pci driver, please insert vfio-pci kernel module in noiommu
-      mode.
-
-    * The consumer of zero copy mbufs should consume these mbufs as soon as
-      possible, otherwise it may block the operations in vhost.
-
   - ``RTE_VHOST_USER_IOMMU_SUPPORT``
 
     IOMMU support will be enabled when this flag is set. It is disabled by
@@ -362,16 +318,16 @@  Guest memory requirement
 
 * Memory pre-allocation
 
-  For non-zerocopy non-async data path, guest memory pre-allocation is not a
+  For non-async data path, guest memory pre-allocation is not a
   must. This can help save of memory. If users really want the guest memory
   to be pre-allocated (e.g., for performance reason), we can add option
   ``-mem-prealloc`` when starting QEMU. Or, we can lock all memory at vhost
   side which will force memory to be allocated when mmap at vhost side;
   option --mlockall in ovs-dpdk is an example in hand.
 
-  For async and zerocopy data path, we force the VM memory to be
-  pre-allocated at vhost lib when mapping the guest memory; and also we need
-  to lock the memory to prevent pages being swapped out to disk.
+  For async data path, we force the VM memory to be pre-allocated at vhost
+  lib when mapping the guest memory; and also we need to lock the memory to
+  prevent pages being swapped out to disk.
 
 * Memory sharing
 
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index a94c84134d..46019df6fe 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -28,7 +28,7 @@  extern "C" {
 
 #define RTE_VHOST_USER_CLIENT		(1ULL << 0)
 #define RTE_VHOST_USER_NO_RECONNECT	(1ULL << 1)
-#define RTE_VHOST_USER_DEQUEUE_ZERO_COPY	(1ULL << 2)
+#define RTE_VHOST_USER_RESERVED_1	(1ULL << 2)
 #define RTE_VHOST_USER_IOMMU_SUPPORT	(1ULL << 3)
 #define RTE_VHOST_USER_POSTCOPY_SUPPORT		(1ULL << 4)
 /* support mbuf with external buffer attached */
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 73e1dca95e..0169d36481 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -37,7 +37,6 @@  struct vhost_user_socket {
 	struct sockaddr_un un;
 	bool is_server;
 	bool reconnect;
-	bool dequeue_zero_copy;
 	bool iommu_support;
 	bool use_builtin_virtio_net;
 	bool extbuf;
@@ -229,9 +228,6 @@  vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 
 	vhost_attach_vdpa_device(vid, vsocket->vdpa_dev);
 
-	if (vsocket->dequeue_zero_copy)
-		vhost_enable_dequeue_zero_copy(vid);
-
 	if (vsocket->extbuf)
 		vhost_enable_extbuf(vid);
 
@@ -878,18 +874,8 @@  rte_vhost_driver_register(const char *path, uint64_t flags)
 		goto out_free;
 	}
 	vsocket->vdpa_dev = NULL;
-	vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
 	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
 	vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
-
-	if (vsocket->dequeue_zero_copy &&
-	    (flags & RTE_VHOST_USER_IOMMU_SUPPORT)) {
-		VHOST_LOG_CONFIG(ERR,
-			"error: enabling dequeue zero copy and IOMMU features "
-			"simultaneously is not supported\n");
-		goto out_mutex;
-	}
-
 	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
 
 	if (vsocket->async_copy &&
@@ -918,39 +904,6 @@  rte_vhost_driver_register(const char *path, uint64_t flags)
 	vsocket->features           = VIRTIO_NET_SUPPORTED_FEATURES;
 	vsocket->protocol_features  = VHOST_USER_PROTOCOL_FEATURES;
 
-	/*
-	 * Dequeue zero copy can't assure descriptors returned in order.
-	 * Also, it requires that the guest memory is populated, which is
-	 * not compatible with postcopy.
-	 */
-	if (vsocket->dequeue_zero_copy) {
-		if (vsocket->extbuf) {
-			VHOST_LOG_CONFIG(ERR,
-			"error: zero copy is incompatible with external buffers\n");
-			ret = -1;
-			goto out_mutex;
-		}
-		if (vsocket->linearbuf) {
-			VHOST_LOG_CONFIG(ERR,
-			"error: zero copy is incompatible with linear buffers\n");
-			ret = -1;
-			goto out_mutex;
-		}
-		if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
-			VHOST_LOG_CONFIG(ERR,
-			"error: zero copy is incompatible with vhost client mode\n");
-			ret = -1;
-			goto out_mutex;
-		}
-		vsocket->supported_features &= ~(1ULL << VIRTIO_F_IN_ORDER);
-		vsocket->features &= ~(1ULL << VIRTIO_F_IN_ORDER);
-
-		VHOST_LOG_CONFIG(INFO,
-			"Dequeue zero copy requested, disabling postcopy support\n");
-		vsocket->protocol_features &=
-			~(1ULL << VHOST_USER_PROTOCOL_F_PAGEFAULT);
-	}
-
 	if (vsocket->async_copy) {
 		vsocket->supported_features &= ~(1ULL << VHOST_F_LOG_ALL);
 		vsocket->features &= ~(1ULL << VHOST_F_LOG_ALL);
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 8f20a0818f..c7cd34e42b 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -539,8 +539,6 @@  init_vring_queue(struct virtio_net *dev, uint32_t vring_idx)
 	vhost_user_iotlb_init(dev, vring_idx);
 	/* Backends are set to -1 indicating an inactive device. */
 	vq->backend = -1;
-
-	TAILQ_INIT(&vq->zmbuf_list);
 }
 
 static void
@@ -704,18 +702,6 @@  vhost_set_ifname(int vid, const char *if_name, unsigned int if_len)
 	dev->ifname[sizeof(dev->ifname) - 1] = '\0';
 }
 
-void
-vhost_enable_dequeue_zero_copy(int vid)
-{
-	struct virtio_net *dev = get_device(vid);
-
-	if (dev == NULL)
-		return;
-
-	dev->dequeue_zero_copy = 1;
-	VHOST_LOG_CONFIG(INFO, "dequeue zero copy is enabled\n");
-}
-
 void
 vhost_set_builtin_virtio_net(int vid, bool enable)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 632f66d532..321cf6e99c 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -379,7 +379,6 @@  struct virtio_net {
 	/* to tell if we need broadcast rarp packet */
 	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
-	int			dequeue_zero_copy;
 	int			async_copy;
 	int			extbuf;
 	int			linearbuf;
@@ -897,10 +896,4 @@  mbuf_is_consumed(struct rte_mbuf *m)
 	return true;
 }
 
-static __rte_always_inline void
-put_zmbuf(struct zcopy_mbuf *zmbuf)
-{
-	zmbuf->in_use = 0;
-}
-
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index c3c924faec..8af5ea0a98 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -134,29 +134,6 @@  get_blk_size(int fd)
 	return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
 }
 
-/*
- * Reclaim all the outstanding zmbufs for a virtqueue.
- */
-static void
-drain_zmbuf_list(struct vhost_virtqueue *vq)
-{
-	struct zcopy_mbuf *zmbuf, *next;
-
-	for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
-	     zmbuf != NULL; zmbuf = next) {
-		next = TAILQ_NEXT(zmbuf, next);
-
-		while (!mbuf_is_consumed(zmbuf->mbuf))
-			usleep(1000);
-
-		TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
-		restore_mbuf(zmbuf->mbuf);
-		rte_pktmbuf_free(zmbuf->mbuf);
-		put_zmbuf(zmbuf);
-		vq->nr_zmbuf -= 1;
-	}
-}
-
 static void
 free_mem_region(struct virtio_net *dev)
 {
@@ -167,14 +144,6 @@  free_mem_region(struct virtio_net *dev)
 	if (!dev || !dev->mem)
 		return;
 
-	if (dev->dequeue_zero_copy) {
-		for (i = 0; i < dev->nr_vring; i++) {
-			vq = dev->virtqueue[i];
-			if (vq)
-				drain_zmbuf_list(vq);
-		}
-	}
-
 	for (i = 0; i < dev->mem->nregions; i++) {
 		reg = &dev->mem->regions[i];
 		if (reg->host_user_addr) {
@@ -454,23 +423,6 @@  vhost_user_set_vring_num(struct virtio_net **pdev,
 		return RTE_VHOST_MSG_RESULT_ERR;
 	}
 
-	if (dev->dequeue_zero_copy) {
-		vq->nr_zmbuf = 0;
-		vq->last_zmbuf_idx = 0;
-		vq->zmbuf_size = vq->size;
-		if (vq->zmbufs)
-			rte_free(vq->zmbufs);
-		vq->zmbufs = rte_zmalloc(NULL, vq->zmbuf_size *
-					 sizeof(struct zcopy_mbuf), 0);
-		if (vq->zmbufs == NULL) {
-			VHOST_LOG_CONFIG(WARNING,
-				"failed to allocate mem for zero copy; "
-				"zero copy is force disabled\n");
-			dev->dequeue_zero_copy = 0;
-		}
-		TAILQ_INIT(&vq->zmbuf_list);
-	}
-
 	if (vq_is_packed(dev)) {
 		if (vq->shadow_used_packed)
 			rte_free(vq->shadow_used_packed);
@@ -524,7 +476,6 @@  numa_realloc(struct virtio_net *dev, int index)
 	int oldnode, newnode;
 	struct virtio_net *old_dev;
 	struct vhost_virtqueue *old_vq, *vq;
-	struct zcopy_mbuf *new_zmbuf;
 	struct vring_used_elem *new_shadow_used_split;
 	struct vring_used_elem_packed *new_shadow_used_packed;
 	struct batch_copy_elem *new_batch_copy_elems;
@@ -555,16 +506,6 @@  numa_realloc(struct virtio_net *dev, int index)
 			return dev;
 
 		memcpy(vq, old_vq, sizeof(*vq));
-		TAILQ_INIT(&vq->zmbuf_list);
-
-		if (dev->dequeue_zero_copy) {
-			new_zmbuf = rte_malloc_socket(NULL, vq->zmbuf_size *
-					sizeof(struct zcopy_mbuf), 0, newnode);
-			if (new_zmbuf) {
-				rte_free(vq->zmbufs);
-				vq->zmbufs = new_zmbuf;
-			}
-		}
 
 		if (vq_is_packed(dev)) {
 			new_shadow_used_packed = rte_malloc_socket(NULL,
@@ -1179,8 +1120,7 @@  vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 			goto err_mmap;
 		}
 
-		populate = (dev->dequeue_zero_copy || dev->async_copy) ?
-			MAP_POPULATE : 0;
+		populate = dev->async_copy ? MAP_POPULATE : 0;
 		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
 				 MAP_SHARED | populate, fd, 0);
 
@@ -1195,7 +1135,7 @@  vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
 		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
 				      mmap_offset;
 
-		if (dev->dequeue_zero_copy || dev->async_copy)
+		if (dev->async_copy)
 			if (add_guest_pages(dev, reg, alignment) < 0) {
 				VHOST_LOG_CONFIG(ERR,
 					"adding guest pages to region %u failed.\n",
@@ -1933,15 +1873,6 @@  vhost_user_set_vring_kick(struct virtio_net **pdev, struct VhostUserMsg *msg,
 	return RTE_VHOST_MSG_RESULT_OK;
 }
 
-static void
-free_zmbufs(struct vhost_virtqueue *vq)
-{
-	drain_zmbuf_list(vq);
-
-	rte_free(vq->zmbufs);
-	vq->zmbufs = NULL;
-}
-
 /*
  * when virtio is stopped, qemu will send us the GET_VRING_BASE message.
  */
@@ -1996,8 +1927,6 @@  vhost_user_get_vring_base(struct virtio_net **pdev,
 
 	vq->signalled_used_valid = false;
 
-	if (dev->dequeue_zero_copy)
-		free_zmbufs(vq);
 	if (vq_is_packed(dev)) {
 		rte_free(vq->shadow_used_packed);
 		vq->shadow_used_packed = NULL;
@@ -2051,10 +1980,6 @@  vhost_user_set_vring_enable(struct virtio_net **pdev,
 		}
 	}
 
-	/* On disable, rings have to be stopped being processed. */
-	if (!enable && dev->dequeue_zero_copy)
-		drain_zmbuf_list(dev->virtqueue[index]);
-
 	dev->virtqueue[index]->enabled = enable;
 
 	return RTE_VHOST_MSG_RESULT_OK;
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index bd9303c8a9..0a0bea1a5a 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1946,7 +1946,7 @@  copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		  struct rte_mbuf *m, struct rte_mempool *mbuf_pool)
 {
 	uint32_t buf_avail, buf_offset;
-	uint64_t buf_addr, buf_iova, buf_len;
+	uint64_t buf_addr, buf_len;
 	uint32_t mbuf_avail, mbuf_offset;
 	uint32_t cpy_len;
 	struct rte_mbuf *cur = m, *prev = m;
@@ -1958,7 +1958,6 @@  copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	int error = 0;
 
 	buf_addr = buf_vec[vec_idx].buf_addr;
-	buf_iova = buf_vec[vec_idx].buf_iova;
 	buf_len = buf_vec[vec_idx].buf_len;
 
 	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
@@ -1988,14 +1987,12 @@  copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		buf_offset = dev->vhost_hlen - buf_len;
 		vec_idx++;
 		buf_addr = buf_vec[vec_idx].buf_addr;
-		buf_iova = buf_vec[vec_idx].buf_iova;
 		buf_len = buf_vec[vec_idx].buf_len;
 		buf_avail  = buf_len - buf_offset;
 	} else if (buf_len == dev->vhost_hlen) {
 		if (unlikely(++vec_idx >= nr_vec))
 			goto out;
 		buf_addr = buf_vec[vec_idx].buf_addr;
-		buf_iova = buf_vec[vec_idx].buf_iova;
 		buf_len = buf_vec[vec_idx].buf_len;
 
 		buf_offset = 0;
@@ -2012,48 +2009,23 @@  copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	mbuf_offset = 0;
 	mbuf_avail  = m->buf_len - RTE_PKTMBUF_HEADROOM;
 	while (1) {
-		uint64_t hpa;
-
 		cpy_len = RTE_MIN(buf_avail, mbuf_avail);
 
-		/*
-		 * A desc buf might across two host physical pages that are
-		 * not continuous. In such case (gpa_to_hpa returns 0), data
-		 * will be copied even though zero copy is enabled.
-		 */
-		if (unlikely(dev->dequeue_zero_copy && (hpa = gpa_to_hpa(dev,
-					buf_iova + buf_offset, cpy_len)))) {
-			cur->data_len = cpy_len;
-			cur->data_off = 0;
-			cur->buf_addr =
-				(void *)(uintptr_t)(buf_addr + buf_offset);
-			cur->buf_iova = hpa;
-
-			/*
-			 * In zero copy mode, one mbuf can only reference data
-			 * for one or partial of one desc buff.
-			 */
-			mbuf_avail = cpy_len;
-		} else {
-			if (likely(cpy_len > MAX_BATCH_LEN ||
-				   vq->batch_copy_nb_elems >= vq->size ||
-				   (hdr && cur == m))) {
-				rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
-								   mbuf_offset),
-					   (void *)((uintptr_t)(buf_addr +
-							   buf_offset)),
-					   cpy_len);
-			} else {
-				batch_copy[vq->batch_copy_nb_elems].dst =
-					rte_pktmbuf_mtod_offset(cur, void *,
-								mbuf_offset);
-				batch_copy[vq->batch_copy_nb_elems].src =
+		if (likely(cpy_len > MAX_BATCH_LEN ||
+					vq->batch_copy_nb_elems >= vq->size ||
+					(hdr && cur == m))) {
+			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
+						mbuf_offset),
 					(void *)((uintptr_t)(buf_addr +
-								buf_offset));
-				batch_copy[vq->batch_copy_nb_elems].len =
-					cpy_len;
-				vq->batch_copy_nb_elems++;
-			}
+							buf_offset)), cpy_len);
+		} else {
+			batch_copy[vq->batch_copy_nb_elems].dst =
+				rte_pktmbuf_mtod_offset(cur, void *,
+						mbuf_offset);
+			batch_copy[vq->batch_copy_nb_elems].src =
+				(void *)((uintptr_t)(buf_addr + buf_offset));
+			batch_copy[vq->batch_copy_nb_elems].len = cpy_len;
+			vq->batch_copy_nb_elems++;
 		}
 
 		mbuf_avail  -= cpy_len;
@@ -2067,7 +2039,6 @@  copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 				break;
 
 			buf_addr = buf_vec[vec_idx].buf_addr;
-			buf_iova = buf_vec[vec_idx].buf_iova;
 			buf_len = buf_vec[vec_idx].buf_len;
 
 			buf_offset = 0;
@@ -2089,8 +2060,6 @@  copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 				error = -1;
 				goto out;
 			}
-			if (unlikely(dev->dequeue_zero_copy))
-				rte_mbuf_refcnt_update(cur, 1);
 
 			prev->next = cur;
 			prev->data_len = mbuf_offset;
@@ -2114,37 +2083,6 @@  copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return error;
 }
 
-static __rte_always_inline struct zcopy_mbuf *
-get_zmbuf(struct vhost_virtqueue *vq)
-{
-	uint16_t i;
-	uint16_t last;
-	int tries = 0;
-
-	/* search [last_zmbuf_idx, zmbuf_size) */
-	i = vq->last_zmbuf_idx;
-	last = vq->zmbuf_size;
-
-again:
-	for (; i < last; i++) {
-		if (vq->zmbufs[i].in_use == 0) {
-			vq->last_zmbuf_idx = i + 1;
-			vq->zmbufs[i].in_use = 1;
-			return &vq->zmbufs[i];
-		}
-	}
-
-	tries++;
-	if (tries == 1) {
-		/* search [0, last_zmbuf_idx) */
-		i = 0;
-		last = vq->last_zmbuf_idx;
-		goto again;
-	}
-
-	return NULL;
-}
-
 static void
 virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
 {
@@ -2244,30 +2182,6 @@  virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	uint16_t dropped = 0;
 	static bool allocerr_warned;
 
-	if (unlikely(dev->dequeue_zero_copy)) {
-		struct zcopy_mbuf *zmbuf, *next;
-
-		for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
-		     zmbuf != NULL; zmbuf = next) {
-			next = TAILQ_NEXT(zmbuf, next);
-
-			if (mbuf_is_consumed(zmbuf->mbuf)) {
-				update_shadow_used_ring_split(vq,
-						zmbuf->desc_idx, 0);
-				TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
-				restore_mbuf(zmbuf->mbuf);
-				rte_pktmbuf_free(zmbuf->mbuf);
-				put_zmbuf(zmbuf);
-				vq->nr_zmbuf -= 1;
-			}
-		}
-
-		if (likely(vq->shadow_used_idx)) {
-			flush_shadow_used_ring_split(dev, vq);
-			vhost_vring_call_split(dev, vq);
-		}
-	}
-
 	/*
 	 * The ordering between avail index and
 	 * desc reads needs to be enforced.
@@ -2300,8 +2214,7 @@  virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 						VHOST_ACCESS_RO) < 0))
 			break;
 
-		if (likely(dev->dequeue_zero_copy == 0))
-			update_shadow_used_ring_split(vq, head_idx, 0);
+		update_shadow_used_ring_split(vq, head_idx, 0);
 
 		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, buf_len);
 		if (unlikely(pkts[i] == NULL)) {
@@ -2335,42 +2248,16 @@  virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			i++;
 			break;
 		}
-
-		if (unlikely(dev->dequeue_zero_copy)) {
-			struct zcopy_mbuf *zmbuf;
-
-			zmbuf = get_zmbuf(vq);
-			if (!zmbuf) {
-				rte_pktmbuf_free(pkts[i]);
-				dropped += 1;
-				i++;
-				break;
-			}
-			zmbuf->mbuf = pkts[i];
-			zmbuf->desc_idx = head_idx;
-
-			/*
-			 * Pin lock the mbuf; we will check later to see
-			 * whether the mbuf is freed (when we are the last
-			 * user) or not. If that's the case, we then could
-			 * update the used ring safely.
-			 */
-			rte_mbuf_refcnt_update(pkts[i], 1);
-
-			vq->nr_zmbuf += 1;
-			TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbuf, next);
-		}
 	}
+
 	vq->last_avail_idx += i;
 
-	if (likely(dev->dequeue_zero_copy == 0)) {
-		do_data_copy_dequeue(vq);
-		if (unlikely(i < count))
-			vq->shadow_used_idx = i;
-		if (likely(vq->shadow_used_idx)) {
-			flush_shadow_used_ring_split(dev, vq);
-			vhost_vring_call_split(dev, vq);
-		}
+	do_data_copy_dequeue(vq);
+	if (unlikely(i < count))
+		vq->shadow_used_idx = i;
+	if (likely(vq->shadow_used_idx)) {
+		flush_shadow_used_ring_split(dev, vq);
+		vhost_vring_call_split(dev, vq);
 	}
 
 	return (i - dropped);
@@ -2570,162 +2457,6 @@  virtio_dev_tx_single_packed(struct virtio_net *dev,
 	return ret;
 }
 
-static __rte_always_inline int
-virtio_dev_tx_batch_packed_zmbuf(struct virtio_net *dev,
-				 struct vhost_virtqueue *vq,
-				 struct rte_mempool *mbuf_pool,
-				 struct rte_mbuf **pkts)
-{
-	struct zcopy_mbuf *zmbufs[PACKED_BATCH_SIZE];
-	uintptr_t desc_addrs[PACKED_BATCH_SIZE];
-	uint16_t ids[PACKED_BATCH_SIZE];
-	uint16_t i;
-
-	uint16_t avail_idx = vq->last_avail_idx;
-
-	if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
-					     avail_idx, desc_addrs, ids))
-		return -1;
-
-	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
-		zmbufs[i] = get_zmbuf(vq);
-
-	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
-		if (!zmbufs[i])
-			goto free_pkt;
-	}
-
-	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
-		zmbufs[i]->mbuf = pkts[i];
-		zmbufs[i]->desc_idx = ids[i];
-		zmbufs[i]->desc_count = 1;
-	}
-
-	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
-		rte_mbuf_refcnt_update(pkts[i], 1);
-
-	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
-		TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbufs[i], next);
-
-	vq->nr_zmbuf += PACKED_BATCH_SIZE;
-	vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
-
-	return 0;
-
-free_pkt:
-	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
-		rte_pktmbuf_free(pkts[i]);
-
-	return -1;
-}
-
-static __rte_always_inline int
-virtio_dev_tx_single_packed_zmbuf(struct virtio_net *dev,
-				  struct vhost_virtqueue *vq,
-				  struct rte_mempool *mbuf_pool,
-				  struct rte_mbuf **pkts)
-{
-	uint16_t buf_id, desc_count;
-	struct zcopy_mbuf *zmbuf;
-
-	if (vhost_dequeue_single_packed(dev, vq, mbuf_pool, pkts, &buf_id,
-					&desc_count))
-		return -1;
-
-	zmbuf = get_zmbuf(vq);
-	if (!zmbuf) {
-		rte_pktmbuf_free(*pkts);
-		return -1;
-	}
-	zmbuf->mbuf = *pkts;
-	zmbuf->desc_idx = buf_id;
-	zmbuf->desc_count = desc_count;
-
-	rte_mbuf_refcnt_update(*pkts, 1);
-
-	vq->nr_zmbuf += 1;
-	TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbuf, next);
-
-	vq_inc_last_avail_packed(vq, desc_count);
-	return 0;
-}
-
-static __rte_always_inline void
-free_zmbuf(struct vhost_virtqueue *vq)
-{
-	struct zcopy_mbuf *next = NULL;
-	struct zcopy_mbuf *zmbuf;
-
-	for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
-	     zmbuf != NULL; zmbuf = next) {
-		next = TAILQ_NEXT(zmbuf, next);
-
-		uint16_t last_used_idx = vq->last_used_idx;
-
-		if (mbuf_is_consumed(zmbuf->mbuf)) {
-			uint16_t flags;
-			flags = vq->desc_packed[last_used_idx].flags;
-			if (vq->used_wrap_counter) {
-				flags |= VRING_DESC_F_USED;
-				flags |= VRING_DESC_F_AVAIL;
-			} else {
-				flags &= ~VRING_DESC_F_USED;
-				flags &= ~VRING_DESC_F_AVAIL;
-			}
-
-			vq->desc_packed[last_used_idx].id = zmbuf->desc_idx;
-			vq->desc_packed[last_used_idx].len = 0;
-
-			rte_smp_wmb();
-			vq->desc_packed[last_used_idx].flags = flags;
-
-			vq_inc_last_used_packed(vq, zmbuf->desc_count);
-
-			TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
-			restore_mbuf(zmbuf->mbuf);
-			rte_pktmbuf_free(zmbuf->mbuf);
-			put_zmbuf(zmbuf);
-			vq->nr_zmbuf -= 1;
-		}
-	}
-}
-
-static __rte_noinline uint16_t
-virtio_dev_tx_packed_zmbuf(struct virtio_net *dev,
-			   struct vhost_virtqueue *__rte_restrict vq,
-			   struct rte_mempool *mbuf_pool,
-			   struct rte_mbuf **__rte_restrict pkts,
-			   uint32_t count)
-{
-	uint32_t pkt_idx = 0;
-	uint32_t remained = count;
-
-	free_zmbuf(vq);
-
-	do {
-		if (remained >= PACKED_BATCH_SIZE) {
-			if (!virtio_dev_tx_batch_packed_zmbuf(dev, vq,
-				mbuf_pool, &pkts[pkt_idx])) {
-				pkt_idx += PACKED_BATCH_SIZE;
-				remained -= PACKED_BATCH_SIZE;
-				continue;
-			}
-		}
-
-		if (virtio_dev_tx_single_packed_zmbuf(dev, vq, mbuf_pool,
-						      &pkts[pkt_idx]))
-			break;
-		pkt_idx++;
-		remained--;
-
-	} while (remained);
-
-	if (pkt_idx)
-		vhost_vring_call_packed(dev, vq);
-
-	return pkt_idx;
-}
-
 static __rte_noinline uint16_t
 virtio_dev_tx_packed(struct virtio_net *dev,
 		     struct vhost_virtqueue *__rte_restrict vq,
@@ -2841,14 +2572,9 @@  rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 		count -= 1;
 	}
 
-	if (vq_is_packed(dev)) {
-		if (unlikely(dev->dequeue_zero_copy))
-			count = virtio_dev_tx_packed_zmbuf(dev, vq, mbuf_pool,
-							   pkts, count);
-		else
-			count = virtio_dev_tx_packed(dev, vq, mbuf_pool, pkts,
-						     count);
-	} else
+	if (vq_is_packed(dev))
+		count = virtio_dev_tx_packed(dev, vq, mbuf_pool, pkts, count);
+	else
 		count = virtio_dev_tx_split(dev, vq, mbuf_pool, pkts, count);
 
 out:
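
Note for application authors (not part of this patch): with bit 2 renamed to RTE_VHOST_USER_RESERVED_1, callers that used to pass RTE_VHOST_USER_DEQUEUE_ZERO_COPY to rte_vhost_driver_register() will need to stop setting it. A minimal sketch of an application-side guard follows. It assumes the flag values shown in the rte_vhost.h hunk above; the SKETCH_* macros and sketch_strip_zero_copy() are hypothetical names introduced only for illustration and are not DPDK APIs.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Values mirror the rte_vhost.h hunk in this patch. They are redefined
 * under SKETCH_* names (an assumption of this example) so the snippet
 * builds without the DPDK headers.
 */
#define SKETCH_VHOST_USER_CLIENT	(1ULL << 0)
#define SKETCH_VHOST_USER_RESERVED_1	(1ULL << 2) /* was DEQUEUE_ZERO_COPY */
#define SKETCH_VHOST_USER_IOMMU_SUPPORT	(1ULL << 3)

/*
 * Hypothetical helper: clear the retired bit from a flags word and
 * report whether the caller had requested dequeue zero-copy, so the
 * application can log a warning before registering the socket.
 */
static inline uint64_t
sketch_strip_zero_copy(uint64_t flags, int *was_requested)
{
	if (was_requested != NULL)
		*was_requested = (flags & SKETCH_VHOST_USER_RESERVED_1) != 0;

	return flags & ~SKETCH_VHOST_USER_RESERVED_1;
}
```

An application would call the helper on its configured flags, warn if was_requested is set, and pass the returned value on to the registration call. All other bits are left untouched.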