vhost: add support to large linear mbufs

Message ID 20191001221935.12140-1-fbl@sysclose.org (mailing list archive)
State Superseded, archived
Series vhost: add support to large linear mbufs

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation success Compilation OK
ci/iol-compilation warning Compile Testing issues
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS

Commit Message

Flavio Leitner Oct. 1, 2019, 10:19 p.m. UTC
  The rte_vhost_dequeue_burst API supports two ways of dequeuing data. If
the data fits into a buffer, then all data is copied and a single
linear buffer is returned. Otherwise, it allocates additional mbufs
and chains them together to return a multi-segment mbuf.

While that covers most use cases, it forces applications that need
to work with larger data sizes to support multi-segment mbufs.
The non-linear characteristic brings complexity and performance
implications to the application.

To resolve the issue, change the API so that the application can
optionally provide a second mempool containing larger mbufs. If that
is not provided (NULL), the behavior remains the same as before the
change. Otherwise, the data size is checked and the corresponding
mempool is used to return linear mbufs.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
---
 drivers/net/vhost/rte_eth_vhost.c |  4 +--
 examples/tep_termination/main.c   |  2 +-
 examples/vhost/main.c             |  2 +-
 lib/librte_vhost/rte_vhost.h      |  5 ++-
 lib/librte_vhost/virtio_net.c     | 57 +++++++++++++++++++++++--------
 5 files changed, 50 insertions(+), 20 deletions(-)
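
For illustration only (not part of the patch or the thread): a minimal
sketch of how an application might call the API with the signature proposed
by this patch, assuming a regular mempool plus an optional large one; the
function name, pool names and burst size are made up for the example.
Passing NULL as the extra pool keeps today's behavior.

#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_vhost.h>

#define BURST_SZ 32

/* Dequeue a burst, letting vhost pick the large pool for big payloads. */
static uint16_t
dequeue_example(int vid, uint16_t queue_id,
		struct rte_mempool *mbuf_pool,       /* regular data room */
		struct rte_mempool *mbuf_pool_large, /* larger data room, may be NULL */
		struct rte_mbuf **pkts)
{
	/* With the patch applied, descriptors larger than the regular pool's
	 * data room are returned as linear mbufs from mbuf_pool_large. */
	return rte_vhost_dequeue_burst(vid, queue_id, mbuf_pool,
				       mbuf_pool_large, pkts, BURST_SZ);
}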
  

Comments

Flavio Leitner Oct. 1, 2019, 11:10 p.m. UTC | #1
On Tue,  1 Oct 2019 19:19:35 -0300
Flavio Leitner <fbl@sysclose.org> wrote:

> The rte_vhost_dequeue_burst supports two ways of dequeuing data. If
> the data fits into a buffer, then all data is copied and a single
> linear buffer is returned. Otherwise it allocates additional mbufs
> and chains them together to return a multiple segments mbuf.
> 
> While that covers most use cases, it forces applications that need
> to work with larger data sizes to support multiple segments mbufs.
> The non-linear characteristic brings complexity and performance
> implications to the application.
> 
> To resolve the issue, change the API so that the application can
> optionally provide a second mempool containing larger mbufs. If that
> is not provided (NULL), the behavior remains as before the change.
> Otherwise, the data size is checked and the corresponding mempool
> is used to return linear mbufs.
> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>

The use case behind the patch is TSO support in DPDK-accelerated Open
vSwitch (customers are using ovs-dpdk to send big packets too!). The
effort has been going on for many months now. There is a recent
proposal on the OvS dev mailing list using the API as it is today:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-September/362573.html

There is also a reply from me in that thread with the patches
implementing a proof of concept with the proposed API change:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-September/362881.html

I opted to improve the current API instead of adding a new one, but I
have no strong opinion either way.

I know it's a bit late for v19.11, but since the TSO support effort has
been going on for months and OvS uses only LTS releases, I wonder if
this can be part of v19.11.

Thanks,
fbl
  
Shahaf Shuler Oct. 2, 2019, 4:45 a.m. UTC | #2
Wednesday, October 2, 2019 1:20 AM, Flavio Leitner:
> Subject: [dpdk-dev] [PATCH] vhost: add support to large linear mbufs
> 
> The rte_vhost_dequeue_burst supports two ways of dequeuing data. If the
> data fits into a buffer, then all data is copied and a single linear buffer is
> returned. Otherwise it allocates additional mbufs and chains them together
> to return a multiple segments mbuf.
> 
> While that covers most use cases, it forces applications that need to work
> with larger data sizes to support multiple segments mbufs.
> The non-linear characteristic brings complexity and performance implications
> to the application.
> 
> To resolve the issue, change the API so that the application can optionally
> provide a second mempool containing larger mbufs. If that is not provided
> (NULL), the behavior remains as before the change.
> Otherwise, the data size is checked and the corresponding mempool is used
> to return linear mbufs.

I understand the motivation.
However, providing a static pool with large buffers is not so efficient in terms of memory footprint. You will need to prepare for the worst case (all packets are large) with a max size of 64KB.
Also, the two mempools are quite restrictive, as the memory fill of the mbufs might be very sparse. E.g. mempool1 mbuf.size = 1.5K, mempool2 mbuf.size = 64K, packet size 4KB.

Instead, how about using the mbuf external buffer feature?
The flow will be:
1. vhost PMD always receives a single mempool (like today)
2. on dequeue, the PMD looks at the virtio packet size. If it is smaller than the mbuf size, use the mbuf as is (like today)
3. otherwise, allocate a new buffer (inside the PMD) and link it to the mbuf as an external buffer (rte_pktmbuf_attach_extbuf); see the sketch below

The pros of this approach are that you have full flexibility on the memory allocation, and therefore a lower footprint.
The con is that OVS will need to know how to handle mbufs with external buffers (not too complex IMO).
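
A rough sketch of step 3, for illustration only: the rte_malloc-based
allocation, the free callback and the sizing are assumptions about how a
PMD could attach an external buffer on demand; this is neither code from
the mlx5 PMD nor from this patch.

#include <rte_common.h>
#include <rte_malloc.h>
#include <rte_mbuf.h>

static void
ext_buf_free_cb(void *addr, void *opaque __rte_unused)
{
	rte_free(addr);
}

/* Allocate a linear external buffer on demand and attach it to an mbuf
 * taken from the regular rxq mempool. Overflow of the 16-bit buffer
 * length is not handled in this sketch. */
static struct rte_mbuf *
alloc_large_linear_mbuf(struct rte_mempool *mp, uint16_t data_len)
{
	struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
	uint16_t buf_len = RTE_PKTMBUF_HEADROOM + data_len +
			   sizeof(struct rte_mbuf_ext_shared_info);
	struct rte_mbuf_ext_shared_info *shinfo;
	void *buf;

	if (m == NULL)
		return NULL;

	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
	if (buf == NULL) {
		rte_pktmbuf_free(m);
		return NULL;
	}

	/* The helper carves the shared info out of the buffer tail and
	 * adjusts buf_len accordingly. */
	shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
						    ext_buf_free_cb, NULL);
	if (shinfo == NULL) {
		rte_free(buf);
		rte_pktmbuf_free(m);
		return NULL;
	}

	rte_pktmbuf_attach_extbuf(m, buf, rte_malloc_virt2iova(buf),
				  buf_len, shinfo);
	rte_pktmbuf_reset_headroom(m);
	return m;
}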

> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> ---
>  drivers/net/vhost/rte_eth_vhost.c |  4 +--
>  examples/tep_termination/main.c   |  2 +-
>  examples/vhost/main.c             |  2 +-
>  lib/librte_vhost/rte_vhost.h      |  5 ++-
>  lib/librte_vhost/virtio_net.c     | 57 +++++++++++++++++++++++--------
>  5 files changed, 50 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
> index 46f01a7f4..ce7f68a5b 100644
> --- a/drivers/net/vhost/rte_eth_vhost.c
> +++ b/drivers/net/vhost/rte_eth_vhost.c
> @@ -393,8 +393,8 @@ eth_vhost_rx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs)
>  						 VHOST_MAX_PKT_BURST);
> 
>  		nb_pkts = rte_vhost_dequeue_burst(r->vid, r->virtqueue_id,
> -						  r->mb_pool, &bufs[nb_rx],
> -						  num);
> +						  r->mb_pool, NULL,
> +						  &bufs[nb_rx], num);
> 
>  		nb_rx += nb_pkts;
>  		nb_receive -= nb_pkts;
> diff --git a/examples/tep_termination/main.c b/examples/tep_termination/main.c
> index ab956ad7c..3ebf0fa6e 100644
> --- a/examples/tep_termination/main.c
> +++ b/examples/tep_termination/main.c
> @@ -697,7 +697,7 @@ switch_worker(__rte_unused void *arg)
>  			if (likely(!vdev->remove)) {
>  				/* Handle guest TX*/
>  				tx_count = rte_vhost_dequeue_burst(vdev->vid,
> -						VIRTIO_TXQ, mbuf_pool,
> +						VIRTIO_TXQ, mbuf_pool, NULL,
>  						pkts_burst, MAX_PKT_BURST);
>  				/* If this is the first received packet we need to learn the MAC */
>  				if (unlikely(vdev->ready == DEVICE_MAC_LEARNING) && tx_count) {
> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
> index ab649bf14..e9b306af3 100644
> --- a/examples/vhost/main.c
> +++ b/examples/vhost/main.c
> @@ -1092,7 +1092,7 @@ drain_virtio_tx(struct vhost_dev *vdev)
>  					pkts, MAX_PKT_BURST);
>  	} else {
>  		count = rte_vhost_dequeue_burst(vdev->vid, VIRTIO_TXQ,
> -					mbuf_pool, pkts, MAX_PKT_BURST);
> +					mbuf_pool, NULL, pkts, MAX_PKT_BURST);
>  	}
> 
>  	/* setup VMDq for the first packet */
> diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> index 19474bca0..b05fd8e2a 100644
> --- a/lib/librte_vhost/rte_vhost.h
> +++ b/lib/librte_vhost/rte_vhost.h
> @@ -593,6 +593,8 @@ uint16_t rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
>   *  virtio queue index in mq case
>   * @param mbuf_pool
>   *  mbuf_pool where host mbuf is allocated.
> + * @param mbuf_pool_large
> + *  mbuf_pool where larger host mbuf is allocated.
>   * @param pkts
>   *  array to contain packets to be dequeued
>   * @param count
> @@ -601,7 +603,8 @@ uint16_t rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
>   *  num of packets dequeued
>   */
>  uint16_t rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
> -	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count);
> +	struct rte_mempool *mbuf_pool, struct rte_mempool *mbuf_pool_large,
> +	struct rte_mbuf **pkts, uint16_t count);
> 
>  /**
>   * Get guest mem table: a list of memory regions.
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 5b85b832d..da9d77732 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -1291,10 +1291,12 @@ get_zmbuf(struct vhost_virtqueue *vq)
> 
>  static __rte_noinline uint16_t
>  virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
> -	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
> +	struct rte_mempool *mbuf_pool, struct rte_mempool *mbuf_pool_large,
> +	struct rte_mbuf **pkts, uint16_t count)
>  {
>  	uint16_t i;
>  	uint16_t free_entries;
> +	uint16_t mbuf_avail;
> 
>  	if (unlikely(dev->dequeue_zero_copy)) {
>  		struct zcopy_mbuf *zmbuf, *next;
> @@ -1340,32 +1342,42 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	VHOST_LOG_DEBUG(VHOST_DATA, "(%d) about to dequeue %u buffers\n",
>  			dev->vid, count);
> 
> +	/* If the large mpool is provided, find the threshold */
> +	mbuf_avail = 0;
> +	if (mbuf_pool_large)
> +		mbuf_avail = rte_pktmbuf_data_room_size(mbuf_pool) - RTE_PKTMBUF_HEADROOM;
> +
>  	for (i = 0; i < count; i++) {
>  		struct buf_vector buf_vec[BUF_VECTOR_MAX];
>  		uint16_t head_idx;
> -		uint32_t dummy_len;
> +		uint32_t buf_len;
>  		uint16_t nr_vec = 0;
> +		struct rte_mempool *mpool;
>  		int err;
> 
>  		if (unlikely(fill_vec_buf_split(dev, vq,
>  						vq->last_avail_idx + i,
>  						&nr_vec, buf_vec,
> -						&head_idx, &dummy_len,
> +						&head_idx, &buf_len,
>  						VHOST_ACCESS_RO) < 0))
>  			break;
> 
>  		if (likely(dev->dequeue_zero_copy == 0))
>  			update_shadow_used_ring_split(vq, head_idx, 0);
> 
> -		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
> +		if (mbuf_pool_large && buf_len > mbuf_avail)
> +			mpool = mbuf_pool_large;
> +		else
> +			mpool = mbuf_pool;
> +
> +		pkts[i] = rte_pktmbuf_alloc(mpool);
>  		if (unlikely(pkts[i] == NULL)) {
>  			RTE_LOG(ERR, VHOST_DATA,
>  				"Failed to allocate memory for mbuf.\n");
>  			break;
>  		}
> 
> -		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i],
> -				mbuf_pool);
> +		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i], mpool);
>  		if (unlikely(err)) {
>  			rte_pktmbuf_free(pkts[i]);
>  			break;
> @@ -1411,9 +1423,11 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
> 
>  static __rte_noinline uint16_t
>  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
> -	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
> +	struct rte_mempool *mbuf_pool, struct rte_mempool *mbuf_pool_large,
> +	struct rte_mbuf **pkts, uint16_t count)
>  {
>  	uint16_t i;
> +	uint16_t mbuf_avail;
> 
>  	if (unlikely(dev->dequeue_zero_copy)) {
>  		struct zcopy_mbuf *zmbuf, *next;
> @@ -1448,17 +1462,23 @@ virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	VHOST_LOG_DEBUG(VHOST_DATA, "(%d) about to dequeue %u buffers\n",
>  			dev->vid, count);
> 
> +	/* If the large mpool is provided, find the threshold */
> +	mbuf_avail = 0;
> +	if (mbuf_pool_large)
> +		mbuf_avail = rte_pktmbuf_data_room_size(mbuf_pool) - RTE_PKTMBUF_HEADROOM;
> +
>  	for (i = 0; i < count; i++) {
>  		struct buf_vector buf_vec[BUF_VECTOR_MAX];
>  		uint16_t buf_id;
> -		uint32_t dummy_len;
> +		uint32_t buf_len;
>  		uint16_t desc_count, nr_vec = 0;
> +		struct rte_mempool *mpool;
>  		int err;
> 
>  		if (unlikely(fill_vec_buf_packed(dev, vq,
>  						vq->last_avail_idx, &desc_count,
>  						buf_vec, &nr_vec,
> -						&buf_id, &dummy_len,
> +						&buf_id, &buf_len,
>  						VHOST_ACCESS_RO) < 0))
>  			break;
> 
> @@ -1466,15 +1486,19 @@ virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  			update_shadow_used_ring_packed(vq, buf_id, 0,
>  					desc_count);
> 
> -		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
> +		if (mbuf_pool_large && buf_len > mbuf_avail)
> +			mpool = mbuf_pool_large;
> +		else
> +			mpool = mbuf_pool;
> +
> +		pkts[i] = rte_pktmbuf_alloc(mpool);
>  		if (unlikely(pkts[i] == NULL)) {
>  			RTE_LOG(ERR, VHOST_DATA,
>  				"Failed to allocate memory for mbuf.\n");
>  			break;
>  		}
> 
> -		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i],
> -				mbuf_pool);
> +		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i], mpool);
>  		if (unlikely(err)) {
>  			rte_pktmbuf_free(pkts[i]);
>  			break;
> @@ -1526,7 +1550,8 @@ virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
> 
>  uint16_t
>  rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
> -	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
> +	struct rte_mempool *mbuf_pool, struct rte_mempool *mbuf_pool_large,
> +	struct rte_mbuf **pkts, uint16_t count)
>  {
>  	struct virtio_net *dev;
>  	struct rte_mbuf *rarp_mbuf = NULL;
> @@ -1598,9 +1623,11 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>  	}
> 
>  	if (vq_is_packed(dev))
> -		count = virtio_dev_tx_packed(dev, vq, mbuf_pool, pkts, count);
> +		count = virtio_dev_tx_packed(dev, vq, mbuf_pool, mbuf_pool_large, pkts,
> +                                     count);
>  	else
> -		count = virtio_dev_tx_split(dev, vq, mbuf_pool, pkts, count);
> +		count = virtio_dev_tx_split(dev, vq, mbuf_pool, mbuf_pool_large, pkts,
> +                                    count);
> 
>  out:
>  	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
> --
> 2.20.1
  
Maxime Coquelin Oct. 2, 2019, 7:51 a.m. UTC | #3
Hi Flavio,

On 10/2/19 12:19 AM, Flavio Leitner wrote:
> The rte_vhost_dequeue_burst supports two ways of dequeuing data. If
> the data fits into a buffer, then all data is copied and a single
> linear buffer is returned. Otherwise it allocates additional mbufs
> and chains them together to return a multiple segments mbuf.
> 
> While that covers most use cases, it forces applications that need
> to work with larger data sizes to support multiple segments mbufs.
> The non-linear characteristic brings complexity and performance
> implications to the application.
> 
> To resolve the issue, change the API so that the application can
> optionally provide a second mempool containing larger mbufs. If that
> is not provided (NULL), the behavior remains as before the change.
> Otherwise, the data size is checked and the corresponding mempool
> is used to return linear mbufs.

It is not necessary to break the API (doing so would require a
deprecation notice); instead, you could create a new one:

static inline uint16_t
vhost_dequeue_burst(int vid, uint16_t queue_id,
		struct rte_mempool *mbuf_pool,
		struct rte_mempool *mbuf_pool_large,
		struct rte_mbuf **pkts, uint16_t count);

uint16_t rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
				struct rte_mempool *mbuf_pool,
				struct rte_mbuf **pkts, uint16_t count)
{
	return vhost_dequeue_burst(vid, queue_id, mbuf_pool, NULL,
				pkts, count);
}

uint16_t rte_vhost_dequeue_burst2(int vid, uint16_t queue_id,
				struct rte_mempool *mbuf_pool,
				struct rte_mempool *mbuf_pool_large,
				struct rte_mbuf **pkts, uint16_t count)
{
	return vhost_dequeue_burst(vid, queue_id, mbuf_pool,
				mbuf_pool_large,
				pkts, count);
}



Yeah, the name isn't very creative; I'm sure you'll have a better idea!

> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> ---
>  drivers/net/vhost/rte_eth_vhost.c |  4 +--
>  examples/tep_termination/main.c   |  2 +-
>  examples/vhost/main.c             |  2 +-
>  lib/librte_vhost/rte_vhost.h      |  5 ++-
>  lib/librte_vhost/virtio_net.c     | 57 +++++++++++++++++++++++--------
>  5 files changed, 50 insertions(+), 20 deletions(-)

Maxime
  
David Marchand Oct. 2, 2019, 8:04 a.m. UTC | #4
Hello Shahaf,

On Wed, Oct 2, 2019 at 6:46 AM Shahaf Shuler <shahafs@mellanox.com> wrote:
>
> Wednesday, October 2, 2019 1:20 AM, Flavio Leitner:
> > Subject: [dpdk-dev] [PATCH] vhost: add support to large linear mbufs
> >
> > The rte_vhost_dequeue_burst supports two ways of dequeuing data. If the
> > data fits into a buffer, then all data is copied and a single linear buffer is
> > returned. Otherwise it allocates additional mbufs and chains them together
> > to return a multiple segments mbuf.
> >
> > While that covers most use cases, it forces applications that need to work
> > with larger data sizes to support multiple segments mbufs.
> > The non-linear characteristic brings complexity and performance implications
> > to the application.
> >
> > To resolve the issue, change the API so that the application can optionally
> > provide a second mempool containing larger mbufs. If that is not provided
> > (NULL), the behavior remains as before the change.
> > Otherwise, the data size is checked and the corresponding mempool is used
> > to return linear mbufs.
>
> I understand the motivation.
> However, providing a static pool w/ large buffers is not so efficient in terms of memory footprint. You will need to prepare to worst case (all packet are large) w/ max size of 64KB.
> Also, the two mempools are quite restrictive as the memory fill of the mbufs might be very sparse. E.g. mempool1 mbuf.size = 1.5K , mempool2 mbuf.size = 64K, packet size 4KB.
>
> Instead, how about using the mbuf external buffer feature?
> The flow will be:
> 1. vhost PMD always receive a single mempool (like today)
> 2. on dequeue, PMD looks on the virtio packet size. If smaller than the mbuf size use the mbuf as is (like today)
> 3. otherwise, allocate a new buffer (inside the PMD) and link it to the mbuf as external buffer (rte_pktmbuf_attach_extbuf)

I am missing some piece here.
Which pool would the PMD take those external buffers from?

If it is from an additional mempool passed to the vhost PMD, I can't
see the difference from Flavio's proposal.


> The pros of this approach is that you have full flexibility on the memory allocation, and therefore a lower footprint.
> The cons is the OVS will need to know how to handle mbuf w/ external buffers (not too complex IMO).
  
Shahaf Shuler Oct. 2, 2019, 9 a.m. UTC | #5
Wednesday, October 2, 2019 11:05 AM, David Marchand:
> Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large linear mbufs
> 
> Hello Shahaf,
> 
> On Wed, Oct 2, 2019 at 6:46 AM Shahaf Shuler <shahafs@mellanox.com>
> wrote:
> >
> > Wednesday, October 2, 2019 1:20 AM, Flavio Leitner:
> > > Subject: [dpdk-dev] [PATCH] vhost: add support to large linear mbufs
> > >
> > > The rte_vhost_dequeue_burst supports two ways of dequeuing data. If
> > > the data fits into a buffer, then all data is copied and a single
> > > linear buffer is returned. Otherwise it allocates additional mbufs
> > > and chains them together to return a multiple segments mbuf.
> > >
> > > While that covers most use cases, it forces applications that need
> > > to work with larger data sizes to support multiple segments mbufs.
> > > The non-linear characteristic brings complexity and performance
> > > implications to the application.
> > >
> > > To resolve the issue, change the API so that the application can
> > > optionally provide a second mempool containing larger mbufs. If that
> > > is not provided (NULL), the behavior remains as before the change.
> > > Otherwise, the data size is checked and the corresponding mempool is
> > > used to return linear mbufs.
> >
> > I understand the motivation.
> > However, providing a static pool w/ large buffers is not so efficient in terms
> of memory footprint. You will need to prepare to worst case (all packet are
> large) w/ max size of 64KB.
> > Also, the two mempools are quite restrictive as the memory fill of the
> mbufs might be very sparse. E.g. mempool1 mbuf.size = 1.5K , mempool2
> mbuf.size = 64K, packet size 4KB.
> >
> > Instead, how about using the mbuf external buffer feature?
> > The flow will be:
> > 1. vhost PMD always receive a single mempool (like today) 2. on
> > dequeue, PMD looks on the virtio packet size. If smaller than the mbuf
> > size use the mbuf as is (like today) 3. otherwise, allocate a new
> > buffer (inside the PMD) and link it to the mbuf as external buffer
> > (rte_pktmbuf_attach_extbuf)
> 
> I am missing some piece here.
> Which pool would the PMD take those external buffers from?

The mbuf is always taken from the single mempool associated with the rxq.
The buffer for the mbuf may be allocated (in case the virtio payload is bigger than the current mbuf size) from DPDK hugepages or any other system memory and attached to the mbuf.

You can see an example implementation of this in the mlx5 PMD (check out the rte_pktmbuf_attach_extbuf call).

> 
> If it is from an additional mempool passed to the vhost pmd, I can't see the
> difference with Flavio proposal.
> 
> 
> > The pros of this approach is that you have full flexibility on the memory
> allocation, and therefore a lower footprint.
> > The cons is the OVS will need to know how to handle mbuf w/ external
> buffers (not too complex IMO).
> 
> 
> --
> David Marchand
  
Flavio Leitner Oct. 2, 2019, 12:58 p.m. UTC | #6
Hi Shahaf,

Thanks for looking into this, see my inline comments.

On Wed, 2 Oct 2019 09:00:11 +0000
Shahaf Shuler <shahafs@mellanox.com> wrote:

> Wednesday, October 2, 2019 11:05 AM, David Marchand:
> > Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large linear
> > mbufs
> > 
> > Hello Shahaf,
> > 
> > On Wed, Oct 2, 2019 at 6:46 AM Shahaf Shuler <shahafs@mellanox.com>
> > wrote:  
> > >
> > > Wednesday, October 2, 2019 1:20 AM, Flavio Leitner:  
> > > > Subject: [dpdk-dev] [PATCH] vhost: add support to large linear
> > > > mbufs
> > > >
> > > > The rte_vhost_dequeue_burst supports two ways of dequeuing
> > > > data. If the data fits into a buffer, then all data is copied
> > > > and a single linear buffer is returned. Otherwise it allocates
> > > > additional mbufs and chains them together to return a multiple
> > > > segments mbuf.
> > > >
> > > > While that covers most use cases, it forces applications that
> > > > need to work with larger data sizes to support multiple
> > > > segments mbufs. The non-linear characteristic brings complexity
> > > > and performance implications to the application.
> > > >
> > > > To resolve the issue, change the API so that the application can
> > > > optionally provide a second mempool containing larger mbufs. If
> > > > that is not provided (NULL), the behavior remains as before the
> > > > change. Otherwise, the data size is checked and the
> > > > corresponding mempool is used to return linear mbufs.  
> > >
> > > I understand the motivation.
> > > However, providing a static pool w/ large buffers is not so
> > > efficient in terms  
> > of memory footprint. You will need to prepare to worst case (all
> > packet are large) w/ max size of 64KB.  
> > > Also, the two mempools are quite restrictive as the memory fill
> > > of the  
> > mbufs might be very sparse. E.g. mempool1 mbuf.size = 1.5K ,
> > mempool2 mbuf.size = 64K, packet size 4KB.  
> > >
> > > Instead, how about using the mbuf external buffer feature?
> > > The flow will be:
> > > 1. vhost PMD always receive a single mempool (like today) 2. on
> > > dequeue, PMD looks on the virtio packet size. If smaller than the
> > > mbuf size use the mbuf as is (like today) 3. otherwise, allocate
> > > a new buffer (inside the PMD) and link it to the mbuf as external
> > > buffer (rte_pktmbuf_attach_extbuf)  
> > 
> > I am missing some piece here.
> > Which pool would the PMD take those external buffers from?  
> 
> The mbuf is always taken from the single mempool associated w/ the
> rxq. The buffer for the mbuf may be allocated (in case virtio payload
> is bigger than current mbuf size) from DPDK hugepages or any other
> system memory and be attached to the mbuf.
> 
> You can see example implementation of it in mlx5 PMD (checkout
> rte_pktmbuf_attach_extbuf call)

Thanks, I wasn't aware of external buffers.

I see that attaching external buffers of the correct size would be more
efficient in terms of saving memory and avoiding sparse usage.

However, we still need to be prepared for the worst-case scenario (all
packets 64K), so that doesn't help with the total memory required.

The current patch pushes the decision to the application, which knows
the workload better. If more memory is available, it can optionally use
large buffers, otherwise it just doesn't pass them. It can even decide
whether to share the same 64K mempool between multiple vhost ports or
use one mempool per port.

Perhaps I missed something, but managing memory with a mempool still
requires us to have buffers of 64K regardless of whether the data
consumes less space. Otherwise the application or the PMD will have to
manage memory itself.

If we let the PMD manage the memory, what happens if a port/queue
is closed and one or more buffers are still in use (switching)? I don't
see how to solve this cleanly.

fbl

> 
> > 
> > If it is from an additional mempool passed to the vhost pmd, I
> > can't see the difference with Flavio proposal.
> > 
> >   
> > > The pros of this approach is that you have full flexibility on
> > > the memory  
> > allocation, and therefore a lower footprint.  
> > > The cons is the OVS will need to know how to handle mbuf w/
> > > external  
> > buffers (not too complex IMO).
> > 
> > 
> > --
> > David Marchand
  
Shahaf Shuler Oct. 2, 2019, 5:50 p.m. UTC | #7
Wednesday, October 2, 2019 3:59 PM, Flavio Leitner:
> Obrembski MichalX <michalx.obrembski@intel.com>; Stokes Ian
> <ian.stokes@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large linear mbufs
> 
> 
> Hi Shahaf,
> 
> Thanks for looking into this, see my inline comments.
> 
> On Wed, 2 Oct 2019 09:00:11 +0000
> Shahaf Shuler <shahafs@mellanox.com> wrote:
> 
> > Wednesday, October 2, 2019 11:05 AM, David Marchand:
> > > Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large linear
> > > mbufs
> > >
> > > Hello Shahaf,
> > >
> > > On Wed, Oct 2, 2019 at 6:46 AM Shahaf Shuler <shahafs@mellanox.com>
> > > wrote:
> > > >

[...]

> > >
> > > I am missing some piece here.
> > > Which pool would the PMD take those external buffers from?
> >
> > The mbuf is always taken from the single mempool associated w/ the
> > rxq. The buffer for the mbuf may be allocated (in case virtio payload
> > is bigger than current mbuf size) from DPDK hugepages or any other
> > system memory and be attached to the mbuf.
> >
> > You can see example implementation of it in mlx5 PMD (checkout
> > rte_pktmbuf_attach_extbuf call)
> 
> Thanks, I wasn't aware of external buffers.
> 
> I see that attaching external buffers of the correct size would be more
> efficient in terms of saving memory/avoiding sparsing.
> 
> However, we still need to be prepared to the worse case scenario (all
> packets 64K), so that doesn't help with the total memory required.

I am not sure why.
The allocation can be on demand, that is, only when you encounter a large buffer.

Having buffers allocated in advance only saves the cost of the rte_*malloc. However, on such big buffers, and furthermore with device offloads like TSO, I am not sure that is an issue.

> 
> The current patch pushes the decision to the application which knows better
> the workload. If more memory is available, it can optionally use large buffers,
> otherwise just don't pass that. Or even decide whether to share the same
> 64K mempool between multiple vhost ports or use one mempool per port.
> 
> Perhaps I missed something, but managing memory with mempool still
> require us to have buffers of 64K regardless if the data consumes less space.
> Otherwise the application or the PMD will have to manage memory itself.
> 
> If we let the PMD manages the memory, what happens if a port/queue is
> closed and one or more buffers are still in use (switching)? I don't see how to
> solve this cleanly.

Closing the device should return EBUSY until all buffers are free.
What is the use case for closing a port while packets are still pending on another port of the switch? And why can't we wait for them to complete transmission?

> 
> fbl
> 
> >
> > >
> > > If it is from an additional mempool passed to the vhost pmd, I can't
> > > see the difference with Flavio proposal.
> > >
> > >
> > > > The pros of this approach is that you have full flexibility on the
> > > > memory
> > > allocation, and therefore a lower footprint.
> > > > The cons is the OVS will need to know how to handle mbuf w/
> > > > external
> > > buffers (not too complex IMO).
> > >
> > >
> > > --
> > > David Marchand
  
Flavio Leitner Oct. 2, 2019, 6:15 p.m. UTC | #8
On Wed, 2 Oct 2019 17:50:41 +0000
Shahaf Shuler <shahafs@mellanox.com> wrote:

> Wednesday, October 2, 2019 3:59 PM, Flavio Leitner:
> > Obrembski MichalX <michalx.obrembski@intel.com>; Stokes Ian
> > <ian.stokes@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large linear
> > mbufs
> > 
> > 
> > Hi Shahaf,
> > 
> > Thanks for looking into this, see my inline comments.
> > 
> > On Wed, 2 Oct 2019 09:00:11 +0000
> > Shahaf Shuler <shahafs@mellanox.com> wrote:
> >   
> > > Wednesday, October 2, 2019 11:05 AM, David Marchand:  
> > > > Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large
> > > > linear mbufs
> > > >
> > > > Hello Shahaf,
> > > >
> > > > On Wed, Oct 2, 2019 at 6:46 AM Shahaf Shuler
> > > > <shahafs@mellanox.com> wrote:  
> > > > >  
> 
> [...]
> 
> > > >
> > > > I am missing some piece here.
> > > > Which pool would the PMD take those external buffers from?  
> > >
> > > The mbuf is always taken from the single mempool associated w/ the
> > > rxq. The buffer for the mbuf may be allocated (in case virtio
> > > payload is bigger than current mbuf size) from DPDK hugepages or
> > > any other system memory and be attached to the mbuf.
> > >
> > > You can see example implementation of it in mlx5 PMD (checkout
> > > rte_pktmbuf_attach_extbuf call)  
> > 
> > Thanks, I wasn't aware of external buffers.
> > 
> > I see that attaching external buffers of the correct size would be
> > more efficient in terms of saving memory/avoiding sparsing.
> > 
> > However, we still need to be prepared to the worse case scenario
> > (all packets 64K), so that doesn't help with the total memory
> > required.  
> 
> Am not sure why. 
> The allocation can be per demand. That is - only when you encounter a
> large buffer. 
> 
> Having buffer allocated in advance will benefit only from removing
> the cost of the rte_*malloc. However on such big buffers, and further
> more w/ device offloads like TSO, am not sure that is an issue. 

Now I see what you're saying. I was thinking we had to reserve the
memory up front, like a mempool does, and then get the buffers as needed.

OK, I can give rte_*malloc a try and see how it goes.

> > The current patch pushes the decision to the application which
> > knows better the workload. If more memory is available, it can
> > optionally use large buffers, otherwise just don't pass that. Or
> > even decide whether to share the same 64K mempool between multiple
> > vhost ports or use one mempool per port.
> > 
> > Perhaps I missed something, but managing memory with mempool still
> > require us to have buffers of 64K regardless if the data consumes
> > less space. Otherwise the application or the PMD will have to
> > manage memory itself.
> > 
> > If we let the PMD manages the memory, what happens if a port/queue
> > is closed and one or more buffers are still in use (switching)? I
> > don't see how to solve this cleanly.  
> 
> Closing of the dev should return EBUSY till all buffers are free. 
> What is the use case of closing a port while still having packet
> pending on other port of the switch? And why we cannot wait for them
> to complete transmission?

The vswitch gets the request from outside and the assumption is that
the command will succeed. AFAIK, there is no retry mechanism.

Thanks Shahaf!
fbl
  
Ilya Maximets Oct. 3, 2019, 4:57 p.m. UTC | #9
On 02.10.2019 20:15, Flavio Leitner wrote:
> On Wed, 2 Oct 2019 17:50:41 +0000
> Shahaf Shuler <shahafs@mellanox.com> wrote:
> 
>> Wednesday, October 2, 2019 3:59 PM, Flavio Leitner:
>>> Obrembski MichalX <michalx.obrembski@intel.com>; Stokes Ian
>>> <ian.stokes@intel.com>
>>> Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large linear
>>> mbufs
>>>
>>>
>>> Hi Shahaf,
>>>
>>> Thanks for looking into this, see my inline comments.
>>>
>>> On Wed, 2 Oct 2019 09:00:11 +0000
>>> Shahaf Shuler <shahafs@mellanox.com> wrote:
>>>    
>>>> Wednesday, October 2, 2019 11:05 AM, David Marchand:
>>>>> Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large
>>>>> linear mbufs
>>>>>
>>>>> Hello Shahaf,
>>>>>
>>>>> On Wed, Oct 2, 2019 at 6:46 AM Shahaf Shuler
>>>>> <shahafs@mellanox.com> wrote:
>>>>>>   
>>
>> [...]
>>
>>>>>
>>>>> I am missing some piece here.
>>>>> Which pool would the PMD take those external buffers from?
>>>>
>>>> The mbuf is always taken from the single mempool associated w/ the
>>>> rxq. The buffer for the mbuf may be allocated (in case virtio
>>>> payload is bigger than current mbuf size) from DPDK hugepages or
>>>> any other system memory and be attached to the mbuf.
>>>>
>>>> You can see example implementation of it in mlx5 PMD (checkout
>>>> rte_pktmbuf_attach_extbuf call)
>>>
>>> Thanks, I wasn't aware of external buffers.
>>>
>>> I see that attaching external buffers of the correct size would be
>>> more efficient in terms of saving memory/avoiding sparsing.
>>>
>>> However, we still need to be prepared to the worse case scenario
>>> (all packets 64K), so that doesn't help with the total memory
>>> required.
>>
>> Am not sure why.
>> The allocation can be per demand. That is - only when you encounter a
>> large buffer.
>>
>> Having buffer allocated in advance will benefit only from removing
>> the cost of the rte_*malloc. However on such big buffers, and further
>> more w/ device offloads like TSO, am not sure that is an issue.
> 
> Now I see what you're saying. I was thinking we had to reserve the
> memory before, like mempool does, then get the buffers as needed.
> 
> OK, I can give a try with rte_*malloc and see how it goes.

This way we could actually have a nice API.  For example, by
introducing some new flag RTE_VHOST_USER_NO_CHAINED_MBUFS (there
might be a better name) which could be passed to driver_register().
On receive, depending on this flag, the function will either create
chained mbufs or allocate a new contiguous memory chunk and attach
it as an external buffer if the data cannot be stored in a single
mbuf from the registered memory pool. A usage sketch follows below.
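
For illustration only: a sketch of how an application could pass such a
flag at registration time. RTE_VHOST_USER_NO_CHAINED_MBUFS is only a name
proposed in this thread and does not exist in DPDK; the socket path is
made up as well.

#include <rte_vhost.h>

static int
register_vhost_port(void)
{
	/* Hypothetical flag from this thread, combined with an existing one. */
	uint64_t flags = RTE_VHOST_USER_CLIENT |
			 RTE_VHOST_USER_NO_CHAINED_MBUFS;

	return rte_vhost_driver_register("/tmp/vhost-user0.sock", flags);
}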

Supporting external memory in mbufs will require some additional
work on the OVS side (e.g. better handling of ol_flags), but
we'll have to do it anyway for the upgrade to DPDK 19.11.

Best regards, Ilya Maximets.

> 
>>> The current patch pushes the decision to the application which
>>> knows better the workload. If more memory is available, it can
>>> optionally use large buffers, otherwise just don't pass that. Or
>>> even decide whether to share the same 64K mempool between multiple
>>> vhost ports or use one mempool per port.
>>>
>>> Perhaps I missed something, but managing memory with mempool still
>>> require us to have buffers of 64K regardless if the data consumes
>>> less space. Otherwise the application or the PMD will have to
>>> manage memory itself.
>>>
>>> If we let the PMD manages the memory, what happens if a port/queue
>>> is closed and one or more buffers are still in use (switching)? I
>>> don't see how to solve this cleanly.
>>
>> Closing of the dev should return EBUSY till all buffers are free.
>> What is the use case of closing a port while still having packet
>> pending on other port of the switch? And why we cannot wait for them
>> to complete transmission?
> 
> The vswitch gets the request from outside and the assumption is that
> the command will succeed. AFAIK, there is no retry mechanism.
> 
> Thanks Shahaf!
> fbl
  
Flavio Leitner Oct. 3, 2019, 9:25 p.m. UTC | #10
On Thu, 3 Oct 2019 18:57:32 +0200
Ilya Maximets <i.maximets@ovn.org> wrote:

> On 02.10.2019 20:15, Flavio Leitner wrote:
> > On Wed, 2 Oct 2019 17:50:41 +0000
> > Shahaf Shuler <shahafs@mellanox.com> wrote:
> >   
> >> Wednesday, October 2, 2019 3:59 PM, Flavio Leitner:  
> >>> Obrembski MichalX <michalx.obrembski@intel.com>; Stokes Ian
> >>> <ian.stokes@intel.com>
> >>> Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large linear
> >>> mbufs
> >>>
> >>>
> >>> Hi Shahaf,
> >>>
> >>> Thanks for looking into this, see my inline comments.
> >>>
> >>> On Wed, 2 Oct 2019 09:00:11 +0000
> >>> Shahaf Shuler <shahafs@mellanox.com> wrote:
> >>>      
> >>>> Wednesday, October 2, 2019 11:05 AM, David Marchand:  
> >>>>> Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large
> >>>>> linear mbufs
> >>>>>
> >>>>> Hello Shahaf,
> >>>>>
> >>>>> On Wed, Oct 2, 2019 at 6:46 AM Shahaf Shuler
> >>>>> <shahafs@mellanox.com> wrote:  
> >>>>>>     
> >>
> >> [...]
> >>  
> >>>>>
> >>>>> I am missing some piece here.
> >>>>> Which pool would the PMD take those external buffers from?  
> >>>>
> >>>> The mbuf is always taken from the single mempool associated w/
> >>>> the rxq. The buffer for the mbuf may be allocated (in case virtio
> >>>> payload is bigger than current mbuf size) from DPDK hugepages or
> >>>> any other system memory and be attached to the mbuf.
> >>>>
> >>>> You can see example implementation of it in mlx5 PMD (checkout
> >>>> rte_pktmbuf_attach_extbuf call)  
> >>>
> >>> Thanks, I wasn't aware of external buffers.
> >>>
> >>> I see that attaching external buffers of the correct size would be
> >>> more efficient in terms of saving memory/avoiding sparsing.
> >>>
> >>> However, we still need to be prepared to the worse case scenario
> >>> (all packets 64K), so that doesn't help with the total memory
> >>> required.  
> >>
> >> Am not sure why.
> >> The allocation can be per demand. That is - only when you
> >> encounter a large buffer.
> >>
> >> Having buffer allocated in advance will benefit only from removing
> >> the cost of the rte_*malloc. However on such big buffers, and
> >> further more w/ device offloads like TSO, am not sure that is an
> >> issue.  
> > 
> > Now I see what you're saying. I was thinking we had to reserve the
> > memory before, like mempool does, then get the buffers as needed.
> > 
> > OK, I can give a try with rte_*malloc and see how it goes.  
> 
> This way we actually could have a nice API.  For example, by
> introducing some new flag RTE_VHOST_USER_NO_CHAINED_MBUFS (there
> might be better name) which could be passed to driver_register().
> On receive, depending on this flag, function will create chained
> mbufs or allocate new contiguous memory chunk and attach it as
> an external buffer if the data could not be stored in a single
> mbuf from the registered memory pool.
> 
> Supporting external memory in mbufs will require some additional
> work from the OVS side (e.g. better work with ol_flags), but
> we'll have to do it anyway for upgrade to DPDK 19.11.

Agreed. It looks like rte_malloc is fast enough after all. I have a PoC
running iperf3 from a VM to another bare-metal host using vhost-user in
client mode with TSO enabled:
[...]
[  5] 140.00-141.00 sec  4.60 GBytes  39.5 Gbits/sec    0   1.26 MBytes
[  5] 141.00-142.00 sec  4.65 GBytes  39.9 Gbits/sec    0   1.26 MBytes
[  5] 142.00-143.00 sec  4.65 GBytes  40.0 Gbits/sec    0   1.26 MBytes
[  5] 143.00-144.00 sec  4.65 GBytes  39.9 Gbits/sec    9   1.04 MBytes
[  5] 144.00-145.00 sec  4.59 GBytes  39.4 Gbits/sec    0   1.16 MBytes
[  5] 145.00-146.00 sec  4.58 GBytes  39.3 Gbits/sec    0   1.26 MBytes
[  5] 146.00-147.00 sec  4.48 GBytes  38.5 Gbits/sec  700    973 KBytes
[...]
(The physical link is 40Gbps)

I will clean that up, test more, and post the patches soon.
Thanks!
fbl
  

Patch

diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index 46f01a7f4..ce7f68a5b 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -393,8 +393,8 @@  eth_vhost_rx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs)
 						 VHOST_MAX_PKT_BURST);
 
 		nb_pkts = rte_vhost_dequeue_burst(r->vid, r->virtqueue_id,
-						  r->mb_pool, &bufs[nb_rx],
-						  num);
+						  r->mb_pool, NULL,
+						  &bufs[nb_rx], num);
 
 		nb_rx += nb_pkts;
 		nb_receive -= nb_pkts;
diff --git a/examples/tep_termination/main.c b/examples/tep_termination/main.c
index ab956ad7c..3ebf0fa6e 100644
--- a/examples/tep_termination/main.c
+++ b/examples/tep_termination/main.c
@@ -697,7 +697,7 @@  switch_worker(__rte_unused void *arg)
 			if (likely(!vdev->remove)) {
 				/* Handle guest TX*/
 				tx_count = rte_vhost_dequeue_burst(vdev->vid,
-						VIRTIO_TXQ, mbuf_pool,
+						VIRTIO_TXQ, mbuf_pool, NULL,
 						pkts_burst, MAX_PKT_BURST);
 				/* If this is the first received packet we need to learn the MAC */
 				if (unlikely(vdev->ready == DEVICE_MAC_LEARNING) && tx_count) {
diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index ab649bf14..e9b306af3 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -1092,7 +1092,7 @@  drain_virtio_tx(struct vhost_dev *vdev)
 					pkts, MAX_PKT_BURST);
 	} else {
 		count = rte_vhost_dequeue_burst(vdev->vid, VIRTIO_TXQ,
-					mbuf_pool, pkts, MAX_PKT_BURST);
+					mbuf_pool, NULL, pkts, MAX_PKT_BURST);
 	}
 
 	/* setup VMDq for the first packet */
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index 19474bca0..b05fd8e2a 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -593,6 +593,8 @@  uint16_t rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
  *  virtio queue index in mq case
  * @param mbuf_pool
  *  mbuf_pool where host mbuf is allocated.
+ * @param mbuf_pool_large
+ *  mbuf_pool where larger host mbuf is allocated.
  * @param pkts
  *  array to contain packets to be dequeued
  * @param count
@@ -601,7 +603,8 @@  uint16_t rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
  *  num of packets dequeued
  */
 uint16_t rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
-	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count);
+	struct rte_mempool *mbuf_pool, struct rte_mempool *mbuf_pool_large,
+	struct rte_mbuf **pkts, uint16_t count);
 
 /**
  * Get guest mem table: a list of memory regions.
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5b85b832d..da9d77732 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1291,10 +1291,12 @@  get_zmbuf(struct vhost_virtqueue *vq)
 
 static __rte_noinline uint16_t
 virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
-	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
+	struct rte_mempool *mbuf_pool, struct rte_mempool *mbuf_pool_large,
+	struct rte_mbuf **pkts, uint16_t count)
 {
 	uint16_t i;
 	uint16_t free_entries;
+	uint16_t mbuf_avail;
 
 	if (unlikely(dev->dequeue_zero_copy)) {
 		struct zcopy_mbuf *zmbuf, *next;
@@ -1340,32 +1342,42 @@  virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	VHOST_LOG_DEBUG(VHOST_DATA, "(%d) about to dequeue %u buffers\n",
 			dev->vid, count);
 
+	/* If the large mpool is provided, find the threshold */
+	mbuf_avail = 0;
+	if (mbuf_pool_large)
+		mbuf_avail = rte_pktmbuf_data_room_size(mbuf_pool) - RTE_PKTMBUF_HEADROOM;
+
 	for (i = 0; i < count; i++) {
 		struct buf_vector buf_vec[BUF_VECTOR_MAX];
 		uint16_t head_idx;
-		uint32_t dummy_len;
+		uint32_t buf_len;
 		uint16_t nr_vec = 0;
+		struct rte_mempool *mpool;
 		int err;
 
 		if (unlikely(fill_vec_buf_split(dev, vq,
 						vq->last_avail_idx + i,
 						&nr_vec, buf_vec,
-						&head_idx, &dummy_len,
+						&head_idx, &buf_len,
 						VHOST_ACCESS_RO) < 0))
 			break;
 
 		if (likely(dev->dequeue_zero_copy == 0))
 			update_shadow_used_ring_split(vq, head_idx, 0);
 
-		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
+		if (mbuf_pool_large && buf_len > mbuf_avail)
+			mpool = mbuf_pool_large;
+		else
+			mpool = mbuf_pool;
+
+		pkts[i] = rte_pktmbuf_alloc(mpool);
 		if (unlikely(pkts[i] == NULL)) {
 			RTE_LOG(ERR, VHOST_DATA,
 				"Failed to allocate memory for mbuf.\n");
 			break;
 		}
 
-		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i],
-				mbuf_pool);
+		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i], mpool);
 		if (unlikely(err)) {
 			rte_pktmbuf_free(pkts[i]);
 			break;
@@ -1411,9 +1423,11 @@  virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
 static __rte_noinline uint16_t
 virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
-	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
+	struct rte_mempool *mbuf_pool, struct rte_mempool *mbuf_pool_large,
+	struct rte_mbuf **pkts, uint16_t count)
 {
 	uint16_t i;
+	uint16_t mbuf_avail;
 
 	if (unlikely(dev->dequeue_zero_copy)) {
 		struct zcopy_mbuf *zmbuf, *next;
@@ -1448,17 +1462,23 @@  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	VHOST_LOG_DEBUG(VHOST_DATA, "(%d) about to dequeue %u buffers\n",
 			dev->vid, count);
 
+	/* If the large mpool is provided, find the threshold */
+	mbuf_avail = 0;
+	if (mbuf_pool_large)
+		mbuf_avail = rte_pktmbuf_data_room_size(mbuf_pool) - RTE_PKTMBUF_HEADROOM;
+
 	for (i = 0; i < count; i++) {
 		struct buf_vector buf_vec[BUF_VECTOR_MAX];
 		uint16_t buf_id;
-		uint32_t dummy_len;
+		uint32_t buf_len;
 		uint16_t desc_count, nr_vec = 0;
+		struct rte_mempool *mpool;
 		int err;
 
 		if (unlikely(fill_vec_buf_packed(dev, vq,
 						vq->last_avail_idx, &desc_count,
 						buf_vec, &nr_vec,
-						&buf_id, &dummy_len,
+						&buf_id, &buf_len,
 						VHOST_ACCESS_RO) < 0))
 			break;
 
@@ -1466,15 +1486,19 @@  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			update_shadow_used_ring_packed(vq, buf_id, 0,
 					desc_count);
 
-		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
+		if (mbuf_pool_large && buf_len > mbuf_avail)
+			mpool = mbuf_pool_large;
+		else
+			mpool = mbuf_pool;
+
+		pkts[i] = rte_pktmbuf_alloc(mpool);
 		if (unlikely(pkts[i] == NULL)) {
 			RTE_LOG(ERR, VHOST_DATA,
 				"Failed to allocate memory for mbuf.\n");
 			break;
 		}
 
-		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i],
-				mbuf_pool);
+		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i], mpool);
 		if (unlikely(err)) {
 			rte_pktmbuf_free(pkts[i]);
 			break;
@@ -1526,7 +1550,8 @@  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
 uint16_t
 rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
-	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
+	struct rte_mempool *mbuf_pool, struct rte_mempool *mbuf_pool_large,
+	struct rte_mbuf **pkts, uint16_t count)
 {
 	struct virtio_net *dev;
 	struct rte_mbuf *rarp_mbuf = NULL;
@@ -1598,9 +1623,11 @@  rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	}
 
 	if (vq_is_packed(dev))
-		count = virtio_dev_tx_packed(dev, vq, mbuf_pool, pkts, count);
+		count = virtio_dev_tx_packed(dev, vq, mbuf_pool, mbuf_pool_large, pkts,
+                                     count);
 	else
-		count = virtio_dev_tx_split(dev, vq, mbuf_pool, pkts, count);
+		count = virtio_dev_tx_split(dev, vq, mbuf_pool, mbuf_pool_large, pkts,
+                                    count);
 
 out:
 	if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))