[v5] vhost: add support for large buffers

Message ID 20191015185951.6295-1-fbl@sysclose.org (mailing list archive)
State Accepted, archived
Delegated to: Maxime Coquelin
Series: [v5] vhost: add support for large buffers

Checks

Context                      Check    Description
ci/checkpatch                success  coding style OK
ci/iol-compilation           success  Compile Testing PASS
ci/iol-intel-Performance     success  Performance Testing PASS
ci/Intel-compilation         success  Compilation OK
ci/iol-mellanox-Performance  success  Performance Testing PASS

Commit Message

Flavio Leitner Oct. 15, 2019, 6:59 p.m. UTC
  The rte_vhost_dequeue_burst supports two ways of dequeuing data.
If the data fits into a buffer, then all data is copied and a
single linear buffer is returned. Otherwise it allocates
additional mbufs and chains them together to return a
multi-segment mbuf.

While that covers most use cases, it forces applications that
need to work with larger data sizes to support multi-segment
mbufs. The non-linear characteristic brings complexity and
performance implications to the application.

To resolve the issue, add support for attaching an external
buffer to a pktmbuf and let the host indicate during registration
whether attaching an external buffer to a pktmbuf is supported
and whether only linear buffers are supported.
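
Not part of the patch, but as a minimal sketch of how a host application
might opt in to the new behaviour at registration time (socket path and
error handling are placeholders; the flag names are the ones added below):

#include <rte_vhost.h>

static int
register_vhost_socket(const char *path)
{
	/* Ask vhost for linear mbufs only; if a packet does not fit in the
	 * pktmbuf, an external buffer allocated with rte_malloc() gets
	 * attached instead of chaining additional mbufs.
	 */
	uint64_t flags = RTE_VHOST_USER_EXTBUF_SUPPORT |
			 RTE_VHOST_USER_LINEARBUF_SUPPORT;

	if (rte_vhost_driver_register(path, flags) != 0)
		return -1;

	return rte_vhost_driver_start(path);
}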

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
---
 doc/guides/prog_guide/vhost_lib.rst |  35 +++++++++
 lib/librte_vhost/rte_vhost.h        |   4 +
 lib/librte_vhost/socket.c           |  22 ++++++
 lib/librte_vhost/vhost.c            |  22 ++++++
 lib/librte_vhost/vhost.h            |   4 +
 lib/librte_vhost/virtio_net.c       | 109 ++++++++++++++++++++++++----
 6 files changed, 182 insertions(+), 14 deletions(-)

- Changelog:
  v5:
    - fixed to destroy the mutex if flags are incompatible
  v4:
    - allow using the pktmbuf if there is exactly enough space
    - removed log message if the buffer is too big
    - fixed the length to include align padding
    - free allocated buf if shinfo fails
  v3:
    - prevent the new features from being used with zero copy
    - fixed sizeof() usage
    - fixed log msg indentation
    - removed/replaced asserts
    - used the correct virt2iova function
    - fixed the patch's title
    - OvS PoC code:
      https://github.com/fleitner/ovs/tree/rte_malloc-v3
  v2:
    - Used rte_malloc() instead of another mempool as suggested by Shahaf.
    - Added the documentation section.
    - Using driver registration to negotiate the features.
    - OvS PoC code:
      https://github.com/fleitner/ovs/commit/8fc197c40b1d4fda331686a7b919e9e2b670dda7
  

Comments

Maxime Coquelin Oct. 16, 2019, 10:02 a.m. UTC | #1
Hi Flavio,

On 10/15/19 8:59 PM, Flavio Leitner wrote:
> [commit message, diffstat and changelog quoted in full; snipped]

I went through the patch, and it looks good to me.
I appreciate the fact that there is no rte_vhost_dequeue_burst API
change.

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime
  
Maxime Coquelin Oct. 16, 2019, 11:13 a.m. UTC | #2
On 10/15/19 8:59 PM, Flavio Leitner wrote:
> [commit message, diffstat and changelog quoted in full; snipped]

Applied to dpdk-next-virtio/master.

Thanks,
Maxime
  
Ilya Maximets Oct. 16, 2019, 1:32 p.m. UTC | #3
On 16.10.2019 13:13, Maxime Coquelin wrote:
> 
> 
> On 10/15/19 8:59 PM, Flavio Leitner wrote:
>> [commit message, diffstat and changelog quoted in full; snipped]
> 
> Applied to dpdk-next-virtio/master.

Thanks Maxime,

But can we return back the mbuf allocation failure message?

I mean:

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 66f0c7206..f8af4e0b3 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1354,8 +1354,11 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
  {
  	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
  
-	if (unlikely(pkt == NULL))
+	if (unlikely(pkt == NULL)) {
+		RTE_LOG(ERR, VHOST_DATA,
+			"Failed to allocate memory for mbuf.\n");
  		return NULL;
+	}
  
  	if (rte_pktmbuf_tailroom(pkt) >= data_len)
  		return pkt;
---

It's a hard failure that highlights some significant issues with
memory pool size or a mbuf leak.

We still have the message for subsequent chained mbufs, but not
for the first one.  Without this error message we could never
notice that something is wrong with our memory pool.  Only the
network traffic will stop flowing.
The message was very useful previously while catching the root
cause of the mbuf leak in OVS.

I could send a separate patch for this if needed.

Best regards, Ilya Maximets.
  
Maxime Coquelin Oct. 16, 2019, 1:46 p.m. UTC | #4
On 10/16/19 3:32 PM, Ilya Maximets wrote:
> On 16.10.2019 13:13, Maxime Coquelin wrote:
>>
>>
>> On 10/15/19 8:59 PM, Flavio Leitner wrote:
>>> [commit message, diffstat and changelog quoted in full; snipped]
>>
>> Applied to dpdk-next-virtio/master.
> 
> Thanks Maxime,
> 
> But can we return back the mbuf allocation failure message?

Good point, I missed it.

> I mean:
> 
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 66f0c7206..f8af4e0b3 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -1354,8 +1354,11 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev,
> struct rte_mempool *mp,
>  {
>      struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
>  
> -    if (unlikely(pkt == NULL))
> +    if (unlikely(pkt == NULL)) {
> +        RTE_LOG(ERR, VHOST_DATA,
> +            "Failed to allocate memory for mbuf.\n");
>          return NULL;
> +    }
>  
>      if (rte_pktmbuf_tailroom(pkt) >= data_len)
>          return pkt;
> ---
> 
> It's a hard failure that highlights some significant issues with
> memory pool size or a mbuf leak.

I agree, we need this error message.

> We still have the message for subsequent chained mbufs, but not
> for the first one.  Without this error message we could never
> notice that something is wrong with our memory pool.  Only the
> network traffic will stop flowing.
> The message was very useful previously while catching the root
> cause of the mbuf leak in OVS.
> 
> I could send a separate patch for this if needed.

We need a separate patch.
If you want to do it, please do! Otherwise I'll do it.

Thanks!
Maxime
> Best regards, Ilya Maximets.
  
Flavio Leitner Oct. 16, 2019, 2:02 p.m. UTC | #5
On Wed, 16 Oct 2019 15:46:15 +0200
Maxime Coquelin <maxime.coquelin@redhat.com> wrote:

> On 10/16/19 3:32 PM, Ilya Maximets wrote:
> > On 16.10.2019 13:13, Maxime Coquelin wrote:  
> >>
> >>
> >> On 10/15/19 8:59 PM, Flavio Leitner wrote:  
> >>> [commit message, diffstat and changelog quoted in full; snipped]
> >>
> >> Applied to dpdk-next-virtio/master.  
> > 
> > Thanks Maxime,
> > 
> > But can we return back the mbuf allocation failure message?  
> 
> Good point, I missed it.
> 
> > I mean:
> > 
> > diff --git a/lib/librte_vhost/virtio_net.c
> > b/lib/librte_vhost/virtio_net.c index 66f0c7206..f8af4e0b3 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -1354,8 +1354,11 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> > *dev, struct rte_mempool *mp,
> >  {
> >      struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> >  
> > -    if (unlikely(pkt == NULL))
> > +    if (unlikely(pkt == NULL)) {
> > +        RTE_LOG(ERR, VHOST_DATA,
> > +            "Failed to allocate memory for mbuf.\n");
> >          return NULL;
> > +    }
> >  
> >      if (rte_pktmbuf_tailroom(pkt) >= data_len)
> >          return pkt;
> > ---
> > 
> > It's a hard failure that highlights some significant issues with
> > memory pool size or a mbuf leak.  
> 
> I agree, we need this error message.

If it runs out of memory and there are many packets coming, then it
will flood the logs. Maybe we could use something to report in a more
moderated way, for example:

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index da69ab1db..b1572b3a4 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1354,8 +1354,14 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
 {
 	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
 
-	if (unlikely(pkt == NULL))
+	if (unlikely(pkt == NULL)) {
+		static int rate_lim = 0;
+
+		if (rate_lim++ % 1000 == 0)
+			RTE_LOG(ERR, VHOST_DATA,
+				"Failed to allocate memory for mbuf.\n");
+
 		return NULL;
+	}
 
 	if (rte_pktmbuf_tailroom(pkt) >= data_len)
 		return pkt;


Or track this at mempool level and keep stats of failed requests and
report that in OvS. That would cover other points too.
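
A rough sketch of that mempool-level idea (illustrative only; the function
name and log type below are not from this thread, while
rte_mempool_avail_count() and rte_mempool_in_use_count() are existing DPDK
APIs):

#include <rte_mempool.h>
#include <rte_log.h>

/* Poll the mbuf pool from the application and warn on exhaustion,
 * instead of logging from the vhost dequeue path.
 */
static void
check_mbuf_pool(const struct rte_mempool *mp)
{
	if (rte_mempool_avail_count(mp) == 0)
		RTE_LOG(WARNING, USER1,
			"mempool %s exhausted (%u mbufs in use)\n",
			mp->name, rte_mempool_in_use_count(mp));
}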

fbl
  
Ilya Maximets Oct. 16, 2019, 2:05 p.m. UTC | #6
On 16.10.2019 15:46, Maxime Coquelin wrote:
> 
> 
> On 10/16/19 3:32 PM, Ilya Maximets wrote:
>> On 16.10.2019 13:13, Maxime Coquelin wrote:
>>>
>>>
>>> On 10/15/19 8:59 PM, Flavio Leitner wrote:
>>>> [commit message, diffstat and changelog quoted in full; snipped]
>>>
>>> Applied to dpdk-next-virtio/master.
>>
>> Thanks Maxime,
>>
>> But can we return back the mbuf allocation failure message?
> 
> Good point, I missed it.
> 
>> I mean:
>>
>> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
>> index 66f0c7206..f8af4e0b3 100644
>> --- a/lib/librte_vhost/virtio_net.c
>> +++ b/lib/librte_vhost/virtio_net.c
>> @@ -1354,8 +1354,11 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev,
>> struct rte_mempool *mp,
>>   {
>>       struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
>>   
>> -    if (unlikely(pkt == NULL))
>> +    if (unlikely(pkt == NULL)) {
>> +        RTE_LOG(ERR, VHOST_DATA,
>> +            "Failed to allocate memory for mbuf.\n");
>>           return NULL;
>> +    }
>>   
>>       if (rte_pktmbuf_tailroom(pkt) >= data_len)
>>           return pkt;
>> ---
>>
>> It's a hard failure that highlights some significant issues with
>> memory pool size or a mbuf leak.
> 
> I agree, we need this error message.
> 
>> We still have the message for subsequent chained mbufs, but not
>> for the first one.  Without this error message we could never
>> notice that something is wrong with our memory pool.  Only the
>> network traffic will stop flowing.
>> The message was very useful previously while catching the root
>> cause of the mbuf leak in OVS.
>>
>> I could send a separate patch for this if needed.
> 
> We need a separate patch.
> If you want to do it, please do! Otherwise I'll do it.

OK. I'll send a patch soon.

Best regards, Ilya Maximets.
  
Ilya Maximets Oct. 16, 2019, 2:08 p.m. UTC | #7
On 16.10.2019 16:02, Flavio Leitner wrote:
> On Wed, 16 Oct 2019 15:46:15 +0200
> Maxime Coquelin <maxime.coquelin@redhat.com> wrote:
> 
>> On 10/16/19 3:32 PM, Ilya Maximets wrote:
>>> On 16.10.2019 13:13, Maxime Coquelin wrote:
>>>>
>>>>
>>>> On 10/15/19 8:59 PM, Flavio Leitner wrote:
>>>>> [commit message, diffstat and changelog quoted in full; snipped]
>>>>
>>>> Applied to dpdk-next-virtio/master.
>>>
>>> Thanks Maxime,
>>>
>>> But can we return back the mbuf allocation failure message?
>>
>> Good point, I missed it.
>>
>>> I mean:
>>>
>>> diff --git a/lib/librte_vhost/virtio_net.c
>>> b/lib/librte_vhost/virtio_net.c index 66f0c7206..f8af4e0b3 100644
>>> --- a/lib/librte_vhost/virtio_net.c
>>> +++ b/lib/librte_vhost/virtio_net.c
>>> @@ -1354,8 +1354,11 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
>>> *dev, struct rte_mempool *mp,
>>>   {
>>>       struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
>>>   
>>> -    if (unlikely(pkt == NULL))
>>> +    if (unlikely(pkt == NULL)) {
>>> +        RTE_LOG(ERR, VHOST_DATA,
>>> +            "Failed to allocate memory for mbuf.\n");
>>>           return NULL;
>>> +    }
>>>   
>>>       if (rte_pktmbuf_tailroom(pkt) >= data_len)
>>>           return pkt;
>>> ---
>>>
>>> It's a hard failure that highlights some significant issues with
>>> memory pool size or a mbuf leak.
>>
>> I agree, we need this error message.
> 
> If it runs out of memory and there are many packets coming, then it
> will flood the logs. Maybe we could use something to report in a more
> moderated way, for example:
> 
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index da69ab1db..b1572b3a4 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -1354,8 +1354,14 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
>   {
>   	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
>   
> -	if (unlikely(pkt == NULL))
> +	if (unlikely(pkt == NULL)) {
> +		static int rate_lim = 0;
> +
> +		if (rate_lim++ % 1000 == 0)
> +			RTE_LOG...
> +
>   		return NULL;
> +	}
>   
>   	if (rte_pktmbuf_tailroom(pkt) >= data_len)
>   		return pkt;
> 
> 
> Or track this at mempool level and keep stats of failed requests and
> report that in OvS. That would cover other points too.

OVS is already rate-limiting DPDK logs.  Basically, I introduced this
functionality to limit exactly this message.
See: 9fd38f6867df ("netdev-dpdk: Limit rate of DPDK logs.")

Best regards, Ilya Maximets.
  
Flavio Leitner Oct. 16, 2019, 2:14 p.m. UTC | #8
On Wed, 16 Oct 2019 16:08:54 +0200
Ilya Maximets <i.maximets@ovn.org> wrote:

> On 16.10.2019 16:02, Flavio Leitner wrote:
> > On Wed, 16 Oct 2019 15:46:15 +0200
> > Maxime Coquelin <maxime.coquelin@redhat.com> wrote:
> >   
> >> On 10/16/19 3:32 PM, Ilya Maximets wrote:  
> >>> On 16.10.2019 13:13, Maxime Coquelin wrote:  
> >>>>
> >>>>
> >>>> On 10/15/19 8:59 PM, Flavio Leitner wrote:  
> >>>>> [commit message, diffstat and changelog quoted in full; snipped]
> >>>>
> >>>> Applied to dpdk-next-virtio/master.  
> >>>
> >>> Thanks Maxime,
> >>>
> >>> But can we return back the mbuf allocation failure message?  
> >>
> >> Good point, I missed it.
> >>  
> >>> I mean:
> >>>
> >>> diff --git a/lib/librte_vhost/virtio_net.c
> >>> b/lib/librte_vhost/virtio_net.c index 66f0c7206..f8af4e0b3 100644
> >>> --- a/lib/librte_vhost/virtio_net.c
> >>> +++ b/lib/librte_vhost/virtio_net.c
> >>> @@ -1354,8 +1354,11 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> >>> *dev, struct rte_mempool *mp,
> >>>   {
> >>>       struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> >>>   
> >>> -    if (unlikely(pkt == NULL))
> >>> +    if (unlikely(pkt == NULL)) {
> >>> +        RTE_LOG(ERR, VHOST_DATA,
> >>> +            "Failed to allocate memory for mbuf.\n");
> >>>           return NULL;
> >>> +    }
> >>>   
> >>>       if (rte_pktmbuf_tailroom(pkt) >= data_len)
> >>>           return pkt;
> >>> ---
> >>>
> >>> It's a hard failure that highlights some significant issues with
> >>> memory pool size or a mbuf leak.  
> >>
> >> I agree, we need this error message.  
> > 
> > If it runs out of memory and there are many packets coming, then it
> > will flood the logs. Maybe we could use something to report in a
> > more moderated way, for example:
> > 
> > diff --git a/lib/librte_vhost/virtio_net.c
> > b/lib/librte_vhost/virtio_net.c index da69ab1db..b1572b3a4 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -1354,8 +1354,14 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> > *dev, struct rte_mempool *mp, {
> >   	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> >   
> > -	if (unlikely(pkt == NULL))
> > +	if (unlikely(pkt == NULL)) {
> > +		static int rate_lim = 0;
> > +
> > +		if (rate_lim++ % 1000 == 0)
> > +			RTE_LOG...
> > +
> >   		return NULL;
> > +	}
> >   
> >   	if (rte_pktmbuf_tailroom(pkt) >= data_len)
> >   		return pkt;
> > 
> > 
> > Or track this at mempool level and keep stats of failed requests and
> > report that in OvS. That would cover other points too.  
> 
> OVS is already rate-limiting DPDK logs.  Basically, I introduced this
> functionality to limit exactly this message.
> See: 9fd38f6867df ("netdev-dpdk: Limit rate of DPDK logs.")

Thanks, I missed that commit.
fbl
  
David Marchand Oct. 29, 2019, 9:02 a.m. UTC | #9
On Tue, Oct 15, 2019 at 9:00 PM Flavio Leitner <fbl@sysclose.org> wrote:
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 5b85b832d..da69ab1db 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -1289,6 +1289,93 @@ get_zmbuf(struct vhost_virtqueue *vq)
>         return NULL;
>  }
>
> +static void
> +virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
> +{
> +       rte_free(opaque);
> +}
> +
> +static int
> +virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
> +{
> +       struct rte_mbuf_ext_shared_info *shinfo = NULL;
> +       uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
> +       uint16_t buf_len;
> +       rte_iova_t iova;
> +       void *buf;
> +
> +       /* Try to use pkt buffer to store shinfo to reduce the amount of memory
> +        * required, otherwise store shinfo in the new buffer.
> +        */
> +       if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
> +               shinfo = rte_pktmbuf_mtod(pkt,
> +                                         struct rte_mbuf_ext_shared_info *);
> +       else {
> +               total_len += sizeof(*shinfo) + sizeof(uintptr_t);
> +               total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> +       }
> +
> +       if (unlikely(total_len > UINT16_MAX))
> +               return -ENOSPC;
> +
> +       buf_len = total_len;
> +       buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);

Using rte_malloc() means that the allocation can end up on any numa node.
This external buffer might end up on a different node than the mbuf
(which resides on mp->socket_id node).
  
Flavio Leitner Oct. 29, 2019, 12:21 p.m. UTC | #10
On Tue, 29 Oct 2019 10:02:57 +0100
David Marchand <david.marchand@redhat.com> wrote:

> On Tue, Oct 15, 2019 at 9:00 PM Flavio Leitner <fbl@sysclose.org>
> wrote:
> > diff --git a/lib/librte_vhost/virtio_net.c
> > b/lib/librte_vhost/virtio_net.c index 5b85b832d..da69ab1db 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -1289,6 +1289,93 @@ get_zmbuf(struct vhost_virtqueue *vq)
> >         return NULL;
> >  }
> >
> > +static void
> > +virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
> > +{
> > +       rte_free(opaque);
> > +}
> > +
> > +static int
> > +virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
> > +{
> > +       struct rte_mbuf_ext_shared_info *shinfo = NULL;
> > +       uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
> > +       uint16_t buf_len;
> > +       rte_iova_t iova;
> > +       void *buf;
> > +
> > +       /* Try to use pkt buffer to store shinfo to reduce the
> > amount of memory
> > +        * required, otherwise store shinfo in the new buffer.
> > +        */
> > +       if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
> > +               shinfo = rte_pktmbuf_mtod(pkt,
> > +                                         struct
> > rte_mbuf_ext_shared_info *);
> > +       else {
> > +               total_len += sizeof(*shinfo) + sizeof(uintptr_t);
> > +               total_len = RTE_ALIGN_CEIL(total_len,
> > sizeof(uintptr_t));
> > +       }
> > +
> > +       if (unlikely(total_len > UINT16_MAX))
> > +               return -ENOSPC;
> > +
> > +       buf_len = total_len;
> > +       buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);  
> 
> Using rte_malloc() means that the allocation can end up on any numa
> node. This external buffer might end up on a different node than the
> mbuf (which resides on mp->socket_id node).

Only if there is no memory in the local socket. It seems better than a
failure to me.

fbl
  
David Marchand Oct. 29, 2019, 4:19 p.m. UTC | #11
On Tue, Oct 29, 2019 at 1:21 PM Flavio Leitner <fbl@sysclose.org> wrote:
> On Tue, 29 Oct 2019 10:02:57 +0100
> David Marchand <david.marchand@redhat.com> wrote:
> > Using rte_malloc() means that the allocation can end up on any numa
> > node. This external buffer might end up on a different node than the
> > mbuf (which resides on mp->socket_id node).
>
> Only if there is no memory in the local socket. It seems better than a
> failure to me.

This won't fail, but you don't know how it will impact the application.

We are moving from a model where the vhost library used a mempool with
a fixed number of buffers on a known socket to a model where we can't
guarantee how much memory vhost will eat, and where the allocations
happen.
I am just scared we will get bitten by random allocation failures on a
system that was "used to run fine".


--
David Marchand
  

Patch

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index fc3ee4353..07e40e3c5 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -117,6 +117,41 @@  The following is an overview of some key Vhost API functions:
     Enabling this flag should only be done when the calling application does
     not pre-fault the guest shared memory, otherwise migration would fail.
 
+  - ``RTE_VHOST_USER_LINEARBUF_SUPPORT``
+
+    Enabling this flag forces the vhost dequeue function to only provide linear
+    pktmbufs (no multi-segmented pktmbufs).
+
+    The vhost library by default provides a single pktmbuf for a given
+    packet, but if for some reason the data doesn't fit into a single
+    pktmbuf (e.g., TSO is enabled), the library will allocate additional
+    pktmbufs from the same mempool and chain them together to create a
+    multi-segmented pktmbuf.
+
+    However, the vhost application needs to support the multi-segmented format.
+    If the vhost application does not support that format and requires large
+    buffers to be dequeued, this flag should be enabled to force only linear
+    buffers (see RTE_VHOST_USER_EXTBUF_SUPPORT) or drop the packet.
+
+    It is disabled by default.
+
+  - ``RTE_VHOST_USER_EXTBUF_SUPPORT``
+
+    Enabling this flag allows vhost dequeue function to allocate and attach
+    an external buffer to a pktmbuf if the pktmbuf doesn't provide enough
+    space to store all data.
+
+    This is useful when the vhost application wants to support large packets
+    but doesn't want to increase the default mempool object size nor to
+    support multi-segmented mbufs (non-linear). In this case, a fresh buffer
+    is allocated using rte_malloc() which gets attached to a pktmbuf using
+    rte_pktmbuf_attach_extbuf().
+
+    See RTE_VHOST_USER_LINEARBUF_SUPPORT as well to disable multi-segmented
+    mbufs for applications that don't support chained mbufs.
+
+    It is disabled by default.
+
 * ``rte_vhost_driver_set_features(path, features)``
 
   This function sets the feature bits the vhost-user driver supports. The
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index 19474bca0..b821b5df4 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -30,6 +30,10 @@  extern "C" {
 #define RTE_VHOST_USER_DEQUEUE_ZERO_COPY	(1ULL << 2)
 #define RTE_VHOST_USER_IOMMU_SUPPORT	(1ULL << 3)
 #define RTE_VHOST_USER_POSTCOPY_SUPPORT		(1ULL << 4)
+/* support mbuf with external buffer attached */
+#define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
+/* support only linear buffers (no chained mbufs) */
+#define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
 
 /** Protocol features. */
 #ifndef VHOST_USER_PROTOCOL_F_MQ
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 274988c4d..a8d433c49 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -40,6 +40,8 @@  struct vhost_user_socket {
 	bool dequeue_zero_copy;
 	bool iommu_support;
 	bool use_builtin_virtio_net;
+	bool extbuf;
+	bool linearbuf;
 
 	/*
 	 * The "supported_features" indicates the feature bits the
@@ -232,6 +234,12 @@  vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 	if (vsocket->dequeue_zero_copy)
 		vhost_enable_dequeue_zero_copy(vid);
 
+	if (vsocket->extbuf)
+		vhost_enable_extbuf(vid);
+
+	if (vsocket->linearbuf)
+		vhost_enable_linearbuf(vid);
+
 	RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);
 
 	if (vsocket->notify_ops->new_connection) {
@@ -870,6 +878,8 @@  rte_vhost_driver_register(const char *path, uint64_t flags)
 		goto out_free;
 	}
 	vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
+	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
+	vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
 
 	/*
 	 * Set the supported features correctly for the builtin vhost-user
@@ -894,6 +904,18 @@  rte_vhost_driver_register(const char *path, uint64_t flags)
 	 * not compatible with postcopy.
 	 */
 	if (vsocket->dequeue_zero_copy) {
+		if (vsocket->extbuf) {
+			RTE_LOG(ERR, VHOST_CONFIG,
+			"error: zero copy is incompatible with external buffers\n");
+			ret = -1;
+			goto out_mutex;
+		}
+		if (vsocket->linearbuf) {
+			RTE_LOG(ERR, VHOST_CONFIG,
+			"error: zero copy is incompatible with linear buffers\n");
+			ret = -1;
+			goto out_mutex;
+		}
 		vsocket->supported_features &= ~(1ULL << VIRTIO_F_IN_ORDER);
 		vsocket->features &= ~(1ULL << VIRTIO_F_IN_ORDER);
 
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index cea44df8c..77457f538 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -605,6 +605,28 @@  vhost_set_builtin_virtio_net(int vid, bool enable)
 		dev->flags &= ~VIRTIO_DEV_BUILTIN_VIRTIO_NET;
 }
 
+void
+vhost_enable_extbuf(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->extbuf = 1;
+}
+
+void
+vhost_enable_linearbuf(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->linearbuf = 1;
+}
+
 int
 rte_vhost_get_mtu(int vid, uint16_t *mtu)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 5131a97a3..0346bd118 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -302,6 +302,8 @@  struct virtio_net {
 	rte_atomic16_t		broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
+	int			extbuf;
+	int			linearbuf;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
 	char			ifname[IF_NAME_SZ];
@@ -476,6 +478,8 @@  void vhost_attach_vdpa_device(int vid, int did);
 void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
 void vhost_enable_dequeue_zero_copy(int vid);
 void vhost_set_builtin_virtio_net(int vid, bool enable);
+void vhost_enable_extbuf(int vid);
+void vhost_enable_linearbuf(int vid);
 
 struct vhost_device_ops const *vhost_driver_callback_get(const char *path);
 
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5b85b832d..da69ab1db 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1289,6 +1289,93 @@  get_zmbuf(struct vhost_virtqueue *vq)
 	return NULL;
 }
 
+static void
+virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
+{
+	rte_free(opaque);
+}
+
+static int
+virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
+{
+	struct rte_mbuf_ext_shared_info *shinfo = NULL;
+	uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
+	uint16_t buf_len;
+	rte_iova_t iova;
+	void *buf;
+
+	/* Try to use pkt buffer to store shinfo to reduce the amount of memory
+	 * required, otherwise store shinfo in the new buffer.
+	 */
+	if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
+		shinfo = rte_pktmbuf_mtod(pkt,
+					  struct rte_mbuf_ext_shared_info *);
+	else {
+		total_len += sizeof(*shinfo) + sizeof(uintptr_t);
+		total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+	}
+
+	if (unlikely(total_len > UINT16_MAX))
+		return -ENOSPC;
+
+	buf_len = total_len;
+	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+	if (unlikely(buf == NULL))
+		return -ENOMEM;
+
+	/* Initialize shinfo */
+	if (shinfo) {
+		shinfo->free_cb = virtio_dev_extbuf_free;
+		shinfo->fcb_opaque = buf;
+		rte_mbuf_ext_refcnt_set(shinfo, 1);
+	} else {
+		shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+					      virtio_dev_extbuf_free, buf);
+		if (unlikely(shinfo == NULL)) {
+			rte_free(buf);
+			RTE_LOG(ERR, VHOST_DATA, "Failed to init shinfo\n");
+			return -1;
+		}
+	}
+
+	iova = rte_malloc_virt2iova(buf);
+	rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
+	rte_pktmbuf_reset_headroom(pkt);
+
+	return 0;
+}
+
+/*
+ * Allocate a host supported pktmbuf.
+ */
+static __rte_always_inline struct rte_mbuf *
+virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
+			 uint32_t data_len)
+{
+	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+	if (unlikely(pkt == NULL))
+		return NULL;
+
+	if (rte_pktmbuf_tailroom(pkt) >= data_len)
+		return pkt;
+
+	/* attach an external buffer if supported */
+	if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
+		return pkt;
+
+	/* check if chained buffers are allowed */
+	if (!dev->linearbuf)
+		return pkt;
+
+	/* Data doesn't fit into the buffer and the host supports
+	 * only linear buffers
+	 */
+	rte_pktmbuf_free(pkt);
+
+	return NULL;
+}
+
 static __rte_noinline uint16_t
 virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
@@ -1343,26 +1430,23 @@  virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	for (i = 0; i < count; i++) {
 		struct buf_vector buf_vec[BUF_VECTOR_MAX];
 		uint16_t head_idx;
-		uint32_t dummy_len;
+		uint32_t buf_len;
 		uint16_t nr_vec = 0;
 		int err;
 
 		if (unlikely(fill_vec_buf_split(dev, vq,
 						vq->last_avail_idx + i,
 						&nr_vec, buf_vec,
-						&head_idx, &dummy_len,
+						&head_idx, &buf_len,
 						VHOST_ACCESS_RO) < 0))
 			break;
 
 		if (likely(dev->dequeue_zero_copy == 0))
 			update_shadow_used_ring_split(vq, head_idx, 0);
 
-		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
-		if (unlikely(pkts[i] == NULL)) {
-			RTE_LOG(ERR, VHOST_DATA,
-				"Failed to allocate memory for mbuf.\n");
+		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, buf_len);
+		if (unlikely(pkts[i] == NULL))
 			break;
-		}
 
 		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i],
 				mbuf_pool);
@@ -1451,14 +1535,14 @@  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	for (i = 0; i < count; i++) {
 		struct buf_vector buf_vec[BUF_VECTOR_MAX];
 		uint16_t buf_id;
-		uint32_t dummy_len;
+		uint32_t buf_len;
 		uint16_t desc_count, nr_vec = 0;
 		int err;
 
 		if (unlikely(fill_vec_buf_packed(dev, vq,
 						vq->last_avail_idx, &desc_count,
 						buf_vec, &nr_vec,
-						&buf_id, &dummy_len,
+						&buf_id, &buf_len,
 						VHOST_ACCESS_RO) < 0))
 			break;
 
@@ -1466,12 +1550,9 @@  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			update_shadow_used_ring_packed(vq, buf_id, 0,
 					desc_count);
 
-		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
-		if (unlikely(pkts[i] == NULL)) {
-			RTE_LOG(ERR, VHOST_DATA,
-				"Failed to allocate memory for mbuf.\n");
+		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, buf_len);
+		if (unlikely(pkts[i] == NULL))
 			break;
-		}
 
 		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i],
 				mbuf_pool);