
[0/4] Support DMA-accelerated Tx operations for vhost-user PMD

Message ID 1584436885-18651-1-git-send-email-jiayu.hu@intel.com (mailing list archive)

Message

Hu, Jiayu March 17, 2020, 9:21 a.m. UTC
  The vhost-user PMD's Tx operations involve heavy data movement, and
performing large memory copies usually takes up a major share of CPU cycles
and becomes the hot spot. To offload these expensive memory operations from
the CPU, this patch set proposes to leverage DMA engines, e.g. I/OAT, a DMA
engine in Intel processors, to accelerate large copies for vhost-user.

Large copies are offloaded from the CPU to the DMA engine in an asynchronous
manner: the CPU just submits copy jobs to the DMA engine without waiting
for their completion. Thus, there is no CPU intervention during data
transfer, and we can save precious CPU cycles and improve the overall
throughput of vhost-user PMD based applications, like OVS. During packet
transmission, the PMD offloads large copies to the DMA engine and performs
small copies on the CPU, because of the startup overhead associated with
the DMA engine.
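
As a rough illustration of this hybrid strategy (a minimal sketch only; the
threshold value and the dma_submit_copy() helper are assumed placeholders,
not the patch's actual code):

/* Minimal sketch of the hybrid CPU/DMA dispatch described above.
 * DMA_COPY_THRESHOLD and dma_submit_copy() are illustrative assumptions,
 * not identifiers from this patch set. */
#include <stddef.h>
#include <string.h>

#define DMA_COPY_THRESHOLD 256	/* assumed cutoff for offloading a copy */

/* Hypothetical stand-in for enqueueing a copy job on a DMA channel; a real
 * implementation would call the I/OAT rawdev enqueue routine and return a
 * negative value when the descriptor ring is full. */
static int
dma_submit_copy(void *dst, const void *src, size_t len)
{
	(void)dst; (void)src; (void)len;
	return -1;	/* stub: behave as if the DMA ring were full */
}

/* Copy one descriptor's worth of data: large copies are submitted to the
 * DMA engine asynchronously, small copies (and DMA failures) use the CPU. */
static void
copy_desc(void *dst, const void *src, size_t len)
{
	if (len >= DMA_COPY_THRESHOLD && dma_submit_copy(dst, src, len) == 0)
		return;	/* submitted; completion is polled later */
	memcpy(dst, src, len);
}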

The vhost-user PMD is designed to support various DMA engines, but it
currently supports only I/OAT devices. In addition, I/OAT acceleration is
only enabled for Tx operations of split rings. Users can explicitly assign
an I/OAT device to a queue with the 'dmas' parameter. However, one I/OAT
device can only be used by one queue, and a queue can use only one I/OAT
device at a time.

We measured the performance with testpmd. With 1024-byte packets, the
DMA-enabled vhost-user PMD improves throughput by around 20%~30% over the
original SW data path in the VM2VM and PVP cases. Furthermore, with larger
packets, the throughput improvement is higher.

Jiayu Hu (4):
  vhost: populate guest memory for DMA-accelerated vhost-user
  net/vhost: setup vrings for DMA-accelerated datapath
  net/vhost: leverage DMA engines to accelerate Tx operations
  doc: add I/OAT acceleration support for vhost-user PMD

 doc/guides/nics/vhost.rst         |  14 +
 drivers/Makefile                  |   2 +-
 drivers/net/vhost/Makefile        |   6 +-
 drivers/net/vhost/internal.h      | 160 +++++++
 drivers/net/vhost/meson.build     |   5 +-
 drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
 drivers/net/vhost/virtio_net.c    | 861 ++++++++++++++++++++++++++++++++++++++
 drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
 lib/librte_vhost/rte_vhost.h      |   1 +
 lib/librte_vhost/socket.c         |  20 +
 lib/librte_vhost/vhost.h          |   2 +
 lib/librte_vhost/vhost_user.c     |   3 +-
 12 files changed, 1597 insertions(+), 73 deletions(-)
 create mode 100644 drivers/net/vhost/internal.h
 create mode 100644 drivers/net/vhost/virtio_net.c
 create mode 100644 drivers/net/vhost/virtio_net.h
  

Comments

Maxime Coquelin March 17, 2020, 9:53 a.m. UTC | #1
Hi Jiayu,

On 3/17/20 10:21 AM, Jiayu Hu wrote:
> In vhost-user PMD's Tx operations, where data movement is heavily involved,
> performing large memory copies usually takes up a major part of CPU cycles
> and becomes the hot spot. To offload expensive memory operations from the
> CPU, this patch set proposes to leverage DMA engines, e.g., I/OAT, a DMA
> engine in the Intel's processor, to accelerate large copies for vhost-user.
> 
> Large copies are offloaded from the CPU to the DMA in an asynchronous
> manner. The CPU just submits copy jobs to the DMA but without waiting
> for its copy completion. Thus, there is no CPU intervention during data
> transfer; we can save precious CPU cycles and improve the overall
> throughput for vhost-user PMD based applications, like OVS. During
> packet transmission, it offloads large copies to the DMA and performs
> small copies by the CPU, due to startup overheads associated with the DMA.
> 
> vhost-user PMD is able to support various DMA engines, but it just
> supports I/OAT devices currently. In addition, I/OAT acceleration is only
> enabled for Tx operations of split rings. Users can explicitly assign a
> I/OAT device to a queue by the parameter 'dmas'. However, one I/OAT device
> can only be used by one queue, and a queue can use one I/OAT device at a
> time.
> 
> We measure the performance in testpmd. With 1024 bytes packets, compared
> with the original SW data path, DMA-enabled vhost-user PMD can improve
> the throughput around 20%~30% in the VM2VM and PVP cases. Furthermore,
> with larger packets, the throughput improvement will be higher.


I'm not sure it should be done like that for several reasons.

First, it seems really complex for the user to get the command line
right. There is no mention in the doc patch of how to bind the DMAs to
the DPDK application. Are all the DMAs on the system capable of doing
this?
I think it should be made transparent to the user, who should not have
to specify the DMA device address on the command line. The user should just
pass a devarg specifying they want to use DMAs, if available.

Second, it looks too vendor-specific. IMHO, we should have a DMA
framework, so that the driver can request DMA channels based on
capabilities.

Also, I don't think implementing ring processing in the Vhost PMD is
welcome; the Vhost PMD should just be a wrapper for the Vhost library. Doing
that in the Vhost PMD causes code duplication and will be a maintenance
burden in the long run.

As IOAT is a kind of acceleration, why not implement it through the vDPA
framework? The vDPA framework should be extended to support this kind of
acceleration, which requires some CPU processing, as opposed to the full
offload of ring processing that it only supports today.

Kind regards,
Maxime

> Jiayu Hu (4):
>   vhost: populate guest memory for DMA-accelerated vhost-user
>   net/vhost: setup vrings for DMA-accelerated datapath
>   net/vhost: leverage DMA engines to accelerate Tx operations
>   doc: add I/OAT acceleration support for vhost-user PMD
> 
>  doc/guides/nics/vhost.rst         |  14 +
>  drivers/Makefile                  |   2 +-
>  drivers/net/vhost/Makefile        |   6 +-
>  drivers/net/vhost/internal.h      | 160 +++++++
>  drivers/net/vhost/meson.build     |   5 +-
>  drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
>  drivers/net/vhost/virtio_net.c    | 861 ++++++++++++++++++++++++++++++++++++++
>  drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
>  lib/librte_vhost/rte_vhost.h      |   1 +
>  lib/librte_vhost/socket.c         |  20 +
>  lib/librte_vhost/vhost.h          |   2 +
>  lib/librte_vhost/vhost_user.c     |   3 +-
>  12 files changed, 1597 insertions(+), 73 deletions(-)
>  create mode 100644 drivers/net/vhost/internal.h
>  create mode 100644 drivers/net/vhost/virtio_net.c
>  create mode 100644 drivers/net/vhost/virtio_net.h
>
  
Hu, Jiayu March 19, 2020, 7:33 a.m. UTC | #2
Hi Maxime,

Thanks for your comments. Replies are inline.

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, March 17, 2020 5:54 PM
> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
> Cc: Ye, Xiaolong <xiaolong.ye@intel.com>; Wang, Zhihong
> <zhihong.wang@intel.com>
> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-
> user PMD
> 
> Hi Jiayu,
> 
> On 3/17/20 10:21 AM, Jiayu Hu wrote:
> > In vhost-user PMD's Tx operations, where data movement is heavily
> involved,
> > performing large memory copies usually takes up a major part of CPU
> cycles
> > and becomes the hot spot. To offload expensive memory operations from
> the
> > CPU, this patch set proposes to leverage DMA engines, e.g., I/OAT, a DMA
> > engine in the Intel's processor, to accelerate large copies for vhost-user.
> >
> > Large copies are offloaded from the CPU to the DMA in an asynchronous
> > manner. The CPU just submits copy jobs to the DMA but without waiting
> > for its copy completion. Thus, there is no CPU intervention during data
> > transfer; we can save precious CPU cycles and improve the overall
> > throughput for vhost-user PMD based applications, like OVS. During
> > packet transmission, it offloads large copies to the DMA and performs
> > small copies by the CPU, due to startup overheads associated with the DMA.
> >
> > vhost-user PMD is able to support various DMA engines, but it just
> > supports I/OAT devices currently. In addition, I/OAT acceleration is only
> > enabled for Tx operations of split rings. Users can explicitly assign a
> > I/OAT device to a queue by the parameter 'dmas'. However, one I/OAT
> device
> > can only be used by one queue, and a queue can use one I/OAT device at a
> > time.
> >
> > We measure the performance in testpmd. With 1024 bytes packets,
> compared
> > with the original SW data path, DMA-enabled vhost-user PMD can improve
> > the throughput around 20%~30% in the VM2VM and PVP cases.
> Furthermore,
> > with larger packets, the throughput improvement will be higher.
> 
> 
> I'm not sure it should be done like that for several reasons.
> 
> First, it seems really complex for the user to get the command line
> right. There is no mention in the doc patch on how to bind the DMAs to
> the DPDK application. Are all the DMAs on the system capable of doing
> it?

DMA engines in Intel CPUs are able to move data within system memory.
Currently we have I/OAT, and we will have DSA in the future.

> I think it should be made transparent to the user, who should not have
> to specify the DMA device address in command line. The user should just
> pass a devarg specifying he wants to use DMAs, if available.

What do you think of replacing the DMA address with specific DMA capabilities,
like "dmas=[txq0@DMACOPY]"? As I/OAT only supports data movement, we can
just provide a basic DMA copy capability now, but when there are more DMA
devices, we can add further capabilities to the devargs later.
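
(For illustration only, parsing such a capability-style devarg could look
like the sketch below; the handler is a hypothetical outline, not the PMD's
actual parser, and it assumes the value reaches the handler roughly as
"txq0@DMACOPY" with an optional leading bracket.)

/* Hypothetical sketch of parsing dmas=[txq0@DMACOPY]; not the PMD's code. */
#include <stdint.h>
#include <stdio.h>
#include <rte_kvargs.h>

#define DMAS_ARG "dmas"

/* Extract the Tx queue id that requests the (assumed) DMACOPY capability. */
static int
parse_dmas(const char *key, const char *value, void *opaque)
{
	uint16_t *dma_qid = opaque;	/* assumed per-device state */
	unsigned int qid;

	(void)key;
	if (*value == '[')	/* simplified bracket handling */
		value++;
	if (sscanf(value, "txq%u@DMACOPY", &qid) != 1)
		return -1;
	*dma_qid = (uint16_t)qid;
	return 0;
}

static int
vhost_parse_dma_devargs(const char *args, uint16_t *dma_qid)
{
	static const char * const valid[] = { DMAS_ARG, NULL };
	struct rte_kvargs *kvlist = rte_kvargs_parse(args, valid);
	int ret;

	if (kvlist == NULL)
		return -1;
	ret = rte_kvargs_process(kvlist, DMAS_ARG, parse_dmas, dma_qid);
	rte_kvargs_free(kvlist);
	return ret;
}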

> 
> Second, it looks too much vendor-specific. IMHO, we should have a DMA
> framework, so that the driver can request DMA channels based on
> capabilities.

We only have one DMA engine, I/OAT, in DPDK, and it is implemented as
a rawdev. IMO, it will be very hard to provide a generic DMA abstraction
currently. In addition, the I/OAT-specific API is called inside the vhost-user
PMD; we can replace these function calls when we have a DMA framework in
the future, and users will be unaware of the change. Does that make sense to you?

> 
> Also, I don't think implementing ring processing in the Vhost PMD is
> welcome, Vhost PMD should just be a wrapper for the Vhost library. Doing
> that in Vhost PMD causes code duplication, and will be a maintenance
> burden on the long run.
> 
> As IOAT is a kind of acceleration, why not implement it through the vDPA
> framework? vDPA framework should be extended to support this kind of
> acceleration which requires some CPU processing, as opposed to full
> offload of the ring processing as it only supports today.

The main reason for implementing the data path in the vhost PMD is to avoid
impacting the SW data path in the vhost library. Even if we implement it as an
instance of vDPA, we still have to implement the data path in a new vdev PMD,
as the DMA engine just accelerates memory copies and all ring operations have
to be done by the CPU. So the code duplication issue remains.

Thanks,
Jiayu

> 
> Kind regards,
> Maxime
> 
> > Jiayu Hu (4):
> >   vhost: populate guest memory for DMA-accelerated vhost-user
> >   net/vhost: setup vrings for DMA-accelerated datapath
> >   net/vhost: leverage DMA engines to accelerate Tx operations
> >   doc: add I/OAT acceleration support for vhost-user PMD
> >
> >  doc/guides/nics/vhost.rst         |  14 +
> >  drivers/Makefile                  |   2 +-
> >  drivers/net/vhost/Makefile        |   6 +-
> >  drivers/net/vhost/internal.h      | 160 +++++++
> >  drivers/net/vhost/meson.build     |   5 +-
> >  drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
> >  drivers/net/vhost/virtio_net.c    | 861
> ++++++++++++++++++++++++++++++++++++++
> >  drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
> >  lib/librte_vhost/rte_vhost.h      |   1 +
> >  lib/librte_vhost/socket.c         |  20 +
> >  lib/librte_vhost/vhost.h          |   2 +
> >  lib/librte_vhost/vhost_user.c     |   3 +-
> >  12 files changed, 1597 insertions(+), 73 deletions(-)
> >  create mode 100644 drivers/net/vhost/internal.h
> >  create mode 100644 drivers/net/vhost/virtio_net.c
> >  create mode 100644 drivers/net/vhost/virtio_net.h
> >
  
Maxime Coquelin March 19, 2020, 9:10 a.m. UTC | #3
Hi Jiayu,

On 3/19/20 8:33 AM, Hu, Jiayu wrote:
> Hi Maxime,
> 
> Thanks for your comments. Replies are inline.
> 
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Tuesday, March 17, 2020 5:54 PM
>> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
>> Cc: Ye, Xiaolong <xiaolong.ye@intel.com>; Wang, Zhihong
>> <zhihong.wang@intel.com>
>> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-
>> user PMD
>>
>> Hi Jiayu,
>>
>> On 3/17/20 10:21 AM, Jiayu Hu wrote:
>>> In vhost-user PMD's Tx operations, where data movement is heavily
>> involved,
>>> performing large memory copies usually takes up a major part of CPU
>> cycles
>>> and becomes the hot spot. To offload expensive memory operations from
>> the
>>> CPU, this patch set proposes to leverage DMA engines, e.g., I/OAT, a DMA
>>> engine in the Intel's processor, to accelerate large copies for vhost-user.
>>>
>>> Large copies are offloaded from the CPU to the DMA in an asynchronous
>>> manner. The CPU just submits copy jobs to the DMA but without waiting
>>> for its copy completion. Thus, there is no CPU intervention during data
>>> transfer; we can save precious CPU cycles and improve the overall
>>> throughput for vhost-user PMD based applications, like OVS. During
>>> packet transmission, it offloads large copies to the DMA and performs
>>> small copies by the CPU, due to startup overheads associated with the DMA.
>>>
>>> vhost-user PMD is able to support various DMA engines, but it just
>>> supports I/OAT devices currently. In addition, I/OAT acceleration is only
>>> enabled for Tx operations of split rings. Users can explicitly assign a
>>> I/OAT device to a queue by the parameter 'dmas'. However, one I/OAT
>> device
>>> can only be used by one queue, and a queue can use one I/OAT device at a
>>> time.
>>>
>>> We measure the performance in testpmd. With 1024 bytes packets,
>> compared
>>> with the original SW data path, DMA-enabled vhost-user PMD can improve
>>> the throughput around 20%~30% in the VM2VM and PVP cases.
>> Furthermore,
>>> with larger packets, the throughput improvement will be higher.
>>
>>
>> I'm not sure it should be done like that for several reasons.
>>
>> First, it seems really complex for the user to get the command line
>> right. There is no mention in the doc patch on how to bind the DMAs to
>> the DPDK application. Are all the DMAs on the system capable of doing
>> it?
> 
> DMA engines in Intel CPU are able to move data within system memory.
> Currently, we have I/OAT and we will have DSA in the future.

OK, can you give me an example of how many I/OAT instances there are on a
given CPU?

>> I think it should be made transparent to the user, who should not have
>> to specify the DMA device address in command line. The user should just
>> pass a devarg specifying he wants to use DMAs, if available.
> 
> How do you think of replacing DMA address with specific DMA capabilities, like
> "dmas=[txq0@DMACOPY]". As I/OAT only supports data movement, we can
> just provide basic DMA copy ability now. But when there are more DMA devices,
>  we can add capabilities in devargs later.
"dmas=[txq0@DMACOPY]" is still too complex IMHO. We should just have a
flag to enable or not DMA (tx_dma=1 / tx_dma=0), and this would be used
for all queues as we do for zero-copy.

>>
>> Second, it looks too much vendor-specific. IMHO, we should have a DMA
>> framework, so that the driver can request DMA channels based on
>> capabilities.
> 
> We only have one DMA engine, I/OAT, in DPDK, and it is implemented as
> a rawdev. IMO, it will be very hard to provide a generic DMA abstraction
> currently. In addition, I/OAT specific API is called inside vhost-user PMD,
> we can replace these function calls when we have a DMA framework in
> the future. Users are unaware of the changes. Does it make sense to you?

Having an abstraction might be hard, but it does not seem impossible.
Such a DMA abstraction has been done in the kernel for IOAT. For example:
https://lore.kernel.org/patchwork/cover/56714/

>>
>> Also, I don't think implementing ring processing in the Vhost PMD is
>> welcome, Vhost PMD should just be a wrapper for the Vhost library. Doing
>> that in Vhost PMD causes code duplication, and will be a maintenance
>> burden on the long run.
>>
>> As IOAT is a kind of acceleration, why not implement it through the vDPA
>> framework? vDPA framework should be extended to support this kind of
>> acceleration which requires some CPU processing, as opposed to full
>> offload of the ring processing as it only supports today.
> 
> The main reason of implementing data path in vhost PMD is to avoid impacting
> SW data path in vhost library. Even if we implement it as an instance of vDPA,
> we also have to implement data path in a new vdev PMD, as DMA just accelerates
> memory copy and all ring operations have to be done by the CPU. There is still the
> code duplication issue.

Ok, so what about:

Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
dequeue.

Adding lib/librte_vhost/ioat.c, which would implement the dma_enqueue and
dma_dequeue callbacks for IOAT. As it would live in the vhost lib
directory, it would be easy to refactor the code to share as much as
possible and so avoid code duplication.

In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
call it instead of the SW datapath. It adds a few cycles, but this is
much more sane IMHO.
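
As a rough sketch of that dispatch idea (the structure and callback names
below are assumptions for illustration, not existing vhost library symbols):

/* Sketch of the proposed callback dispatch; none of these fields or helper
 * names exist in the vhost library today. */
#include <stdint.h>
#include <stddef.h>

struct rte_mbuf;

struct virtio_net_sketch {
	/* hypothetical DMA-accelerated datapath hooks, set by an ioat.c
	 * backend when a DMA channel is bound to the device */
	uint16_t (*dma_enqueue)(void *dev, uint16_t queue_id,
				struct rte_mbuf **pkts, uint16_t count);
	uint16_t (*dma_dequeue)(void *dev, uint16_t queue_id,
				struct rte_mbuf **pkts, uint16_t count);
};

/* stand-in for the existing SW enqueue datapath */
static uint16_t
sw_enqueue(void *dev, uint16_t queue_id, struct rte_mbuf **pkts,
	   uint16_t count)
{
	(void)dev; (void)queue_id; (void)pkts;
	return count;
}

static uint16_t
enqueue_burst(struct virtio_net_sketch *dev, uint16_t queue_id,
	      struct rte_mbuf **pkts, uint16_t count)
{
	/* one extra branch per burst: take the DMA path if a backend
	 * registered it, otherwise fall through to the SW datapath */
	if (dev->dma_enqueue != NULL)
		return dev->dma_enqueue(dev, queue_id, pkts, count);
	return sw_enqueue(dev, queue_id, pkts, count);
}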

What do you think?

Thanks,
Maxime
> Thanks,
> Jiayu
> 
>>
>> Kind regards,
>> Maxime
>>
>>> Jiayu Hu (4):
>>>   vhost: populate guest memory for DMA-accelerated vhost-user
>>>   net/vhost: setup vrings for DMA-accelerated datapath
>>>   net/vhost: leverage DMA engines to accelerate Tx operations
>>>   doc: add I/OAT acceleration support for vhost-user PMD
>>>
>>>  doc/guides/nics/vhost.rst         |  14 +
>>>  drivers/Makefile                  |   2 +-
>>>  drivers/net/vhost/Makefile        |   6 +-
>>>  drivers/net/vhost/internal.h      | 160 +++++++
>>>  drivers/net/vhost/meson.build     |   5 +-
>>>  drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
>>>  drivers/net/vhost/virtio_net.c    | 861
>> ++++++++++++++++++++++++++++++++++++++
>>>  drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
>>>  lib/librte_vhost/rte_vhost.h      |   1 +
>>>  lib/librte_vhost/socket.c         |  20 +
>>>  lib/librte_vhost/vhost.h          |   2 +
>>>  lib/librte_vhost/vhost_user.c     |   3 +-
>>>  12 files changed, 1597 insertions(+), 73 deletions(-)
>>>  create mode 100644 drivers/net/vhost/internal.h
>>>  create mode 100644 drivers/net/vhost/virtio_net.c
>>>  create mode 100644 drivers/net/vhost/virtio_net.h
>>>
>
  
Hu, Jiayu March 19, 2020, 11:47 a.m. UTC | #4
Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Thursday, March 19, 2020 5:10 PM
> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
> Cc: Ye, Xiaolong <xiaolong.ye@intel.com>; Wang, Zhihong
> <zhihong.wang@intel.com>
> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-
> user PMD
> 
> Hi Jiayu,
> 
> On 3/19/20 8:33 AM, Hu, Jiayu wrote:
> > Hi Maxime,
> >
> > Thanks for your comments. Replies are inline.
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> >> Sent: Tuesday, March 17, 2020 5:54 PM
> >> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
> >> Cc: Ye, Xiaolong <xiaolong.ye@intel.com>; Wang, Zhihong
> >> <zhihong.wang@intel.com>
> >> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for
> vhost-
> >> user PMD
> >>
> >> Hi Jiayu,
> >>
> >> On 3/17/20 10:21 AM, Jiayu Hu wrote:
> >>> In vhost-user PMD's Tx operations, where data movement is heavily
> >> involved,
> >>> performing large memory copies usually takes up a major part of CPU
> >> cycles
> >>> and becomes the hot spot. To offload expensive memory operations
> from
> >> the
> >>> CPU, this patch set proposes to leverage DMA engines, e.g., I/OAT, a
> DMA
> >>> engine in the Intel's processor, to accelerate large copies for vhost-user.
> >>>
> >>> Large copies are offloaded from the CPU to the DMA in an asynchronous
> >>> manner. The CPU just submits copy jobs to the DMA but without waiting
> >>> for its copy completion. Thus, there is no CPU intervention during data
> >>> transfer; we can save precious CPU cycles and improve the overall
> >>> throughput for vhost-user PMD based applications, like OVS. During
> >>> packet transmission, it offloads large copies to the DMA and performs
> >>> small copies by the CPU, due to startup overheads associated with the
> DMA.
> >>>
> >>> vhost-user PMD is able to support various DMA engines, but it just
> >>> supports I/OAT devices currently. In addition, I/OAT acceleration is only
> >>> enabled for Tx operations of split rings. Users can explicitly assign a
> >>> I/OAT device to a queue by the parameter 'dmas'. However, one I/OAT
> >> device
> >>> can only be used by one queue, and a queue can use one I/OAT device at
> a
> >>> time.
> >>>
> >>> We measure the performance in testpmd. With 1024 bytes packets,
> >> compared
> >>> with the original SW data path, DMA-enabled vhost-user PMD can
> improve
> >>> the throughput around 20%~30% in the VM2VM and PVP cases.
> >> Furthermore,
> >>> with larger packets, the throughput improvement will be higher.
> >>
> >>
> >> I'm not sure it should be done like that for several reasons.
> >>
> >> First, it seems really complex for the user to get the command line
> >> right. There is no mention in the doc patch on how to bind the DMAs to
> >> the DPDK application. Are all the DMAs on the system capable of doing
> >> it?
> >
> > DMA engines in Intel CPU are able to move data within system memory.
> > Currently, we have I/OAT and we will have DSA in the future.
> 
> OK, can you give me an example of how many I/OAT instances on a given
> CPU?

One Xeon Platinum 8180 CPU has 8 I/OAT instances.

> 
> >> I think it should be made transparent to the user, who should not have
> >> to specify the DMA device address in command line. The user should just
> >> pass a devarg specifying he wants to use DMAs, if available.
> >
> > How do you think of replacing DMA address with specific DMA capabilities,
> like
> > "dmas=[txq0@DMACOPY]". As I/OAT only supports data movement, we
> can
> > just provide basic DMA copy ability now. But when there are more DMA
> devices,
> >  we can add capabilities in devargs later.
> "dmas=[txq0@DMACOPY]" is still too complex IMHO. We should just have a
> flag to enable or not DMA (tx_dma=1 / tx_dma=0), and this would be used
> for all queues as we do for zero-copy.
> 
> >>
> >> Second, it looks too much vendor-specific. IMHO, we should have a DMA
> >> framework, so that the driver can request DMA channels based on
> >> capabilities.
> >
> > We only have one DMA engine, I/OAT, in DPDK, and it is implemented as
> > a rawdev. IMO, it will be very hard to provide a generic DMA abstraction
> > currently. In addition, I/OAT specific API is called inside vhost-user PMD,
> > we can replace these function calls when we have a DMA framework in
> > the future. Users are unaware of the changes. Does it make sense to you?
> 
> Having an abstraction might be hard, but it does not seem impossible.
> Such DMA abstraction has been done in the Kernel for IOAT. For example:
> https://lore.kernel.org/patchwork/cover/56714/
> 
> >>
> >> Also, I don't think implementing ring processing in the Vhost PMD is
> >> welcome, Vhost PMD should just be a wrapper for the Vhost library.
> Doing
> >> that in Vhost PMD causes code duplication, and will be a maintenance
> >> burden on the long run.
> >>
> >> As IOAT is a kind of acceleration, why not implement it through the vDPA
> >> framework? vDPA framework should be extended to support this kind of
> >> acceleration which requires some CPU processing, as opposed to full
> >> offload of the ring processing as it only supports today.
> >
> > The main reason of implementing data path in vhost PMD is to avoid
> impacting
> > SW data path in vhost library. Even if we implement it as an instance of
> vDPA,
> > we also have to implement data path in a new vdev PMD, as DMA just
> accelerates
> > memory copy and all ring operations have to be done by the CPU. There is
> still the
> > code duplication issue.
> 
> Ok, so what about:
> 
> Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
> dequeue.
> 
> lib/librte_vhost/ioat.c which would implement dma_enqueue and
> dma_dequeue callback for IOAT. As it will live in the vhost lib
> directory, it will be easy to refactor the code to share as much as
> possible and so avoid code duplication.
> 
> In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
> call it instead of the SW datapath. It adds a few cycle, but this is
> much more sane IMHO.

The problem is that the current semantics of the rte_vhost_enqueue/dequeue
APIs conflict with the I/OAT-accelerated data path. To improve performance,
the I/OAT works in an asynchronous manner, where the CPU just submits
copy jobs to the I/OAT without waiting for their completion. For
rte_vhost_enqueue_burst, users cannot reuse the enqueued pktmbufs when it
returns, as the I/OAT may still be using them. For rte_vhost_dequeue_burst,
users will not get incoming packets while the I/OAT is still performing the
packet copies. As you can see, when I/OAT acceleration is enabled, the
semantics of the two APIs change. Keeping the same API names while changing
their semantics may confuse users, IMHO.

Thanks,
Jiayu

> 
> What do you think?
> 
> Thanks,
> Maxime
> > Thanks,
> > Jiayu
> >
> >>
> >> Kind regards,
> >> Maxime
> >>
> >>> Jiayu Hu (4):
> >>>   vhost: populate guest memory for DMA-accelerated vhost-user
> >>>   net/vhost: setup vrings for DMA-accelerated datapath
> >>>   net/vhost: leverage DMA engines to accelerate Tx operations
> >>>   doc: add I/OAT acceleration support for vhost-user PMD
> >>>
> >>>  doc/guides/nics/vhost.rst         |  14 +
> >>>  drivers/Makefile                  |   2 +-
> >>>  drivers/net/vhost/Makefile        |   6 +-
> >>>  drivers/net/vhost/internal.h      | 160 +++++++
> >>>  drivers/net/vhost/meson.build     |   5 +-
> >>>  drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
> >>>  drivers/net/vhost/virtio_net.c    | 861
> >> ++++++++++++++++++++++++++++++++++++++
> >>>  drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
> >>>  lib/librte_vhost/rte_vhost.h      |   1 +
> >>>  lib/librte_vhost/socket.c         |  20 +
> >>>  lib/librte_vhost/vhost.h          |   2 +
> >>>  lib/librte_vhost/vhost_user.c     |   3 +-
> >>>  12 files changed, 1597 insertions(+), 73 deletions(-)
> >>>  create mode 100644 drivers/net/vhost/internal.h
> >>>  create mode 100644 drivers/net/vhost/virtio_net.c
> >>>  create mode 100644 drivers/net/vhost/virtio_net.h
> >>>
> >
  
Maxime Coquelin March 26, 2020, 7:52 a.m. UTC | #5
Hi Jiayu,

On 3/19/20 12:47 PM, Hu, Jiayu wrote:

>>
>> Ok, so what about:
>>
>> Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
>> dequeue.
>>
>> lib/librte_vhost/ioat.c which would implement dma_enqueue and
>> dma_dequeue callback for IOAT. As it will live in the vhost lib
>> directory, it will be easy to refactor the code to share as much as
>> possible and so avoid code duplication.
>>
>> In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
>> call it instead of the SW datapath. It adds a few cycle, but this is
>> much more sane IMHO.
> 
> The problem is that current semantics of rte_vhost_enqueue/dequeue API
> are conflict with I/OAT accelerated data path. To improve the performance,
> the I/OAT works in an asynchronous manner, where the CPU just submits
> copy jobs to the I/OAT without waiting for its copy completion. For
> rte_vhost_enqueue_burst, users cannot reuse enqueued pktmbufs when it
> returns, as the I/OAT may still use them. For rte_vhost_dequeue_burst,
> users will not get incoming packets as the I/OAT is still performing packet
> copies. As you can see, when enabling I/OAT acceleration, the semantics of
> the two API are changed. If we keep the same API name but changing their
> semantic, this may confuse users, IMHO.

Ok, so it is basically the same as zero-copy for the dequeue path, right?
If a new API is necessary, then it would be better to add it to the Vhost
library for async enqueue/dequeue.
It could also be used for Tx zero-copy, and then the sync version would
save some cycles, as we could remove the zero-copy support there.

What do you think?

I really object to implementing vring handling in the Vhost PMD; this is
the role of the Vhost library.

Thanks,
Maxime
  
Hu, Jiayu March 26, 2020, 8:25 a.m. UTC | #6
Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Thursday, March 26, 2020 3:53 PM
> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
> Cc: Ye, Xiaolong <xiaolong.ye@intel.com>; Wang, Zhihong
> <zhihong.wang@intel.com>
> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-
> user PMD
> 
> Hi Jiayu,
> 
> On 3/19/20 12:47 PM, Hu, Jiayu wrote:
> 
> >>
> >> Ok, so what about:
> >>
> >> Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
> >> dequeue.
> >>
> >> lib/librte_vhost/ioat.c which would implement dma_enqueue and
> >> dma_dequeue callback for IOAT. As it will live in the vhost lib
> >> directory, it will be easy to refactor the code to share as much as
> >> possible and so avoid code duplication.
> >>
> >> In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
> >> call it instead of the SW datapath. It adds a few cycle, but this is
> >> much more sane IMHO.
> >
> > The problem is that current semantics of rte_vhost_enqueue/dequeue API
> > are conflict with I/OAT accelerated data path. To improve the performance,
> > the I/OAT works in an asynchronous manner, where the CPU just submits
> > copy jobs to the I/OAT without waiting for its copy completion. For
> > rte_vhost_enqueue_burst, users cannot reuse enqueued pktmbufs when
> it
> > returns, as the I/OAT may still use them. For rte_vhost_dequeue_burst,
> > users will not get incoming packets as the I/OAT is still performing packet
> > copies. As you can see, when enabling I/OAT acceleration, the semantics of
> > the two API are changed. If we keep the same API name but changing their
> > semantic, this may confuse users, IMHO.
> 
> Ok, so it is basically the same as zero-copy for dequeue path, right?
> If a new API is necessary, then it would be better to add it in Vhost
> library for async enqueue/dequeue.
> It could be used also for Tx zero-copy, and so the sync version would
> save some cycles as we could remove the zero-copy support there.
> 
> What do you think?

Yes, you are right. The better way is to provide a new API with asynchronous
semantics in the vhost library. In addition, the vhost library should also
provide DMA operation callbacks to avoid using vendor-specific APIs. The
asynchronous API may look like rte_vhost_try_enqueue_burst() and
rte_vhost_get_completed_packets(): the first one performs the enqueue logic,
and the second one returns to users the pktmbufs whose copies have all
completed. What do you think?
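
(For illustration, a caller of such an API might look like the sketch below;
both functions are only the proposal above, not an existing vhost API, and
the signatures are guesses.)

/* Hypothetical usage of the proposed asynchronous API; the two
 * rte_vhost_* prototypes below are the proposal, not existing functions. */
#include <stdint.h>
#include <rte_mbuf.h>

/* proposed: submit copies (possibly offloaded to DMA); the caller must not
 * reuse the mbufs until they are reported as completed */
uint16_t rte_vhost_try_enqueue_burst(int vid, uint16_t queue_id,
				     struct rte_mbuf **pkts, uint16_t count);
/* proposed: return mbufs whose copies (CPU or DMA) have all finished */
uint16_t rte_vhost_get_completed_packets(int vid, uint16_t queue_id,
					 struct rte_mbuf **pkts, uint16_t count);

#define BURST 32

static void
vhost_tx_poll(int vid, uint16_t queue_id, struct rte_mbuf **pkts,
	      uint16_t nb_pkts)
{
	struct rte_mbuf *done[BURST];
	uint16_t n, i;

	/* submit; vhost (and possibly the DMA) still owns these mbufs */
	rte_vhost_try_enqueue_burst(vid, queue_id, pkts, nb_pkts);

	/* on a later polling iteration: reclaim mbufs that completed */
	n = rte_vhost_get_completed_packets(vid, queue_id, done, BURST);
	for (i = 0; i < n; i++)
		rte_pktmbuf_free(done[i]);
}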

Thanks,
Jiayu

> 
> I really object to implement vring handling into the Vhost PMD, this is
> the role of the Vhost library.
> 
> Thanks,
> Maxime
  
Maxime Coquelin March 26, 2020, 8:47 a.m. UTC | #7
On 3/26/20 9:25 AM, Hu, Jiayu wrote:
> Hi Maxime,
> 
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Thursday, March 26, 2020 3:53 PM
>> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
>> Cc: Ye, Xiaolong <xiaolong.ye@intel.com>; Wang, Zhihong
>> <zhihong.wang@intel.com>
>> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-
>> user PMD
>>
>> Hi Jiayu,
>>
>> On 3/19/20 12:47 PM, Hu, Jiayu wrote:
>>
>>>>
>>>> Ok, so what about:
>>>>
>>>> Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
>>>> dequeue.
>>>>
>>>> lib/librte_vhost/ioat.c which would implement dma_enqueue and
>>>> dma_dequeue callback for IOAT. As it will live in the vhost lib
>>>> directory, it will be easy to refactor the code to share as much as
>>>> possible and so avoid code duplication.
>>>>
>>>> In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
>>>> call it instead of the SW datapath. It adds a few cycle, but this is
>>>> much more sane IMHO.
>>>
>>> The problem is that current semantics of rte_vhost_enqueue/dequeue API
>>> are conflict with I/OAT accelerated data path. To improve the performance,
>>> the I/OAT works in an asynchronous manner, where the CPU just submits
>>> copy jobs to the I/OAT without waiting for its copy completion. For
>>> rte_vhost_enqueue_burst, users cannot reuse enqueued pktmbufs when
>> it
>>> returns, as the I/OAT may still use them. For rte_vhost_dequeue_burst,
>>> users will not get incoming packets as the I/OAT is still performing packet
>>> copies. As you can see, when enabling I/OAT acceleration, the semantics of
>>> the two API are changed. If we keep the same API name but changing their
>>> semantic, this may confuse users, IMHO.
>>
>> Ok, so it is basically the same as zero-copy for dequeue path, right?
>> If a new API is necessary, then it would be better to add it in Vhost
>> library for async enqueue/dequeue.
>> It could be used also for Tx zero-copy, and so the sync version would
>> save some cycles as we could remove the zero-copy support there.
>>
>> What do you think?
> 
> Yes, you are right. The better way is to provide new API with asynchronous
> semantics in vhost library. In addition, the vhost library better provides DMA
> operation callbacks to avoid using vender specific API. The asynchronous API may
> look like rte_vhost_try_enqueue_burst() and rte_vhost_get_completed_packets().
> The first one is to perform enqueue logic, and the second one is to return
> pktmbufs whose all copies are completed to users. How do you think?

That looks good to me, great!
The only open point is the naming of the API. I need to think more about it,
but that does not prevent starting work on the implementation.

Regarding the initialization, I was thinking we could introduce new
flags to rte_vhost_driver_register:
- RTE_VHOST_USER_TX_DMA
- RTE_VHOST_USER_RX_DMA

Well, only Tx can be implemented for now, but the Rx flag can be
reserved.

The thing I'm not clear on is how we fall back to the sync API when no DMA
is available.

Should the user still call rte_vhost_try_enqueue_burst(), which would call
rte_vhost_enqueue_burst() directly if no DMA is available, with
rte_vhost_get_completed_packets() then returning all the mbufs?
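
(One way to picture that fallback, purely as a sketch; all names here are
hypothetical, not existing vhost library internals.)

/* With no DMA channel bound, the async call degrades to the existing
 * synchronous behaviour, so every enqueued mbuf is immediately reported
 * as completed. */
#include <stdint.h>

struct rte_mbuf;

struct vq_sketch {
	int dma_enabled;		/* no DMA bound => sync fallback */
	struct rte_mbuf *done[512];	/* completions handed back later */
	uint16_t nb_done;
};

/* stand-ins for the existing sync datapath and the DMA-assisted one */
uint16_t sync_enqueue(struct vq_sketch *vq, struct rte_mbuf **pkts,
		      uint16_t count);
uint16_t dma_enqueue(struct vq_sketch *vq, struct rte_mbuf **pkts,
		     uint16_t count);

static uint16_t
try_enqueue(struct vq_sketch *vq, struct rte_mbuf **pkts, uint16_t count)
{
	uint16_t n, i;

	if (!vq->dma_enabled) {
		/* sync fallback: copies are finished on return */
		n = sync_enqueue(vq, pkts, count);
		for (i = 0; i < n && vq->nb_done < 512; i++)
			vq->done[vq->nb_done++] = pkts[i];
		return n;
	}
	return dma_enqueue(vq, pkts, count);
}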

Thanks,
Maxime

> Thanks,
> Jiayu
> 
>>
>> I really object to implement vring handling into the Vhost PMD, this is
>> the role of the Vhost library.
>>
>> Thanks,
>> Maxime
>