Message ID | 1584436885-18651-1-git-send-email-jiayu.hu@intel.com (mailing list archive) |
---|---|
Headers |
From: Jiayu Hu <jiayu.hu@intel.com> To: dev@dpdk.org Cc: maxime.coquelin@redhat.com, xiaolong.ye@intel.com, zhihong.wang@intel.com, Jiayu Hu <jiayu.hu@intel.com> Date: Tue, 17 Mar 2020 05:21:21 -0400 Message-Id: <1584436885-18651-1-git-send-email-jiayu.hu@intel.com> Subject: [dpdk-dev] [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-user PMD List-Id: DPDK patches and discussions <dev.dpdk.org> List-Archive: <http://mails.dpdk.org/archives/dev/> |
Series | Support DMA-accelerated Tx operations for vhost-user PMD |
Message
Hu, Jiayu
March 17, 2020, 9:21 a.m. UTC
In vhost-user PMD's Tx operations, where data movement is heavily involved, performing large memory copies usually takes up a major share of CPU cycles and becomes the hot spot. To offload expensive memory operations from the CPU, this patch set proposes to leverage DMA engines, e.g. I/OAT, a DMA engine in Intel processors, to accelerate large copies for vhost-user.

Large copies are offloaded from the CPU to the DMA engine in an asynchronous manner: the CPU just submits copy jobs to the DMA without waiting for their completion. Thus, there is no CPU intervention during data transfer; we can save precious CPU cycles and improve the overall throughput for vhost-user PMD based applications, like OVS. During packet transmission, the PMD offloads large copies to the DMA engine and performs small copies with the CPU, due to the startup overheads associated with the DMA.

The vhost-user PMD is designed to support various DMA engines, but it currently supports only I/OAT devices. In addition, I/OAT acceleration is only enabled for Tx operations of split rings. Users can explicitly assign an I/OAT device to a queue via the parameter 'dmas'. However, one I/OAT device can only be used by one queue, and a queue can use one I/OAT device at a time.

We measured the performance in testpmd. With 1024-byte packets, compared with the original SW data path, the DMA-enabled vhost-user PMD improves throughput by around 20%~30% in the VM2VM and PVP cases. Furthermore, with larger packets, the throughput improvement is higher.
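As a rough illustration of the hybrid copy policy described above (large copies offloaded, small copies done inline by the CPU), here is a minimal, self-contained C sketch. The threshold value and all names are hypothetical, and the DMA submission is simulated with a synchronous memcpy so the sketch is runnable; the actual patch set submits descriptors through the I/OAT rawdev API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-off below which offloading is not worthwhile:
 * posting a descriptor to the engine has a fixed startup cost, so
 * short copies are cheaper on the CPU.  The value is illustrative. */
#define DMA_COPY_THRESHOLD 256

/* Stand-in for submitting a copy job to an I/OAT-like engine.  Here it
 * copies synchronously; a real implementation would only post the job
 * and return immediately, completing it later. */
static int dma_submit_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return 0;
}

/* Dispatch one copy: large copies go to the DMA engine, small ones are
 * done inline by the CPU.  Returns true when the copy was offloaded. */
static bool copy_dispatch(void *dst, const void *src, size_t len)
{
    if (len >= DMA_COPY_THRESHOLD)
        return dma_submit_copy(dst, src, len) == 0; /* offloaded */
    memcpy(dst, src, len);                          /* CPU copy   */
    return false;
}
```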
Jiayu Hu (4):
  vhost: populate guest memory for DMA-accelerated vhost-user
  net/vhost: setup vrings for DMA-accelerated datapath
  net/vhost: leverage DMA engines to accelerate Tx operations
  doc: add I/OAT acceleration support for vhost-user PMD

 doc/guides/nics/vhost.rst         |  14 +
 drivers/Makefile                  |   2 +-
 drivers/net/vhost/Makefile        |   6 +-
 drivers/net/vhost/internal.h      | 160 +++++++
 drivers/net/vhost/meson.build     |   5 +-
 drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
 drivers/net/vhost/virtio_net.c    | 861 ++++++++++++++++++++++++++++++++++++++
 drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
 lib/librte_vhost/rte_vhost.h      |   1 +
 lib/librte_vhost/socket.c         |  20 +
 lib/librte_vhost/vhost.h          |   2 +
 lib/librte_vhost/vhost_user.c     |   3 +-
 12 files changed, 1597 insertions(+), 73 deletions(-)
 create mode 100644 drivers/net/vhost/internal.h
 create mode 100644 drivers/net/vhost/virtio_net.c
 create mode 100644 drivers/net/vhost/virtio_net.h
Comments
Hi Jiayu,

On 3/17/20 10:21 AM, Jiayu Hu wrote:
> In vhost-user PMD's Tx operations, where data movement is heavily involved,
> performing large memory copies usually takes up a major part of CPU cycles
> and becomes the hot spot.

[snip]

> We measure the performance in testpmd. With 1024 bytes packets, compared
> with the original SW data path, DMA-enabled vhost-user PMD can improve
> the throughput around 20%~30% in the VM2VM and PVP cases. Furthermore,
> with larger packets, the throughput improvement will be higher.

I'm not sure it should be done like that, for several reasons.

First, it seems really complex for the user to get the command line
right. There is no mention in the doc patch on how to bind the DMAs to
the DPDK application. Are all the DMAs on the system capable of doing
it?

I think it should be made transparent to the user, who should not have
to specify the DMA device address on the command line. The user should
just pass a devarg specifying he wants to use DMAs, if available.

Second, it looks too vendor-specific. IMHO, we should have a DMA
framework, so that the driver can request DMA channels based on
capabilities.

Also, I don't think implementing ring processing in the Vhost PMD is
welcome; the Vhost PMD should just be a wrapper around the Vhost
library. Doing that in the Vhost PMD causes code duplication and will be
a maintenance burden in the long run.

As IOAT is a kind of acceleration, why not implement it through the vDPA
framework? The vDPA framework should be extended to support this kind of
acceleration, which requires some CPU processing, as opposed to the full
offload of ring processing that it only supports today.

Kind regards,
Maxime
Hi Maxime,

Thanks for your comments. Replies are inline.

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, March 17, 2020 5:54 PM
> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for
> vhost-user PMD
>
> First, it seems really complex for the user to get the command line
> right. There is no mention in the doc patch on how to bind the DMAs to
> the DPDK application. Are all the DMAs on the system capable of doing
> it?

DMA engines in Intel CPUs are able to move data within system memory.
Currently, we have I/OAT, and we will have DSA in the future.

> I think it should be made transparent to the user, who should not have
> to specify the DMA device address on the command line. The user should
> just pass a devarg specifying he wants to use DMAs, if available.

How do you think of replacing the DMA address with specific DMA
capabilities, like "dmas=[txq0@DMACOPY]"? As I/OAT only supports data
movement, we can just provide a basic DMA copy capability now. But when
there are more DMA devices, we can add capabilities in devargs later.

> Second, it looks too vendor-specific. IMHO, we should have a DMA
> framework, so that the driver can request DMA channels based on
> capabilities.

We only have one DMA engine, I/OAT, in DPDK, and it is implemented as a
rawdev. IMO, it will be very hard to provide a generic DMA abstraction
currently. In addition, the I/OAT-specific API is called inside the
vhost-user PMD; we can replace these function calls when we have a DMA
framework in the future. Users are unaware of the changes. Does it make
sense to you?

> Also, I don't think implementing ring processing in the Vhost PMD is
> welcome; the Vhost PMD should just be a wrapper around the Vhost
> library.
>
> As IOAT is a kind of acceleration, why not implement it through the
> vDPA framework? The vDPA framework should be extended to support this
> kind of acceleration, which requires some CPU processing, as opposed to
> the full offload of ring processing that it only supports today.

The main reason for implementing the data path in the vhost PMD is to
avoid impacting the SW data path in the vhost library. Even if we
implement it as an instance of vDPA, we also have to implement the data
path in a new vdev PMD, as the DMA just accelerates memory copies and
all ring operations have to be done by the CPU. There is still the code
duplication issue.

Thanks,
Jiayu
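To make the 'dmas' devarg discussion above concrete, here is a hypothetical parser sketch in plain C. The "[txq<N>@<addr>]" format, the struct, and all names are assumptions for illustration only; the real patch defines its own devarg syntax and parsing code.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical binding parsed from a devarg such as
 * "dmas=[txq0@00:04.0]": a Tx queue index plus the address of the
 * I/OAT device explicitly assigned to it (one device per queue). */
struct dma_binding {
    unsigned int txq;  /* Tx queue index */
    char addr[32];     /* DMA device address, e.g. a PCI BDF */
};

/* Parse the value part of the devarg, i.e. "[txq<N>@<addr>]".
 * Returns 0 on success, -1 on malformed input. */
static int parse_dmas_devarg(const char *value, struct dma_binding *b)
{
    /* %31[^]] reads the address up to the closing bracket. */
    if (sscanf(value, "[txq%u@%31[^]]]", &b->txq, b->addr) != 2)
        return -1;
    return 0;
}
```

A capability-based variant as discussed in the thread ("txq0@DMACOPY") would parse the same way, with the address field carrying a capability name instead of a device address.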
Hi Jiayu,

On 3/19/20 8:33 AM, Hu, Jiayu wrote:
> DMA engines in Intel CPUs are able to move data within system memory.
> Currently, we have I/OAT, and we will have DSA in the future.

OK, can you give me an example of how many I/OAT instances there are on
a given CPU?

> How do you think of replacing the DMA address with specific DMA
> capabilities, like "dmas=[txq0@DMACOPY]"?

"dmas=[txq0@DMACOPY]" is still too complex IMHO. We should just have a
flag to enable DMA or not (tx_dma=1 / tx_dma=0), and this would be used
for all queues, as we do for zero-copy.

> We only have one DMA engine, I/OAT, in DPDK, and it is implemented as
> a rawdev. IMO, it will be very hard to provide a generic DMA
> abstraction currently.

Having an abstraction might be hard, but it does not seem impossible.
Such a DMA abstraction has been done in the kernel for IOAT. For example:
https://lore.kernel.org/patchwork/cover/56714/

> The main reason for implementing the data path in the vhost PMD is to
> avoid impacting the SW data path in the vhost library. Even if we
> implement it as an instance of vDPA, we also have to implement the data
> path in a new vdev PMD. There is still the code duplication issue.

Ok, so what about:

Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
dequeue.

lib/librte_vhost/ioat.c would implement the dma_enqueue and dma_dequeue
callbacks for IOAT. As it will live in the vhost lib directory, it will
be easy to refactor the code to share as much as possible and so avoid
code duplication.

In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
call it instead of the SW datapath. It adds a few cycles, but this is
much more sane IMHO.

What do you think?

Thanks,
Maxime
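The callback dispatch Maxime proposes can be sketched roughly as follows. The struct name, the function names, and the signatures are all invented for illustration (the real struct virtio_net lives in lib/librte_vhost/vhost.h), and the SW datapath and DMA callback are reduced to stubs.

```c
#include <stddef.h>

struct pkt;  /* opaque stand-in for struct rte_mbuf */

typedef unsigned int (*enqueue_fn)(struct pkt **pkts, unsigned int count);

/* Hypothetical per-device state: the DMA callback is set only when a
 * DMA engine is bound to the device, mirroring the proposal that
 * lib/librte_vhost/ioat.c would install it for I/OAT. */
struct virtio_net_sketch {
    enqueue_fn dma_enqueue;  /* NULL means use the SW datapath */
};

/* Stub for the existing software datapath. */
static unsigned int sw_enqueue_burst(struct pkt **pkts, unsigned int count)
{
    (void)pkts;
    return count;  /* pretend everything was enqueued */
}

/* Stub DMA callback, standing in for the ioat.c implementation; it
 * pretends one packet could not be submitted, to make the dispatch
 * observable in a test. */
static unsigned int dma_enqueue_stub(struct pkt **pkts, unsigned int count)
{
    (void)pkts;
    return count ? count - 1 : 0;
}

/* Burst entry point: prefer the DMA callback when it is set, otherwise
 * fall back to the software datapath, as suggested in the thread. */
static unsigned int enqueue_burst(struct virtio_net_sketch *dev,
                                  struct pkt **pkts, unsigned int count)
{
    if (dev->dma_enqueue != NULL)
        return dev->dma_enqueue(pkts, count);
    return sw_enqueue_burst(pkts, count);
}
```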
Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Thursday, March 19, 2020 5:10 PM
> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for
> vhost-user PMD
>
> OK, can you give me an example of how many I/OAT instances there are on
> a given CPU?

One Xeon Platinum 8180 CPU has 8 I/OAT instances.

> "dmas=[txq0@DMACOPY]" is still too complex IMHO. We should just have a
> flag to enable DMA or not (tx_dma=1 / tx_dma=0), and this would be used
> for all queues, as we do for zero-copy.
>
> Having an abstraction might be hard, but it does not seem impossible.
> Such a DMA abstraction has been done in the kernel for IOAT. For example:
> https://lore.kernel.org/patchwork/cover/56714/
>
> Ok, so what about:
>
> Introducing a pair of callbacks in struct virtio_net for DMA enqueue
> and dequeue.
>
> lib/librte_vhost/ioat.c would implement the dma_enqueue and dma_dequeue
> callbacks for IOAT. As it will live in the vhost lib directory, it will
> be easy to refactor the code to share as much as possible and so avoid
> code duplication.
>
> In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
> call it instead of the SW datapath. It adds a few cycles, but this is
> much more sane IMHO.

The problem is that the current semantics of the rte_vhost_enqueue/dequeue
API conflict with the I/OAT-accelerated data path. To improve performance,
the I/OAT works in an asynchronous manner, where the CPU just submits copy
jobs to the I/OAT without waiting for their completion. For
rte_vhost_enqueue_burst, users cannot reuse enqueued pktmbufs when it
returns, as the I/OAT may still be using them. For rte_vhost_dequeue_burst,
users will not get incoming packets, as the I/OAT is still performing the
packet copies. As you can see, when enabling I/OAT acceleration, the
semantics of the two APIs are changed. If we keep the same API names but
change their semantics, this may confuse users, IMHO.

Thanks,
Jiayu
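The ownership issue described above can be modeled with a tiny in-flight queue: buffers handed to the engine stay unavailable to the caller from submission until a completion poll returns them. Everything here is a simplified stand-in (completions are simulated, names are invented), not the real I/OAT flow.

```c
#include <stdbool.h>

#define IN_FLIGHT_MAX 8

/* Ring of buffers currently owned by the (simulated) DMA engine. */
struct inflight {
    void *bufs[IN_FLIGHT_MAX];
    unsigned int head;  /* next job to complete */
    unsigned int tail;  /* next free submit slot */
};

/* Submit a copy job: record the buffer.  The caller must NOT reuse it
 * until it comes back from poll_completions() -- this is exactly why
 * rte_vhost_enqueue_burst's "done on return" semantics break. */
static bool submit_copy(struct inflight *q, void *buf)
{
    if (q->tail - q->head == IN_FLIGHT_MAX)
        return false;  /* ring full, caller must poll completions */
    q->bufs[q->tail++ % IN_FLIGHT_MAX] = buf;
    return true;
}

/* Completion poll: a real datapath would read the engine's completion
 * status; in this sketch every in-flight job "completes" when polled.
 * Returns the buffers whose copies are done, restoring ownership. */
static unsigned int poll_completions(struct inflight *q, void **done,
                                     unsigned int max)
{
    unsigned int n = 0;
    while (q->head != q->tail && n < max)
        done[n++] = q->bufs[q->head++ % IN_FLIGHT_MAX];
    return n;
}
```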
Hi Jiayu,

On 3/19/20 12:47 PM, Hu, Jiayu wrote:
> The problem is that the current semantics of the
> rte_vhost_enqueue/dequeue API conflict with the I/OAT-accelerated data
> path. [snip] If we keep the same API names but change their semantics,
> this may confuse users, IMHO.

Ok, so it is basically the same as zero-copy for the dequeue path, right?

If a new API is necessary, then it would be better to add it to the
Vhost library for async enqueue/dequeue. It could also be used for Tx
zero-copy, and so the sync version would save some cycles, as we could
remove the zero-copy support there.

What do you think?

I really object to implementing vring handling in the Vhost PMD; this is
the role of the Vhost library.

Thanks,
Maxime
Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Thursday, March 26, 2020 3:53 PM
> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
> Cc: Ye, Xiaolong <xiaolong.ye@intel.com>; Wang, Zhihong
> <zhihong.wang@intel.com>
> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-
> user PMD
>
> Hi Jiayu,
>
> On 3/19/20 12:47 PM, Hu, Jiayu wrote:
>
> >>
> >> Ok, so what about:
> >>
> >> Introducing a pair of callbacks in struct virtio_net for DMA enqueue
> >> and dequeue.
> >>
> >> lib/librte_vhost/ioat.c, which would implement the dma_enqueue and
> >> dma_dequeue callbacks for IOAT. As it will live in the vhost lib
> >> directory, it will be easy to refactor the code to share as much as
> >> possible and so avoid code duplication.
> >>
> >> In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
> >> call it instead of the SW datapath. It adds a few cycles, but this is
> >> much more sane IMHO.
> >
> > The problem is that the current semantics of the rte_vhost_enqueue/dequeue
> > APIs conflict with the I/OAT-accelerated data path. To improve performance,
> > the I/OAT works in an asynchronous manner, where the CPU just submits
> > copy jobs to the I/OAT without waiting for their completion. For
> > rte_vhost_enqueue_burst, users cannot reuse enqueued pktmbufs when it
> > returns, as the I/OAT may still be using them. For rte_vhost_dequeue_burst,
> > users will not get incoming packets while the I/OAT is still performing
> > packet copies. As you can see, when I/OAT acceleration is enabled, the
> > semantics of the two APIs change. If we keep the same API names but
> > change their semantics, this may confuse users, IMHO.
>
> Ok, so it is basically the same as zero-copy for the dequeue path, right?
> If a new API is necessary, then it would be better to add it to the Vhost
> library for async enqueue/dequeue.
> It could also be used for Tx zero-copy, and so the sync version would
> save some cycles, as we could remove the zero-copy support there.
>
> What do you think?

Yes, you are right. The better way is to provide a new API with asynchronous
semantics in the vhost library. In addition, the vhost library should provide
DMA operation callbacks to avoid using vendor-specific APIs. The asynchronous
API might look like rte_vhost_try_enqueue_burst() and
rte_vhost_get_completed_packets(). The first one performs the enqueue logic,
and the second one returns to users the pktmbufs whose copies have all
completed. What do you think?

Thanks,
Jiayu

> I really object to implementing vring handling in the Vhost PMD; this is
> the role of the Vhost library.
>
> Thanks,
> Maxime
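[Editor's note: the pair of calls proposed above can be sketched as follows. The function names come from the proposal in this thread but are tentative, and the "DMA engine" is modeled here by a simple in-flight array so the asynchronous calling convention (submit now, reclaim mbufs later) can be exercised without hardware; this is not the real I/OAT path.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mbuf;                          /* stand-in for struct rte_mbuf */

#define INFLIGHT_MAX 64

/* mbufs submitted to the (modeled) DMA engine but not yet reclaimable */
static struct mbuf *inflight[INFLIGHT_MAX];
static unsigned int nr_inflight;

/* Submit copy jobs without waiting for completion: the caller must NOT
 * reuse these mbufs until they come back from the completion call. */
static uint16_t
vhost_try_enqueue_burst(struct mbuf **pkts, uint16_t count)
{
    uint16_t i;

    for (i = 0; i < count && nr_inflight < INFLIGHT_MAX; i++)
        inflight[nr_inflight++] = pkts[i];

    return i;                         /* packets accepted for async copy */
}

/* Hand back mbufs whose copies have all completed; in this toy model
 * everything that was in flight completes immediately. */
static uint16_t
vhost_get_completed_packets(struct mbuf **pkts, uint16_t count)
{
    uint16_t done = 0;

    while (nr_inflight > 0 && done < count)
        pkts[done++] = inflight[--nr_inflight];

    return done;
}
```

The key semantic difference from rte_vhost_enqueue_burst() is visible in the split: a successful return from the first call does not mean the mbufs are free, only the second call transfers ownership back to the application.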
On 3/26/20 9:25 AM, Hu, Jiayu wrote:
> Hi Maxime,
>
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Thursday, March 26, 2020 3:53 PM
>> To: Hu, Jiayu <jiayu.hu@intel.com>; dev@dpdk.org
>> Cc: Ye, Xiaolong <xiaolong.ye@intel.com>; Wang, Zhihong
>> <zhihong.wang@intel.com>
>> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-
>> user PMD
>>
>> Hi Jiayu,
>>
>> On 3/19/20 12:47 PM, Hu, Jiayu wrote:
>>
>>>>
>>>> Ok, so what about:
>>>>
>>>> Introducing a pair of callbacks in struct virtio_net for DMA enqueue
>>>> and dequeue.
>>>>
>>>> lib/librte_vhost/ioat.c, which would implement the dma_enqueue and
>>>> dma_dequeue callbacks for IOAT. As it will live in the vhost lib
>>>> directory, it will be easy to refactor the code to share as much as
>>>> possible and so avoid code duplication.
>>>>
>>>> In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
>>>> call it instead of the SW datapath. It adds a few cycles, but this is
>>>> much more sane IMHO.
>>>
>>> The problem is that the current semantics of the rte_vhost_enqueue/dequeue
>>> APIs conflict with the I/OAT-accelerated data path. To improve performance,
>>> the I/OAT works in an asynchronous manner, where the CPU just submits
>>> copy jobs to the I/OAT without waiting for their completion. For
>>> rte_vhost_enqueue_burst, users cannot reuse enqueued pktmbufs when it
>>> returns, as the I/OAT may still be using them. For rte_vhost_dequeue_burst,
>>> users will not get incoming packets while the I/OAT is still performing
>>> packet copies. As you can see, when I/OAT acceleration is enabled, the
>>> semantics of the two APIs change. If we keep the same API names but
>>> change their semantics, this may confuse users, IMHO.
>>
>> Ok, so it is basically the same as zero-copy for the dequeue path, right?
>> If a new API is necessary, then it would be better to add it to the Vhost
>> library for async enqueue/dequeue.
>> It could also be used for Tx zero-copy, and so the sync version would
>> save some cycles, as we could remove the zero-copy support there.
>>
>> What do you think?
>
> Yes, you are right. The better way is to provide a new API with asynchronous
> semantics in the vhost library. In addition, the vhost library should provide
> DMA operation callbacks to avoid using vendor-specific APIs. The asynchronous
> API might look like rte_vhost_try_enqueue_burst() and
> rte_vhost_get_completed_packets(). The first one performs the enqueue logic,
> and the second one returns to users the pktmbufs whose copies have all
> completed. What do you think?

That looks good to me, great!

The only thing is the naming of the API. I need to think more about it,
but that does not prevent starting work on the implementation.

Regarding the initialization, I was thinking we could introduce new flags
to rte_vhost_driver_register:
- RTE_VHOST_USER_TX_DMA
- RTE_VHOST_USER_RX_DMA

Well, only Tx can be implemented for now, but the Rx flag can be reserved.

The thing I'm not clear about is how we fall back to the sync API when no
DMA is available. Should the user still call rte_vhost_try_enqueue_burst(),
which, if there is no DMA, would call rte_vhost_enqueue_burst() directly,
and then rte_vhost_get_completed_packets() would return all the mbufs?

Thanks,
Maxime

> Thanks,
> Jiayu
>
>> I really object to implementing vring handling in the Vhost PMD; this is
>> the role of the Vhost library.
>>
>> Thanks,
>> Maxime
>
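[Editor's note: the fallback question raised above can be sketched like this. All names are illustrative (the register flag and sync path are placeholders modeled on the thread's proposal, not a real DPDK API): when no DMA engine is bound, the async entry point degenerates to the synchronous copy and every submitted mbuf is reported as completed immediately, so the caller can use one calling convention in both cases.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mbuf;                          /* stand-in for struct rte_mbuf */

static bool dma_available;            /* would reflect RTE_VHOST_USER_TX_DMA */

#define DONE_MAX 64
static struct mbuf *done_ring[DONE_MAX];
static unsigned int nr_done;

/* Placeholder for today's synchronous rte_vhost_enqueue_burst() path:
 * by the time it returns, all copies are finished. */
static uint16_t
sync_enqueue_burst(struct mbuf **pkts, uint16_t count)
{
    (void)pkts;
    return count;
}

static uint16_t
vhost_try_enqueue_burst(struct mbuf **pkts, uint16_t count)
{
    uint16_t i, n;

    if (!dma_available) {
        /* Fallback: copy synchronously, then report the mbufs as
         * completed so the async calling convention still holds. */
        n = sync_enqueue_burst(pkts, count);
        for (i = 0; i < n && nr_done < DONE_MAX; i++)
            done_ring[nr_done++] = pkts[i];
        return n;
    }

    /* The DMA path would submit copy jobs here and return at once. */
    return 0;
}

static uint16_t
vhost_get_completed_packets(struct mbuf **pkts, uint16_t count)
{
    uint16_t n = 0;

    while (nr_done > 0 && n < count)
        pkts[n++] = done_ring[--nr_done];

    return n;
}
```

With this shape, an application written against the async pair works unchanged whether or not a DMA engine was registered; only the latency of completion differs.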