[RFC] dmadev: introduce DMA device library

Message ID 1623763327-30987-1-git-send-email-fengchengwen@huawei.com (mailing list archive)
State Superseded
Delegated to: Thomas Monjalon

Checks

Context Check Description
ci/intel-Testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/checkpatch warning coding style issues

Commit Message

fengchengwen June 15, 2021, 1:22 p.m. UTC
This patch introduces 'dmadevice' which is a generic type of DMA
device.

The APIs of the dmadev library expose some generic operations which can
enable configuration and I/O with the DMA devices.

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 lib/dmadev/rte_dmadev.h     | 531 ++++++++++++++++++++++++++++++++++++++++++++
 lib/dmadev/rte_dmadev_pmd.h | 384 ++++++++++++++++++++++++++++++++
 2 files changed, 915 insertions(+)
 create mode 100644 lib/dmadev/rte_dmadev.h
 create mode 100644 lib/dmadev/rte_dmadev_pmd.h

Comments

Bruce Richardson June 15, 2021, 4:38 p.m. UTC | #1
On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> This patch introduces 'dmadevice' which is a generic type of DMA
> device.
> 
> The APIs of dmadev library exposes some generic operations which can
> enable configuration and I/O with the DMA devices.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> ---
Thanks for sending this.

Of most interest to me right now are the key data-plane APIs. While we are
still in the prototyping phase, below is a draft of what we are thinking
for the key enqueue/perform_ops/completed_ops APIs.

Some key differences I note in below vs your original RFC:
* Use of void pointers rather than iova addresses. While using iova's makes
  sense in the general case when using hardware, in that it can work with
  both physical addresses and virtual addresses, if we change the APIs to use
  void pointers instead it will still work for DPDK in VA mode, while at the
  same time allow use of software fallbacks in error cases, and also a stub
  driver that uses memcpy in the background. Finally, using iova's makes the
  APIs a lot more awkward to use with anything but mbufs or similar buffers
  where we already have a pre-computed physical address.
* Use of id values rather than user-provided handles. Allowing the user/app
  to manage the amount of data stored per operation is a better solution, I
  feel, than prescribing a certain amount of in-driver tracking. Some apps may
  not care about anything other than a job being completed, while other apps
  may have significant metadata to be tracked. Taking the user-context
  handles out of the API also makes the driver code simpler.
* I've kept a single combined API for completions, which differs from the
  separate error handling completion API you propose. I need to give the
  two function approach a bit of thought, but likely both could work. If we
  (likely) never expect failed ops, then the specifics of error handling
  should not matter that much.

For the rest, the control / setup APIs are likely to be rather
uncontroversial, I suspect. However, I think that rather than xstats APIs,
the library should first provide a set of standardized stats like ethdev
does. If driver-specific stats are needed, we can add xstats later to the
API.

Appreciate your further thoughts on this, thanks.

Regards,
/Bruce

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Enqueue a copy operation onto the DMA device
 *
 * This queues up a copy operation to be performed by hardware, but does not
 * trigger hardware to begin that operation.
 *
 * @param dev_id
 *   The dmadev device id of the DMA instance
 * @param src
 *   The source buffer
 * @param dst
 *   The destination buffer
 * @param length
 *   The length of the data to be copied
 * @return
 *   - On success, id (uint16_t) of job enqueued
 *   - On failure, negative error code
 */
static inline int
__rte_experimental
rte_dmadev_enqueue_copy(uint16_t dev_id, void * src, void * dst, unsigned int length);

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Trigger hardware to begin performing enqueued operations
 *
 * This API is used to write the "doorbell" to the hardware to trigger it
 * to begin the operations previously enqueued by e.g. rte_dmadev_enqueue_copy()
 *
 * @param dev_id
 *   The dmadev device id of the DMA instance
 * @return
 *   0 on success, negative errno on error
 */
static inline int
__rte_experimental
rte_dmadev_perform_ops(uint16_t dev_id);

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Returns details of operations that have been completed
 *
 * In the normal case of no failures in hardware performing the requested jobs,
 * the return value is the ID of the last completed operation, and
 * "num_reported_status" is 0.
 *
 * If errors have occurred, the status of "num_reported_status" (<= "max_status")
 * operations are reported in the "status" array, with the return value being
 * the ID of the last operation reported in that array.
 *
 * @param dev_id
 *   The dmadev device id of the DMA instance
 * @param max_status
 *   The number of entries which can fit in the status arrays, i.e. max number
 *   of completed operations to report.
 * @param[out] status
 *   Array to hold the status of each completed operation.
 *   A value of RTE_DMA_OP_SKIPPED implies an operation was not attempted,
 *   and any other non-zero value indicates operation failure.
 * @param[out] num_reported_status
 *   Returns the number of elements in status. If this value is returned as
 *   zero (the expected case), the status array will not have been modified
 *   by the function and need not be checked by software
 * @return
 *   On success, ID of the last completed/reported operation.
 *   Negative errno on error, with all parameters unmodified.
 */
static inline int
__rte_experimental
rte_dmadev_completed_ops(uint16_t dev_id, uint8_t max_status,
                uint32_t *status, uint8_t *num_reported_status);
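
As a rough illustration of the enqueue / doorbell / completion flow drafted
above, the sketch below backs the three calls with a memcpy stub, in the
spirit of the software fallback driver mentioned earlier. The sketch_ names,
the ring size, the single implicit device and the -EAGAIN return when nothing
has completed yet are all assumptions made for illustration; none of them are
part of the proposed API.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RING_SIZE 64 /* hypothetical queue depth */

struct sketch_job { void *src; void *dst; unsigned int len; };

static struct sketch_job sketch_ring[SKETCH_RING_SIZE];
static uint16_t sketch_enq;            /* next job id to hand out */
static uint16_t sketch_rung;           /* jobs below this id were submitted */
static uint16_t sketch_done;           /* jobs below this id were copied */
static int sketch_last = -EAGAIN;      /* id of last completed job, if any */

/* Queue a copy without starting it; returns the job id or negative errno. */
static int
sketch_enqueue_copy(void *src, void *dst, unsigned int length)
{
	if ((uint16_t)(sketch_enq - sketch_done) >= SKETCH_RING_SIZE)
		return -ENOSPC;
	sketch_ring[sketch_enq % SKETCH_RING_SIZE] =
		(struct sketch_job){ src, dst, length };
	return sketch_enq++;
}

/* "Doorbell": everything enqueued so far may now be executed. */
static int
sketch_perform_ops(void)
{
	sketch_rung = sketch_enq;
	return 0;
}

/* Emulate the hardware with memcpy and return the last completed id. */
static int
sketch_completed_ops(void)
{
	while (sketch_done != sketch_rung) {
		struct sketch_job *j = &sketch_ring[sketch_done % SKETCH_RING_SIZE];

		memcpy(j->dst, j->src, j->len);
		sketch_last = sketch_done++;
	}
	return sketch_last;
}
```

The application keeps whatever per-job metadata it needs, indexed by the
returned id, which is the point of using ids instead of user-provided handles.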
Wang, Haiyue June 16, 2021, 2:17 a.m. UTC | #2
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Chengwen Feng
> Sent: Tuesday, June 15, 2021 21:22
> To: thomas@monjalon.net; Yigit, Ferruh <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org; nipun.gupta@nxp.com; hemant.agrawal@nxp.com; maxime.coquelin@redhat.com;
> honnappa.nagarahalli@arm.com; jerinj@marvell.com; david.marchand@redhat.com; Richardson, Bruce
> <bruce.richardson@intel.com>; jerinjacobk@gmail.com
> Subject: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library
> 
> This patch introduces 'dmadevice' which is a generic type of DMA
> device.
> 
> The APIs of dmadev library exposes some generic operations which can
> enable configuration and I/O with the DMA devices.
> 
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> ---
>  lib/dmadev/rte_dmadev.h     | 531 ++++++++++++++++++++++++++++++++++++++++++++
>  lib/dmadev/rte_dmadev_pmd.h | 384 ++++++++++++++++++++++++++++++++
>  2 files changed, 915 insertions(+)
>  create mode 100644 lib/dmadev/rte_dmadev.h
>  create mode 100644 lib/dmadev/rte_dmadev_pmd.h
> 
> diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
> new file mode 100644
> index 0000000..ca7c8a8
> --- /dev/null
> +++ b/lib/dmadev/rte_dmadev.h
> @@ -0,0 +1,531 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright 2021 HiSilicon Limited.
> + */
> +
> +#ifndef _RTE_DMADEV_H_
> +#define _RTE_DMADEV_H_
> +
> +/**
> + * @file rte_dmadev.h
> + *
> + * DMA (Direct Memory Access) device APIs.
> + *
> + * Defines RTE DMA Device APIs for DMA operations and its provisioning.
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_memory.h>
> +#include <rte_errno.h>
> +#include <rte_compat.h>
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Get the total number of DMA devices that have been successfully
> + * initialised.
> + *
> + * @return
> + *   The total number of usable DMA devices.
> + */
> +__rte_experimental
> +uint16_t
> +rte_dmadev_count(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Get the device identifier for the named DMA device.
> + *
> + * @param name
> + *   DMA device name to select the DMA device identifier.
> + *
> + * @return
> + *   Returns DMA device identifier on success.
> + *   - <0: Failure to find named DMA device.
> + */
> +__rte_experimental
> +int
> +rte_dmadev_get_dev_id(const char *name);
> +

Like 'struct rte_pci_device', 'struct rte_vdev_device', and the newly introduced
'struct rte_auxiliary_device', which follow the "rte_xxx_device" naming style,
how about naming it 'struct rte_dma_device'?

The API could then be rte_dma_dev_get_dev_id ...

Just a suggestion.  ;-)


> +rte_dmadev_pmd_release(struct rte_dmadev *dev);


> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_DMADEV_PMD_H_ */
> --
> 2.8.1
Morten Brørup June 16, 2021, 7:09 a.m. UTC | #3
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, 15 June 2021 18.39
> 
> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > This patch introduces 'dmadevice' which is a generic type of DMA
> > device.
> >
> > The APIs of dmadev library exposes some generic operations which can
> > enable configuration and I/O with the DMA devices.
> >
> > Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > ---
> Thanks for sending this.
> 
> Of most interest to me right now are the key data-plane APIs. While we
> are
> still in the prototyping phase, below is a draft of what we are
> thinking
> for the key enqueue/perform_ops/completed_ops APIs.
> 
> Some key differences I note in below vs your original RFC:
> * Use of void pointers rather than iova addresses. While using iova's
> makes
>   sense in the general case when using hardware, in that it can work
> with
>   both physical addresses and virtual addresses, if we change the APIs
> to use
>   void pointers instead it will still work for DPDK in VA mode, while
> at the
>   same time allow use of software fallbacks in error cases, and also a
> stub
>   driver that uses memcpy in the background. Finally, using iova's
> makes the
>   APIs a lot more awkward to use with anything but mbufs or similar
> buffers
>   where we already have a pre-computed physical address.
> * Use of id values rather than user-provided handles. Allowing the
> user/app
>   to manage the amount of data stored per operation is a better
> solution, I
>   feel, than prescribing a certain amount of in-driver tracking. Some
> apps may
>   not care about anything other than a job being completed, while other
> apps
>   may have significant metadata to be tracked. Taking the user-context
>   handles out of the API also makes the driver code simpler.
> * I've kept a single combined API for completions, which differs from
> the
>   separate error handling completion API you propose. I need to give
> the
>   two function approach a bit of thought, but likely both could work.
> If we
>   (likely) never expect failed ops, then the specifics of error
> handling
>   should not matter that much.
> 
> For the rest, the control / setup APIs are likely to be rather
> uncontroversial, I suspect. However, I think that rather than xstats
> APIs,
> the library should first provide a set of standardized stats like
> ethdev
> does. If driver-specific stats are needed, we can add xstats later to
> the
> API.
> 
> Appreciate your further thoughts on this, thanks.
> 
> Regards,
> /Bruce

I generally agree with Bruce's points above.

I would like to share a couple of ideas for further discussion:

1. API for bulk operations.
The ability to prepare a vector of DMA operations, and then post it to the DMA driver.

2. Prepare the API for more complex DMA operations than just copy/fill.
E.g. blitter operations like "copy A bytes from the source starting at address X, to the destination starting at address Y, masked with the bytes starting at address Z, then skip B bytes at the source and C bytes at the destination, rewind the mask to the beginning of Z, and repeat D times". This is just an example.
I'm suggesting to use a "DMA operation" union structure as parameter to the command enqueue function, rather than having individual functions for each possible DMA operation.
I know I'm not the only one old enough on the mailing list to have worked with the Commodore Amiga's blitter. :-)
DPDK has lots of code using CPU vector instructions to shuffle bytes around. I can easily imagine a DMA engine doing similar jobs, possibly implemented in an FPGA or some other coprocessor.
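
To make the "DMA operation" union idea more concrete, here is a minimal
sketch of a tagged union passed to a single enqueue function. All names
(sketch_dma_op, sketch_enqueue, the op types) are hypothetical, only copy and
fill are modelled, and the operation is executed immediately in software
rather than queued to hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum sketch_dma_op_type { SKETCH_DMA_OP_COPY, SKETCH_DMA_OP_FILL };

/* One descriptor type for every operation; the tag selects the member. */
struct sketch_dma_op {
	enum sketch_dma_op_type type;
	union {
		struct { const void *src; void *dst; uint32_t len; } copy;
		struct { void *dst; uint8_t pattern; uint32_t len; } fill;
	} u;
};

/* Single entry point; a software stand-in runs the operation at once. */
static int
sketch_enqueue(const struct sketch_dma_op *op)
{
	switch (op->type) {
	case SKETCH_DMA_OP_COPY:
		memcpy(op->u.copy.dst, op->u.copy.src, op->u.copy.len);
		return 0;
	case SKETCH_DMA_OP_FILL:
		memset(op->u.fill.dst, op->u.fill.pattern, op->u.fill.len);
		return 0;
	}
	return -1; /* unknown operation type */
}
```

A more complex operation, such as the masked blit described above, would add
another member to the union without changing the enqueue signature.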

-Morten
Bruce Richardson June 16, 2021, 8:04 a.m. UTC | #4
On Wed, Jun 16, 2021 at 03:17:51AM +0100, Wang, Haiyue wrote:
> Like 'struct rte_pci_device', 'struct rte_vdev_device', and new introduced
> 'struct rte_auxiliary_device', have the "rte_xxx_device" name style,
> How about 'struct rte_dma_device' name ?

One difference is that the pci, vdev and auxiliary devices are all device
types on a bus, rather than devices in a functional class like ethdev,
rawdev, eventdev. I think what is here is fine for now - if you feel
strongly we can revisit later.
Wang, Haiyue June 16, 2021, 8:16 a.m. UTC | #5
> -----Original Message-----
> From: Richardson, Bruce <bruce.richardson@intel.com>
> Sent: Wednesday, June 16, 2021 16:05
> To: Wang, Haiyue <haiyue.wang@intel.com>
> Cc: Chengwen Feng <fengchengwen@huawei.com>; thomas@monjalon.net; Yigit, Ferruh
> <ferruh.yigit@intel.com>; dev@dpdk.org; nipun.gupta@nxp.com; hemant.agrawal@nxp.com;
> maxime.coquelin@redhat.com; honnappa.nagarahalli@arm.com; jerinj@marvell.com;
> david.marchand@redhat.com; jerinjacobk@gmail.com; Xia, Chenbo <chenbo.xia@intel.com>
> Subject: Re: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library
> 
> On Wed, Jun 16, 2021 at 03:17:51AM +0100, Wang, Haiyue wrote:
> > Like 'struct rte_pci_device', 'struct rte_vdev_device', and new introduced
> > 'struct rte_auxiliary_device', have the "rte_xxx_device" name style,
> > How about 'struct rte_dma_device' name ?
> 
> One difference is that the pci, vdev and auxiliary devices are all devices
> types on a bus, rather than devices in a functional class like ethdev,
> rawdev, eventdev. I think what is here is fine for now - if you feel

From this point of view, yes, it's fine. Thanks, Bruce.

> strongly we can revisit later.
fengchengwen June 16, 2021, 9:41 a.m. UTC | #6
On 2021/6/16 0:38, Bruce Richardson wrote:
> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
>> This patch introduces 'dmadevice' which is a generic type of DMA
>> device.
>>
>> The APIs of dmadev library exposes some generic operations which can
>> enable configuration and I/O with the DMA devices.
>>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>> ---
> Thanks for sending this.
> 
> Of most interest to me right now are the key data-plane APIs. While we are
> still in the prototyping phase, below is a draft of what we are thinking
> for the key enqueue/perform_ops/completed_ops APIs.
> 
> Some key differences I note in below vs your original RFC:
> * Use of void pointers rather than iova addresses. While using iova's makes
>   sense in the general case when using hardware, in that it can work with
>   both physical addresses and virtual addresses, if we change the APIs to use
>   void pointers instead it will still work for DPDK in VA mode, while at the
>   same time allow use of software fallbacks in error cases, and also a stub
>   driver that uses memcpy in the background. Finally, using iova's makes the
>   APIs a lot more awkward to use with anything but mbufs or similar buffers
>   where we already have a pre-computed physical address.

The iova is a hint to the application, and it is widely used in DPDK.
If we switch to void pointers, which address should be passed (iova or just va)?
This may introduce implementation dependencies.

Alternatively, the application could always pass the va and let the driver perform
address translation, but I think this translation may cost too much CPU.

> * Use of id values rather than user-provided handles. Allowing the user/app
>   to manage the amount of data stored per operation is a better solution, I
>   feel, than prescribing a certain amount of in-driver tracking. Some apps may
>   not care about anything other than a job being completed, while other apps
>   may have significant metadata to be tracked. Taking the user-context
>   handles out of the API also makes the driver code simpler.

The user-provided handle was mainly intended to simplify the application implementation;
it provides the ability to quickly locate contexts.

The "use of id values" seems similar to the dma_cookie of the Linux DMA engine framework:
the user gets a unique dma_cookie after calling dmaengine_submit(), and can then
use it to call dma_async_is_tx_complete() to get the completion status.

How about defining the copy prototype as follows:
  dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
where dma_cookie_t is an int32 that increases monotonically; a value >= 0 means the
enqueue succeeded, and a negative value means it failed.
On completion, the dmadev returns the latest completed dma_cookie, and the
application can use that dma_cookie to quickly locate its contexts.
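
One way to picture the cookie scheme described above: the driver hands out a
monotonically increasing non-negative value, and the application maps it onto
a slot in its own context table. The dma_cookie_t typedef follows the
proposal; sketch_copy, sketch_ctx, the table size and the wrap back to 0 at
INT32_MAX are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t dma_cookie_t; /* >= 0: enqueued, < 0: failure */

/* Hypothetical: power of two, at least the max number of jobs in flight. */
#define SKETCH_CTX_SLOTS 256

struct sketch_ctx { void *user_data; };

static struct sketch_ctx sketch_ctx_table[SKETCH_CTX_SLOTS];
static dma_cookie_t sketch_next_cookie;

/* Stand-in for rte_dmadev_copy(): only the cookie handling is modelled. */
static dma_cookie_t
sketch_copy(void *user_data)
{
	dma_cookie_t c = sketch_next_cookie;

	/* Assumption: wrap to 0 rather than overflow into negative values. */
	sketch_next_cookie = (c == INT32_MAX) ? 0 : c + 1;
	sketch_ctx_table[c & (SKETCH_CTX_SLOTS - 1)].user_data = user_data;
	return c;
}

/* Locate the application context that belongs to a completed cookie. */
static struct sketch_ctx *
sketch_ctx_from_cookie(dma_cookie_t c)
{
	return &sketch_ctx_table[c & (SKETCH_CTX_SLOTS - 1)];
}
```

The lookup is a mask and an array index, so the application gets the "quickly
locate contexts" property without the driver storing a handle per operation.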

> * I've kept a single combined API for completions, which differs from the
>   separate error handling completion API you propose. I need to give the
>   two function approach a bit of thought, but likely both could work. If we
>   (likely) never expect failed ops, then the specifics of error handling
>   should not matter that much.

The rte_ioat_completed_ops API is too complex, and considering that some applications
may never see a copy fail, we split it into two APIs.
That is indeed not friendly to other scenarios that always require error handling.

I prefer to use the number of completed operations as the return value rather than the ID, so
that the application can simply judge whether there are new completed operations. The
new prototype:
 uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);

1) for the normal case, which never expects failed ops:
   just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
2) for other cases:
   ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
   at this point fails <= ret <= max_status
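
A minimal sketch of how the two calling modes of the proposed
rte_dmadev_completed() could behave, using a small software list of finished
operations in place of hardware. The sketch_ names and the exact semantics
(report everything when status is NULL, cap at max_status otherwise) are
assumptions, since the proposal above does not spell them out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t dma_cookie_t;

#define SKETCH_PENDING_MAX 8

/* Hypothetical device state: status of finished but unreported ops. */
static uint32_t sketch_pending[SKETCH_PENDING_MAX]; /* 0 = ok, else failed */
static uint16_t sketch_nb_pending;
static dma_cookie_t sketch_first_cookie; /* cookie of sketch_pending[0] */

/* Returns the number of operations reported; *cookie is the last of them. */
static uint16_t
sketch_completed(dma_cookie_t *cookie, uint32_t *status,
		 uint16_t max_status, uint16_t *num_fails)
{
	uint16_t n = sketch_nb_pending, fails = 0, i;

	if (status != NULL && n > max_status)
		n = max_status;
	for (i = 0; i < n; i++) {
		if (status != NULL)
			status[i] = sketch_pending[i];
		fails += (sketch_pending[i] != 0);
	}
	if (n > 0)
		*cookie = sketch_first_cookie + n - 1;
	if (num_fails != NULL)
		*num_fails = fails;
	/* Drop the reported entries from the pending list. */
	for (i = n; i < sketch_nb_pending; i++)
		sketch_pending[i - n] = sketch_pending[i];
	sketch_nb_pending -= n;
	sketch_first_cookie += n;
	return n;
}
```

With this shape, a return value of 0 tells the caller there is nothing new
without it having to compare ids.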

> 
> For the rest, the control / setup APIs are likely to be rather
> uncontroversial, I suspect. However, I think that rather than xstats APIs,
> the library should first provide a set of standardized stats like ethdev
> does. If driver-specific stats are needed, we can add xstats later to the
> API.

Agree, will fix in v2

> 
> Appreciate your further thoughts on this, thanks.
> 
> Regards,
> /Bruce
fengchengwen June 16, 2021, 10:17 a.m. UTC | #7
On 2021/6/16 15:09, Morten Brørup wrote:
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
>> Sent: Tuesday, 15 June 2021 18.39
>>
>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
>>> This patch introduces 'dmadevice' which is a generic type of DMA
>>> device.
>>>
>>> The APIs of dmadev library exposes some generic operations which can
>>> enable configuration and I/O with the DMA devices.
>>>
>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>>> ---
>> Thanks for sending this.
>>
>> Of most interest to me right now are the key data-plane APIs. While we
>> are
>> still in the prototyping phase, below is a draft of what we are
>> thinking
>> for the key enqueue/perform_ops/completed_ops APIs.
>>
>> Some key differences I note in below vs your original RFC:
>> * Use of void pointers rather than iova addresses. While using iova's
>> makes
>>   sense in the general case when using hardware, in that it can work
>> with
>>   both physical addresses and virtual addresses, if we change the APIs
>> to use
>>   void pointers instead it will still work for DPDK in VA mode, while
>> at the
>>   same time allow use of software fallbacks in error cases, and also a
>> stub
>>   driver that uses memcpy in the background. Finally, using iova's
>> makes the
>>   APIs a lot more awkward to use with anything but mbufs or similar
>> buffers
>>   where we already have a pre-computed physical address.
>> * Use of id values rather than user-provided handles. Allowing the
>> user/app
>>   to manage the amount of data stored per operation is a better
>> solution, I
>>   feel, than prescribing a certain amount of in-driver tracking. Some
>> apps may
>>   not care about anything other than a job being completed, while other
>> apps
>>   may have significant metadata to be tracked. Taking the user-context
>>   handles out of the API also makes the driver code simpler.
>> * I've kept a single combined API for completions, which differs from
>> the
>>   separate error handling completion API you propose. I need to give
>> the
>>   two function approach a bit of thought, but likely both could work.
>> If we
>>   (likely) never expect failed ops, then the specifics of error
>> handling
>>   should not matter that much.
>>
>> For the rest, the control / setup APIs are likely to be rather
>> uncontroversial, I suspect. However, I think that rather than xstats
>> APIs,
>> the library should first provide a set of standardized stats like
>> ethdev
>> does. If driver-specific stats are needed, we can add xstats later to
>> the
>> API.
>>
>> Appreciate your further thoughts on this, thanks.
>>
>> Regards,
>> /Bruce
> 
> I generally agree with Bruce's points above.
> 
> I would like to share a couple of ideas for further discussion:
> 
> 1. API for bulk operations.
> The ability to prepare a vector of DMA operations, and then post it to the DMA driver.

We considered bulk operations and finally decided not to support them:
1. The DMA engine is not applicable to small-packet scenarios which have high PPS.
   PS: The vector approach is suitable for high PPS.
2. To support posting bulk ops, we would need to define a standard struct like rte_mbuf, and
   the application may need to initialize the struct fields and pass them as a pointer array,
   which may cost too much CPU.
3. Posting a request is simpler than processing completed operations, and the CPU write
   performance is also good. The driver could use vectors to accelerate the processing
   of completed operations.

> 
> 2. Prepare the API for more complex DMA operations than just copy/fill.
> E.g. blitter operations like "copy A bytes from the source starting at address X, to the destination starting at address Y, masked with the bytes starting at address Z, then skip B bytes at the source and C bytes at the destination, rewind the mask to the beginning of Z, and repeat D times". This is just an example.
> I'm suggesting to use a "DMA operation" union structure as parameter to the command enqueue function, rather than having individual functions for each possible DMA operation.

There are many situations in which it may be hard to define such a structure; I prefer separate APIs like copy/fill/...
PS: I saw that struct dma_device (Linux dmaengine.h) also supports various prep_xxx APIs.

> I know I'm not the only one old enough on the mailing list to have worked with the Commodore Amiga's blitter. :-)
> DPDK has lots of code using CPU vector instructions to shuffle bytes around. I can easily imagine a DMA engine doing similar jobs, possibly implemented in an FPGA or some other coprocessor.
> 
> -Morten
Morten Brørup June 16, 2021, 12:09 p.m. UTC | #8
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of fengchengwen
> Sent: Wednesday, 16 June 2021 12.17
> 
> On 2021/6/16 15:09, Morten Brørup wrote:
> > I would like to share a couple of ideas for further discussion:
> >
> > 1. API for bulk operations.
> > The ability to prepare a vector of DMA operations, and then post it to the DMA driver.
> 
> We considered bulk operations and finally decided not to support them:
> 1. The DMA engine is not suited to small-packet scenarios that have high PPS.
>    PS: Vectorization is suited to high PPS.
> 2. To support posting bulk ops, we would need to define a standard struct like rte_mbuf, and
>    the application might need to initialize the struct fields and pass them as a pointer array,
>    which may cost too much CPU.
> 3. Posting a request is simpler than processing completed operations, and CPU write
>    performance is also good. The driver could still use vector instructions to accelerate the
>    processing of completed operations.

OK. Thank you for elaborating.

> >
> > 2. Prepare the API for more complex DMA operations than just copy/fill.
> > E.g. blitter operations like "copy A bytes from the source starting at address X, to the destination starting at address Y, masked with the bytes starting at address Z, then skip B bytes at the source and C bytes at the destination, rewind the mask to the beginning of Z, and repeat D times". This is just an example.
> > I'm suggesting to use a "DMA operation" union structure as parameter to the command enqueue function, rather than having individual functions for each possible DMA operation.
> 
> There are many situations in which such a structure may be hard to define, so I prefer separate APIs like copy/fill/...
> PS: I see that struct dma_device (Linux dmaengine.h) also supports various prep_xxx APIs.

OK. Separate functions make sense if the DMA driver does not support a large variety of operations, but only copy and fill.
David Marchand June 16, 2021, 12:14 p.m. UTC | #9
On Tue, Jun 15, 2021 at 3:25 PM Chengwen Feng <fengchengwen@huawei.com> wrote:
> +
> +#define RTE_DMADEV_NAME_MAX_LEN        (64)
> +/**< @internal Max length of name of DMA PMD */
> +
> +/** @internal
> + * The data structure associated with each DMA device.
> + */
> +struct rte_dmadev {
> +       /**< Device ID for this instance */
> +       uint16_t dev_id;
> +       /**< Functions exported by PMD */
> +       const struct rte_dmadev_ops *dev_ops;
> +       /**< Device info. supplied during device initialization */
> +       struct rte_device *device;
> +       /**< Driver info. supplied by probing */
> +       const char *driver_name;
> +
> +       /**< Device name */
> +       char name[RTE_DMADEV_NAME_MAX_LEN];
> +} __rte_cache_aligned;
> +

I see no queue/channel notion.
How does a rte_dmadev object relate to a physical hw engine?
Bruce Richardson June 16, 2021, 1:06 p.m. UTC | #10
On Wed, Jun 16, 2021 at 06:17:07PM +0800, fengchengwen wrote:
> On 2021/6/16 15:09, Morten Brørup wrote:
> >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> >> Sent: Tuesday, 15 June 2021 18.39
> >>
> >> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>> This patch introduces 'dmadevice' which is a generic type of DMA
> >>> device.
> >>>
> >>> The APIs of dmadev library exposes some generic operations which can
> >>> enable configuration and I/O with the DMA devices.
> >>>
> >>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> >>> ---
> >> Thanks for sending this.
> >>
> >> Of most interest to me right now are the key data-plane APIs. While we
> >> are
> >> still in the prototyping phase, below is a draft of what we are
> >> thinking
> >> for the key enqueue/perform_ops/completed_ops APIs.
> >>
> >> Some key differences I note in below vs your original RFC:
> >> * Use of void pointers rather than iova addresses. While using iova's
> >> makes
> >>   sense in the general case when using hardware, in that it can work
> >> with
> >>   both physical addresses and virtual addresses, if we change the APIs
> >> to use
> >>   void pointers instead it will still work for DPDK in VA mode, while
> >> at the
> >>   same time allow use of software fallbacks in error cases, and also a
> >> stub
> >>   driver than uses memcpy in the background. Finally, using iova's
> >> makes the
> >>   APIs a lot more awkward to use with anything but mbufs or similar
> >> buffers
> >>   where we already have a pre-computed physical address.
> >> * Use of id values rather than user-provided handles. Allowing the
> >> user/app
> >>   to manage the amount of data stored per operation is a better
> >> solution, I
> >>   feel than proscribing a certain about of in-driver tracking. Some
> >> apps may
> >>   not care about anything other than a job being completed, while other
> >> apps
> >>   may have significant metadata to be tracked. Taking the user-context
> >>   handles out of the API also makes the driver code simpler.
> >> * I've kept a single combined API for completions, which differs from
> >> the
> >>   separate error handling completion API you propose. I need to give
> >> the
> >>   two function approach a bit of thought, but likely both could work.
> >> If we
> >>   (likely) never expect failed ops, then the specifics of error
> >> handling
> >>   should not matter that much.
> >>
> >> For the rest, the control / setup APIs are likely to be rather
> >> uncontroversial, I suspect. However, I think that rather than xstats
> >> APIs,
> >> the library should first provide a set of standardized stats like
> >> ethdev
> >> does. If driver-specific stats are needed, we can add xstats later to
> >> the
> >> API.
> >>
> >> Appreciate your further thoughts on this, thanks.
> >>
> >> Regards,
> >> /Bruce
> > 
> > I generally agree with Bruce's points above.
> > 
> > I would like to share a couple of ideas for further discussion:
> > 
> > 1. API for bulk operations.
> > The ability to prepare a vector of DMA operations, and then post it to the DMA driver.
> 
> We considered bulk operations and finally decided not to support them:
> 1. The DMA engine is not suited to small-packet scenarios that have high PPS.
>    PS: Vectorization is suited to high PPS.
> 2. To support posting bulk ops, we would need to define a standard struct like rte_mbuf, and
>    the application might need to initialize the struct fields and pass them as a pointer array,
>    which may cost too much CPU.
> 3. Posting a request is simpler than processing completed operations, and CPU write
>    performance is also good. The driver could still use vector instructions to accelerate the
>    processing of completed operations.
> 

+1 to this. We also looked previously at using bulk APIs for dma offload,
but the cost of building up the structs to pass in, only to have those
structs decomposed again inside the function was adding a lot of
unnecessary overhead. By using individual functions per op, all parameters
are passed via registers, and we can write descriptors faster from those
registers than having to do cache reads.
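
As a rough illustration of that trade-off, here is a minimal sketch in C. Everything in it (the `sketch_` names, the descriptor layout, the ring size) is hypothetical and not part of the proposed dmadev API; it omits ring-full checks. The per-op function writes a descriptor straight from its arguments, while the bulk variant first has to read each op back out of a caller-built array of structs before doing the same write.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_RING_SIZE 64 /* hypothetical ring size, must be a power of two */

/* Hypothetical hardware descriptor; real devices define their own layout. */
struct sketch_hw_desc {
	uint64_t src;
	uint64_t dst;
	uint32_t len;
	uint32_t op; /* 0 = copy in this sketch */
};

struct sketch_ring {
	struct sketch_hw_desc desc[SKETCH_RING_SIZE];
	uint16_t tail; /* next free slot, free-running counter */
};

/* Per-op style: src/dst/len arrive as arguments and go straight into the
 * descriptor. Returns the id of the enqueued op. No ring-full check here. */
static inline int
sketch_enqueue_copy(struct sketch_ring *r, const void *src, void *dst,
		uint32_t len)
{
	struct sketch_hw_desc *d = &r->desc[r->tail & (SKETCH_RING_SIZE - 1)];

	d->src = (uint64_t)(uintptr_t)src;
	d->dst = (uint64_t)(uintptr_t)dst;
	d->len = len;
	d->op = 0;
	return r->tail++;
}

/* Bulk style: the application builds an array of these first ... */
struct sketch_op {
	const void *src;
	void *dst;
	uint32_t len;
};

/* ... and the driver has to load every field back from memory before it can
 * write the very same descriptor. Returns the number of ops enqueued. */
static inline int
sketch_enqueue_bulk(struct sketch_ring *r, const struct sketch_op *ops,
		uint16_t n)
{
	for (uint16_t i = 0; i < n; i++)
		sketch_enqueue_copy(r, ops[i].src, ops[i].dst, ops[i].len);
	return n;
}
```

Both paths end up writing identical descriptors; the bulk path simply adds a store-then-load round trip through `struct sketch_op` for each operation.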

> > 
> > 2. Prepare the API for more complex DMA operations than just copy/fill.
> > E.g. blitter operations like "copy A bytes from the source starting at address X, to the destination starting at address Y, masked with the bytes starting at address Z, then skip B bytes at the source and C bytes at the destination, rewind the mask to the beginning of Z, and repeat D times". This is just an example.
> > I'm suggesting to use a "DMA operation" union structure as parameter to the command enqueue function, rather than having individual functions for each possible DMA operation.
> 
> There are many sisution which may hard to define such structure, I prefer separates API like copy/fill/...
> PS: I saw struct dma_device (Linux dmaengine.h) also support various prep_xxx API.
> 

I think the API set will be defined by what the various hardware drivers
need to support. Therefore, I think starting with a minimal set of
copy/fill is best and we can iterate from there.

> > I know I'm not the only one old enough on the mailing list to have worked with the Commodore Amiga's blitter. :-)
> > DPDK has lots of code using CPU vector instructions to shuffle bytes around. I can easily imagine a DMA engine doing similar jobs, possibly implemented in an FPGA or some other coprocessor.
> > 
> > -Morten
> > 
> > 
> > .
> > 
>
Bruce Richardson June 16, 2021, 1:11 p.m. UTC | #11
On Wed, Jun 16, 2021 at 02:14:54PM +0200, David Marchand wrote:
> On Tue, Jun 15, 2021 at 3:25 PM Chengwen Feng <fengchengwen@huawei.com> wrote:
> > +
> > +#define RTE_DMADEV_NAME_MAX_LEN        (64)
> > +/**< @internal Max length of name of DMA PMD */
> > +
> > +/** @internal
> > + * The data structure associated with each DMA device.
> > + */
> > +struct rte_dmadev {
> > +       /**< Device ID for this instance */
> > +       uint16_t dev_id;
> > +       /**< Functions exported by PMD */
> > +       const struct rte_dmadev_ops *dev_ops;
> > +       /**< Device info. supplied during device initialization */
> > +       struct rte_device *device;
> > +       /**< Driver info. supplied by probing */
> > +       const char *driver_name;
> > +
> > +       /**< Device name */
> > +       char name[RTE_DMADEV_NAME_MAX_LEN];
> > +} __rte_cache_aligned;
> > +
> 
> I see no queue/channel notion.
> How does a rte_dmadev object relate to a physical hw engine?
> 
One queue, one device.
When looking to update the ioat driver for 20.11 release when I added the
idxd part, I considered adding a queue parameter to the API to look like
one device with multiple queues. However, since each queue acts completely
independently of each other, there was no benefit to doing so. It's just
easier to have a single id to identify a device queue.
Jerin Jacob June 16, 2021, 2:37 p.m. UTC | #12
On Wed, Jun 16, 2021 at 3:47 PM fengchengwen <fengchengwen@huawei.com> wrote:
>
> On 2021/6/16 15:09, Morten Brørup wrote:
> >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> >> Sent: Tuesday, 15 June 2021 18.39
> >>
> >> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>> This patch introduces 'dmadevice' which is a generic type of DMA
> >>> device.
> >>>
> >>> The APIs of dmadev library exposes some generic operations which can
> >>> enable configuration and I/O with the DMA devices.
> >>>
> >>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> >>> ---
> >> Thanks for sending this.
> >>
> >> Of most interest to me right now are the key data-plane APIs. While we
> >> are
> >> still in the prototyping phase, below is a draft of what we are
> >> thinking
> >> for the key enqueue/perform_ops/completed_ops APIs.
> >>
> >> Some key differences I note in below vs your original RFC:
> >> * Use of void pointers rather than iova addresses. While using iova's
> >> makes
> >>   sense in the general case when using hardware, in that it can work
> >> with
> >>   both physical addresses and virtual addresses, if we change the APIs
> >> to use
> >>   void pointers instead it will still work for DPDK in VA mode, while
> >> at the
> >>   same time allow use of software fallbacks in error cases, and also a
> >> stub
> >>   driver than uses memcpy in the background. Finally, using iova's
> >> makes the
> >>   APIs a lot more awkward to use with anything but mbufs or similar
> >> buffers
> >>   where we already have a pre-computed physical address.
> >> * Use of id values rather than user-provided handles. Allowing the
> >> user/app
> >>   to manage the amount of data stored per operation is a better
> >> solution, I
> >>   feel than proscribing a certain about of in-driver tracking. Some
> >> apps may
> >>   not care about anything other than a job being completed, while other
> >> apps
> >>   may have significant metadata to be tracked. Taking the user-context
> >>   handles out of the API also makes the driver code simpler.
> >> * I've kept a single combined API for completions, which differs from
> >> the
> >>   separate error handling completion API you propose. I need to give
> >> the
> >>   two function approach a bit of thought, but likely both could work.
> >> If we
> >>   (likely) never expect failed ops, then the specifics of error
> >> handling
> >>   should not matter that much.
> >>
> >> For the rest, the control / setup APIs are likely to be rather
> >> uncontroversial, I suspect. However, I think that rather than xstats
> >> APIs,
> >> the library should first provide a set of standardized stats like
> >> ethdev
> >> does. If driver-specific stats are needed, we can add xstats later to
> >> the
> >> API.
> >>
> >> Appreciate your further thoughts on this, thanks.
> >>
> >> Regards,
> >> /Bruce
> >
> > I generally agree with Bruce's points above.
> >
> > I would like to share a couple of ideas for further discussion:


I believe some of the other requirements and comments for generic DMA will be:

1) Support for the _channel_. Each channel may have different
capabilities and functionality.
A typical case is that each channel has separate source and destination
devices, such as DMA from PCIe EP to host memory, host memory to host
memory, or PCIe EP to PCIe EP.
So we need some notion of the channel in the specification.

2) I assume the current data plane APIs are not thread-safe. Is that right?


3) The cookie scheme outlined earlier looks good to me, instead of having a
generic dequeue() API.

4) We can split rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
void * dst, unsigned int length);
into a two-stage API, where one stage is used in the fast path and the
other in the slow path.

- The slow-path API will take the channel and the other attributes for the transfer.

Example syntax would be:

struct rte_dmadev_desc {
           channel id;
           ops; // copy, xor, fill etc
           other arguments specific to dma transfer // can be set based on capability
};

rte_dmadev_desc_t rte_dmadev_prepare(uint16_t dev_id, struct rte_dmadev_desc *dec);

- The fast-path API takes the arguments that need to change per transfer, along
with the slow-path handle.

rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst, unsigned int length, rte_dmadev_desc_t desc)

This will help the driver as follows:
- The former API forms the device-specific descriptors in the slow path for a
given channel and the attributes that are fixed per transfer.
- The latter API blends the "variable" arguments, such as src and dest address,
with the descriptors created in the slow path.

The above will give better performance and is the best trade-off
between performance and per-transfer variables.
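
To make the two-stage idea concrete, a minimal sketch in C follows. All names are prefixed `sketch_` and are hypothetical, as are the control-word encoding and the descriptor layout; none of this is the actual or proposed dmadev API. The slow-path call folds the fixed attributes (channel, operation) into a pre-formed control word once, and the fast-path call only blends in the per-transfer arguments.

```c
#include <stdint.h>

enum sketch_op { SKETCH_OP_COPY = 0, SKETCH_OP_FILL = 1 };

/* Slow-path input: attributes that stay fixed across many transfers. */
struct sketch_desc_attr {
	uint16_t channel;
	enum sketch_op op;
};

/* Handle returned by the slow path: a pre-formed control word.
 * The encoding (channel in the upper 16 bits) is an assumption. */
typedef uint32_t sketch_desc_t;

/* Hypothetical device descriptor written on the fast path. */
struct sketch_hw_desc {
	uint32_t ctrl;
	uint32_t len;
	uint64_t src;
	uint64_t dst;
};

/* Slow path: build the device-specific, per-channel part once. */
static inline sketch_desc_t
sketch_prepare(const struct sketch_desc_attr *attr)
{
	return ((uint32_t)attr->channel << 16) | (uint32_t)attr->op;
}

/* Fast path: blend the per-transfer arguments with the prepared handle. */
static inline void
sketch_enqueue(struct sketch_hw_desc *slot, const void *src, void *dst,
		uint32_t len, sketch_desc_t tmpl)
{
	slot->ctrl = tmpl;
	slot->src = (uint64_t)(uintptr_t)src;
	slot->dst = (uint64_t)(uintptr_t)dst;
	slot->len = len;
}
```

With this split, the fast path does no branching on channel or operation type; that work is paid for once in `sketch_prepare()`.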
Honnappa Nagarahalli June 16, 2021, 4:48 p.m. UTC | #13
<snip>

> 
> On Wed, Jun 16, 2021 at 02:14:54PM +0200, David Marchand wrote:
> > > On Tue, Jun 15, 2021 at 3:25 PM Chengwen Feng <fengchengwen@huawei.com> wrote:
> > > +
> > > +#define RTE_DMADEV_NAME_MAX_LEN        (64)
> > > +/**< @internal Max length of name of DMA PMD */
> > > +
> > > +/** @internal
> > > + * The data structure associated with each DMA device.
> > > + */
> > > +struct rte_dmadev {
> > > +       /**< Device ID for this instance */
> > > +       uint16_t dev_id;
> > > +       /**< Functions exported by PMD */
> > > +       const struct rte_dmadev_ops *dev_ops;
> > > +       /**< Device info. supplied during device initialization */
> > > +       struct rte_device *device;
> > > +       /**< Driver info. supplied by probing */
> > > +       const char *driver_name;
> > > +
> > > +       /**< Device name */
> > > > +       char name[RTE_DMADEV_NAME_MAX_LEN];
> > > > +} __rte_cache_aligned;
> > > +
> >
> > I see no queue/channel notion.
> > How does a rte_dmadev object relate to a physical hw engine?
> >
> One queue, one device.
> When looking to update the ioat driver for 20.11 release when I added the
> idxd part, I considered adding a queue parameter to the API to look like one
> device with multiple queues. However, since each queue acts completely
> independently of each other, there was no benefit to doing so. It's just easier
> to have a single id to identify a device queue.
Does that mean the queue is multi-thread safe? Do we need queues per core to avoid locking?
Bruce Richardson June 16, 2021, 5:31 p.m. UTC | #14
On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> On 2021/6/16 0:38, Bruce Richardson wrote:
> > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >> This patch introduces 'dmadevice' which is a generic type of DMA
> >> device.
> >>
> >> The APIs of dmadev library exposes some generic operations which can
> >> enable configuration and I/O with the DMA devices.
> >>
> >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> >> ---
> > Thanks for sending this.
> > 
> > Of most interest to me right now are the key data-plane APIs. While we are
> > still in the prototyping phase, below is a draft of what we are thinking
> > for the key enqueue/perform_ops/completed_ops APIs.
> > 
> > Some key differences I note in below vs your original RFC:
> > * Use of void pointers rather than iova addresses. While using iova's makes
> >   sense in the general case when using hardware, in that it can work with
> >   both physical addresses and virtual addresses, if we change the APIs to use
> >   void pointers instead it will still work for DPDK in VA mode, while at the
> >   same time allow use of software fallbacks in error cases, and also a stub
> >   driver than uses memcpy in the background. Finally, using iova's makes the
> >   APIs a lot more awkward to use with anything but mbufs or similar buffers
> >   where we already have a pre-computed physical address.
> 
> The iova is an hint to application, and widely used in DPDK.
> If switch to void, how to pass the address (iova or just va ?)
> this may introduce implementation dependencies here.
> 
> Or always pass the va, and the driver performs address translation, and this
> translation may cost too much cpu I think.
> 

On the latter point, about driver doing address translation I would agree.
However, we probably need more discussion about the use of iova vs just
virtual addresses. My thinking on this is that if we specify the API using
iovas it will severely hurt usability of the API, since it forces the user
to take more inefficient codepaths in a large number of cases. Given a
pointer to the middle of an mbuf, one cannot just pass that straight as an
iova but must instead do a translation into an offset from the mbuf pointer and
then re-add the offset to the mbuf base address.
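
A minimal sketch of that extra step, in C. The `sketch_buf` structure is hypothetical (its two fields are only loosely modeled on what an mbuf-like buffer header carries) and the function names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical buffer header: the virtual base address of the data area and
 * the precomputed IO address of that same location. */
struct sketch_buf {
	void *buf_addr;    /* virtual address of the start of the buffer */
	uint64_t buf_iova; /* IO address of the start of the buffer */
};

/* IOVA-style API: the caller must know which buffer the pointer belongs to,
 * turn the pointer into an offset, then add the offset to the base iova. */
static inline uint64_t
sketch_ptr_to_iova(const struct sketch_buf *b, const void *p)
{
	uintptr_t off = (uintptr_t)p - (uintptr_t)b->buf_addr;

	return b->buf_iova + off;
}

/* VA-style API: with an IOMMU and VA mode, the pointer value is what the
 * hardware gets, so no buffer lookup or arithmetic is needed. */
static inline uint64_t
sketch_ptr_to_va(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}
```

The arithmetic itself is cheap; the usability cost is that `sketch_ptr_to_iova()` needs the owning buffer, which the caller may not have at hand when all it holds is a raw pointer.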

My preference therefore is to require the use of an IOMMU when using a
dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
present, DPDK will run in VA mode, allowing virtual addresses to our
hugepage memory to be sent directly to hardware. Also, when using
dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
management for the app, removing further the restrictions on what memory
can be addressed by hardware.

> > * Use of id values rather than user-provided handles. Allowing the user/app
> >   to manage the amount of data stored per operation is a better solution, I
> >   feel than proscribing a certain about of in-driver tracking. Some apps may
> >   not care about anything other than a job being completed, while other apps
> >   may have significant metadata to be tracked. Taking the user-context
> >   handles out of the API also makes the driver code simpler.
> 
> The user-provided handle was mainly used to simply application implementation,
> It provides the ability to quickly locate contexts.
> 
> The "use of id values" seem like the dma_cookie of Linux DMA engine framework,
> user will get a unique dma_cookie after calling dmaengine_submit(), and then
> could use it to call dma_async_is_tx_complete() to get completion status.
> 

Yes, the idea of the id is the same - to locate contexts. The main
difference is that if we have the driver manage contexts or pointer to
contexts, as well as giving more work to the driver, it complicates the APIs
for measuring completions. If we use an ID-based approach, where the app
maintains its own ring of contexts (if any), it avoids the need to have an
"out" parameter array for returning those contexts, which needs to be
appropriately sized. Instead we can just report that all ids up to N are
completed. [This would be similar to your suggestion that N jobs be
reported as done, in that no contexts are provided, it's just that knowing
the ID of what is completed is generally more useful than the number (which
can be obviously got by subtracting the old value)]
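
A minimal sketch of how an application might keep its own contexts under such an ID-based scheme, in C. The names, the ring size, and the use of an `int *` "done flag" as the context are all assumptions for illustration; ids are taken to be free-running 16-bit values starting at 0.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_CTX_RING 8 /* hypothetical; power of two, >= max ops in flight */

struct sketch_app {
	int *done_flag[SKETCH_CTX_RING]; /* app-owned context, indexed by id */
	uint16_t last_done;              /* newest id known to be complete */
};

static inline void
sketch_app_init(struct sketch_app *app)
{
	for (int i = 0; i < SKETCH_CTX_RING; i++)
		app->done_flag[i] = NULL;
	app->last_done = UINT16_MAX; /* so that id 0 is the first to complete */
}

/* Called after an enqueue returned `id`: remember the context for it. */
static inline void
sketch_app_track(struct sketch_app *app, uint16_t id, int *flag)
{
	app->done_flag[id & (SKETCH_CTX_RING - 1)] = flag;
}

/* The driver reported "all ids up to last_id are done". Walk the newly
 * completed ids, mark their contexts, and return how many there were. */
static inline uint16_t
sketch_app_on_completed(struct sketch_app *app, uint16_t last_id)
{
	uint16_t n = (uint16_t)(last_id - app->last_done);

	for (uint16_t i = 0; i < n; i++) {
		uint16_t id = (uint16_t)(app->last_done + 1 + i);
		int *flag = app->done_flag[id & (SKETCH_CTX_RING - 1)];

		if (flag != NULL)
			*flag = 1;
	}
	app->last_done = last_id;
	return n;
}
```

The completion call needs no output array for contexts, and the count of newly completed jobs falls out of the subtraction, as described above.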

We are still working on prototyping all this, but would hope to have a
functional example of all this soon.

> How about define the copy prototype as following:
>   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
> while the dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
> enqueue successful else fail.
> when complete the dmadev will return latest completed dma_cookie, and the
> application could use the dma_cookie to quick locate contexts.
> 

If I understand this correctly, I believe this is largely what I was
suggesting - just with the typedef for the type? In which case it obviously
looks good to me.

> > * I've kept a single combined API for completions, which differs from the
> >   separate error handling completion API you propose. I need to give the
> >   two function approach a bit of thought, but likely both could work. If we
> >   (likely) never expect failed ops, then the specifics of error handling
> >   should not matter that much.
> 
> The rte_ioat_completed_ops API is too complex, and consider some applications
> may never copy fail, so split them as two API.
> It's indeed not friendly to other scenarios that always require error handling.
> 
> I prefer use completed operations number as return value other than the ID so
> that application could simple judge whether have new completed operations, and
> the new prototype:
>  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> 
> 1) for normal case which never expect failed ops:
>    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
> 2) for other case:
>    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
>    at this point the fails <= ret <= max_status
> 
Completely agree that we need to plan for the happy-day case where all is
passing. Looking at the prototypes you have above, I am ok with returning
number of completed ops as the return value with the final completed cookie
as an "out" parameter.
For handling errors, I'm ok with what you propose above, just with one
small adjustment - I would remove the restriction that ret <= max_status.

In case of zero-failures, we can report as many ops succeeding as we like,
and even in case of failure, we can still report as many successful ops as
we like before we start filling in the status field. For example, if 32 ops
are completed, and the last one fails, we can just fill in one entry into
status, and return 32. Alternatively if the 4th last one fails we fill in 4
entries and return 32. The only requirements would be:
* fails <= max_status
* fails <= ret
* cookie holds the id of the last entry in status.

A further possible variation is to have separate "completed" and
"completed_status" APIs, where "completed_status" is as above, but
"completed" skips the final 3 parameters and returns -1 on error. In that
case the user can fall back to the completed_status call.
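
One possible reading of those rules, as a minimal sketch in C against a mock device. The `sketch_` names and the mock's layout are hypothetical, and two details are my assumptions rather than anything agreed in this thread: `num_fails` counts non-zero results, and from the first failure onwards every reported op consumes one `status` entry.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_RING 64 /* hypothetical ring size, power of two */

/* Mock device state standing in for real hardware completion records. */
struct sketch_dev {
	uint32_t result[SKETCH_RING]; /* per-op result: 0 = success */
	uint16_t hw_done;             /* ops finished by the device */
	uint16_t reported;            /* ops already reported to the app */
};

static inline uint16_t
sketch_completed(struct sketch_dev *dev, int32_t *cookie, uint32_t *status,
		uint16_t max_status, uint16_t *num_fails)
{
	uint16_t ret = 0, filled = 0, fails = 0;

	while (dev->reported != dev->hw_done) {
		uint32_t res = dev->result[dev->reported & (SKETCH_RING - 1)];

		if (filled > 0 || res != 0) {
			/* From the first failure on, each op needs a slot. */
			if (status == NULL || filled == max_status)
				break;
			status[filled++] = res;
			if (res != 0)
				fails++;
		}
		dev->reported++;
		ret++;
	}
	if (ret > 0) /* id of the last op reported by this call */
		*cookie = (int32_t)(uint16_t)(dev->reported - 1);
	if (num_fails != NULL)
		*num_fails = fails;
	return ret;
}
```

With 32 finished ops where the 4th-last failed, a call with a large enough `status` array returns 32 and fills 4 status entries, while the happy-path call (`status == NULL`) returns the 28 leading successes and stops at the failed op.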

> > 
> > For the rest, the control / setup APIs are likely to be rather
> > uncontroversial, I suspect. However, I think that rather than xstats APIs,
> > the library should first provide a set of standardized stats like ethdev
> > does. If driver-specific stats are needed, we can add xstats later to the
> > API.
> 
> Agree, will fix in v2
> 
Thanks. In parallel, we will be working on our prototype implementation
too, taking in the feedback here, and hopefully send it as an RFC soon.
Then we can look to compare and contrast and arrive at an agreed API. It
might also be worthwhile to set up a community call for all interested
parties in this API to discuss things with a more rapid turnaround. That
was done in the past for other new device class APIs that were developed,
e.g. eventdev.

Regards,
/Bruce
Jerin Jacob June 16, 2021, 6:08 p.m. UTC | #15
On Wed, Jun 16, 2021 at 11:01 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > On 2021/6/16 0:38, Bruce Richardson wrote:
> > > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > >> This patch introduces 'dmadevice' which is a generic type of DMA
> > >> device.
> > >>
> > >> The APIs of dmadev library exposes some generic operations which can
> > >> enable configuration and I/O with the DMA devices.
> > >>
> > >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > >> ---
> > > Thanks for sending this.
> > >
> > > Of most interest to me right now are the key data-plane APIs. While we are
> > > still in the prototyping phase, below is a draft of what we are thinking
> > > for the key enqueue/perform_ops/completed_ops APIs.
> > >
> > > Some key differences I note in below vs your original RFC:
> > > * Use of void pointers rather than iova addresses. While using iova's makes
> > >   sense in the general case when using hardware, in that it can work with
> > >   both physical addresses and virtual addresses, if we change the APIs to use
> > >   void pointers instead it will still work for DPDK in VA mode, while at the
> > >   same time allow use of software fallbacks in error cases, and also a stub
> > >   driver than uses memcpy in the background. Finally, using iova's makes the
> > >   APIs a lot more awkward to use with anything but mbufs or similar buffers
> > >   where we already have a pre-computed physical address.
> >
> > The iova is an hint to application, and widely used in DPDK.
> > If switch to void, how to pass the address (iova or just va ?)
> > this may introduce implementation dependencies here.
> >
> > Or always pass the va, and the driver performs address translation, and this
> > translation may cost too much cpu I think.
> >
>
> On the latter point, about driver doing address translation I would agree.
> However, we probably need more discussion about the use of iova vs just
> virtual addresses. My thinking on this is that if we specify the API using
> iovas it will severely hurt usability of the API, since it forces the user
> to take more inefficient codepaths in a large number of cases. Given a
> pointer to the middle of an mbuf, one cannot just pass that straight as an
> iova but must instead do a translation into offset from mbuf pointer and
> then readd the offset to the mbuf base address.
>
> My preference therefore is to require the use of an IOMMU when using a
> dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> present, DPDK will run in VA mode, allowing virtual addresses to our
> hugepage memory to be sent directly to hardware. Also, when using
> dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> management for the app, removing further the restrictions on what memory
> can be addressed by hardware.


One issue with keeping void * is that the memory can come from the stack or
heap, which the HW cannot really operate on. Considering the difficulty of
expressing the above constraint, IMO, iova is good (so that the contract is
clear between driver and application), or we need some other means to
express that constraint.
Bruce Richardson June 16, 2021, 7:10 p.m. UTC | #16
On Wed, Jun 16, 2021 at 04:48:59PM +0000, Honnappa Nagarahalli wrote:
> <snip>
> 
> > 
> > On Wed, Jun 16, 2021 at 02:14:54PM +0200, David Marchand wrote:
> > > > On Tue, Jun 15, 2021 at 3:25 PM Chengwen Feng <fengchengwen@huawei.com> wrote:
> > > > +
> > > > +#define RTE_DMADEV_NAME_MAX_LEN        (64)
> > > > +/**< @internal Max length of name of DMA PMD */
> > > > +
> > > > +/** @internal
> > > > + * The data structure associated with each DMA device.
> > > > + */
> > > > +struct rte_dmadev {
> > > > +       /**< Device ID for this instance */
> > > > +       uint16_t dev_id;
> > > > +       /**< Functions exported by PMD */
> > > > +       const struct rte_dmadev_ops *dev_ops;
> > > > +       /**< Device info. supplied during device initialization */
> > > > +       struct rte_device *device;
> > > > +       /**< Driver info. supplied by probing */
> > > > +       const char *driver_name;
> > > > +
> > > > +       /**< Device name */
> > > > > +       char name[RTE_DMADEV_NAME_MAX_LEN];
> > > > > +} __rte_cache_aligned;
> > > > +
> > >
> > > I see no queue/channel notion.
> > > How does a rte_dmadev object relate to a physical hw engine?
> > >
> > One queue, one device.
> > When looking to update the ioat driver for 20.11 release when I added the
> > idxd part, I considered adding a queue parameter to the API to look like one
> > device with multiple queues. However, since each queue acts completely
> > independently of each other, there was no benefit to doing so. It's just easier
> > to have a single id to identify a device queue.
> Does it mean, the queue is multi thread safe? Do we need queues per core to avoid locking?

The design is for each queue to be like the queue on a NIC, not
thread-safe. However, if the hardware supports thread-safe queues too, that
can be supported. But the API should be like other data-plane ones and be
lock free.

For the DMA devices that I am working on, the number of queues
is not very large, and in most cases each queue appears as a separate
entity, e.g. for ioat each queue/channel appears as a separate PCI ID, and
when using idxd kernel driver each queue is a separate dev node to mmap.
For other cases right now we just create one rawdev instance per queue in
software.

/Bruce
Bruce Richardson June 16, 2021, 7:13 p.m. UTC | #17
On Wed, Jun 16, 2021 at 11:38:08PM +0530, Jerin Jacob wrote:
> On Wed, Jun 16, 2021 at 11:01 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > > On 2021/6/16 0:38, Bruce Richardson wrote:
> > > > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > >> This patch introduces 'dmadevice' which is a generic type of DMA
> > > >> device.
> > > >>
> > > >> The APIs of dmadev library exposes some generic operations which can
> > > >> enable configuration and I/O with the DMA devices.
> > > >>
> > > >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > >> ---
> > > > Thanks for sending this.
> > > >
> > > > Of most interest to me right now are the key data-plane APIs. While we are
> > > > still in the prototyping phase, below is a draft of what we are thinking
> > > > for the key enqueue/perform_ops/completed_ops APIs.
> > > >
> > > > Some key differences I note in below vs your original RFC:
> > > > * Use of void pointers rather than iova addresses. While using iova's makes
> > > >   sense in the general case when using hardware, in that it can work with
> > > >   both physical addresses and virtual addresses, if we change the APIs to use
> > > >   void pointers instead it will still work for DPDK in VA mode, while at the
> > > >   same time allow use of software fallbacks in error cases, and also a stub
> > > >   driver than uses memcpy in the background. Finally, using iova's makes the
> > > >   APIs a lot more awkward to use with anything but mbufs or similar buffers
> > > >   where we already have a pre-computed physical address.
> > >
> > > The iova is an hint to application, and widely used in DPDK.
> > > If switch to void, how to pass the address (iova or just va ?)
> > > this may introduce implementation dependencies here.
> > >
> > > Or always pass the va, and the driver performs address translation, and this
> > > translation may cost too much cpu I think.
> > >
> >
> > On the latter point, about driver doing address translation I would agree.
> > However, we probably need more discussion about the use of iova vs just
> > virtual addresses. My thinking on this is that if we specify the API using
> > iovas it will severely hurt usability of the API, since it forces the user
> > to take more inefficient codepaths in a large number of cases. Given a
> > pointer to the middle of an mbuf, one cannot just pass that straight as an
> > iova but must instead do a translation into offset from mbuf pointer and
> > then readd the offset to the mbuf base address.
> >
> > My preference therefore is to require the use of an IOMMU when using a
> > dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> > present, DPDK will run in VA mode, allowing virtual addresses to our
> > hugepage memory to be sent directly to hardware. Also, when using
> > dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> > management for the app, removing further the restrictions on what memory
> > can be addressed by hardware.
> 
> 
> One issue of keeping void * is that memory can come from stack or heap .
> which HW can not really operate it on.

When a kernel driver is managing the IOMMU, all process memory can be worked
on, not just hugepage memory, so using iova is wrong in these cases.

As I previously said, using iova also prevents the creation of a pure
software dummy driver that uses memcpy in the background.

/Bruce
Jerin Jacob June 17, 2021, 7:42 a.m. UTC | #18
On Thu, Jun 17, 2021 at 12:43 AM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Wed, Jun 16, 2021 at 11:38:08PM +0530, Jerin Jacob wrote:
> > On Wed, Jun 16, 2021 at 11:01 PM Bruce Richardson
> > <bruce.richardson@intel.com> wrote:
> > >
> > > On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > > > On 2021/6/16 0:38, Bruce Richardson wrote:
> > > > > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > > >> This patch introduces 'dmadevice' which is a generic type of DMA
> > > > >> device.
> > > > >>
> > > > >> The APIs of dmadev library exposes some generic operations which can
> > > > >> enable configuration and I/O with the DMA devices.
> > > > >>
> > > > >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > > >> ---
> > > > > Thanks for sending this.
> > > > >
> > > > > Of most interest to me right now are the key data-plane APIs. While we are
> > > > > still in the prototyping phase, below is a draft of what we are thinking
> > > > > for the key enqueue/perform_ops/completed_ops APIs.
> > > > >
> > > > > Some key differences I note in below vs your original RFC:
> > > > > * Use of void pointers rather than iova addresses. While using iova's makes
> > > > >   sense in the general case when using hardware, in that it can work with
> > > > >   both physical addresses and virtual addresses, if we change the APIs to use
> > > > >   void pointers instead it will still work for DPDK in VA mode, while at the
> > > > >   same time allow use of software fallbacks in error cases, and also a stub
> > > > >   driver than uses memcpy in the background. Finally, using iova's makes the
> > > > >   APIs a lot more awkward to use with anything but mbufs or similar buffers
> > > > >   where we already have a pre-computed physical address.
> > > >
> > > > The iova is an hint to application, and widely used in DPDK.
> > > > If switch to void, how to pass the address (iova or just va ?)
> > > > this may introduce implementation dependencies here.
> > > >
> > > > Or always pass the va, and the driver performs address translation, and this
> > > > translation may cost too much cpu I think.
> > > >
> > >
> > > On the latter point, about driver doing address translation I would agree.
> > > However, we probably need more discussion about the use of iova vs just
> > > virtual addresses. My thinking on this is that if we specify the API using
> > > iovas it will severely hurt usability of the API, since it forces the user
> > > to take more inefficient codepaths in a large number of cases. Given a
> > > pointer to the middle of an mbuf, one cannot just pass that straight as an
> > > iova but must instead do a translation into offset from mbuf pointer and
> > > then readd the offset to the mbuf base address.
> > >
> > > My preference therefore is to require the use of an IOMMU when using a
> > > dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> > > present, DPDK will run in VA mode, allowing virtual addresses to our
> > > hugepage memory to be sent directly to hardware. Also, when using
> > > dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> > > management for the app, removing further the restrictions on what memory
> > > can be addressed by hardware.
> >
> >
> > One issue of keeping void * is that memory can come from stack or heap .
> > which HW can not really operate it on.
>
> when kernel driver is managing the IOMMU all process memory can be worked
> on, not just hugepage memory, so using iova is wrong in these cases.

But not for stack and heap memory. Right?

>
> As I previously said, using iova prevents the creation of a pure software
> dummy driver too using memcpy in the background.

Why? The memory allocated using rte_alloc/rte_memzone etc. can be touched by
the CPU.

Thinking more: since we need a separate function for querying the
completion status anyway, I think the completion code can be an opaque
object. Exposing the status directly may not help, as the driver needs a
"context" or "call" to translate the driver-specific completion code into a
DPDK completion code.

>
> /Bruce
Bruce Richardson June 17, 2021, 8 a.m. UTC | #19
On Thu, Jun 17, 2021 at 01:12:22PM +0530, Jerin Jacob wrote:
> On Thu, Jun 17, 2021 at 12:43 AM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Wed, Jun 16, 2021 at 11:38:08PM +0530, Jerin Jacob wrote:
> > > On Wed, Jun 16, 2021 at 11:01 PM Bruce Richardson
> > > <bruce.richardson@intel.com> wrote:
> > > >
> > > > On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > > > > On 2021/6/16 0:38, Bruce Richardson wrote:
> > > > > > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > > > >> This patch introduces 'dmadevice' which is a generic type of DMA
> > > > > >> device.
> > > > > >>
> > > > > >> The APIs of dmadev library exposes some generic operations which can
> > > > > >> enable configuration and I/O with the DMA devices.
> > > > > >>
> > > > > >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > > > >> ---
> > > > > > Thanks for sending this.
> > > > > >
> > > > > > Of most interest to me right now are the key data-plane APIs. While we are
> > > > > > still in the prototyping phase, below is a draft of what we are thinking
> > > > > > for the key enqueue/perform_ops/completed_ops APIs.
> > > > > >
> > > > > > Some key differences I note in below vs your original RFC:
> > > > > > * Use of void pointers rather than iova addresses. While using iova's makes
> > > > > >   sense in the general case when using hardware, in that it can work with
> > > > > >   both physical addresses and virtual addresses, if we change the APIs to use
> > > > > >   void pointers instead it will still work for DPDK in VA mode, while at the
> > > > > >   same time allow use of software fallbacks in error cases, and also a stub
> > > > > >   driver than uses memcpy in the background. Finally, using iova's makes the
> > > > > >   APIs a lot more awkward to use with anything but mbufs or similar buffers
> > > > > >   where we already have a pre-computed physical address.
> > > > >
> > > > > The iova is an hint to application, and widely used in DPDK.
> > > > > If switch to void, how to pass the address (iova or just va ?)
> > > > > this may introduce implementation dependencies here.
> > > > >
> > > > > Or always pass the va, and the driver performs address translation, and this
> > > > > translation may cost too much cpu I think.
> > > > >
> > > >
> > > > On the latter point, about driver doing address translation I would agree.
> > > > However, we probably need more discussion about the use of iova vs just
> > > > virtual addresses. My thinking on this is that if we specify the API using
> > > > iovas it will severely hurt usability of the API, since it forces the user
> > > > to take more inefficient codepaths in a large number of cases. Given a
> > > > pointer to the middle of an mbuf, one cannot just pass that straight as an
> > > > iova but must instead do a translation into offset from mbuf pointer and
> > > > then readd the offset to the mbuf base address.
> > > >
> > > > My preference therefore is to require the use of an IOMMU when using a
> > > > dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> > > > present, DPDK will run in VA mode, allowing virtual addresses to our
> > > > hugepage memory to be sent directly to hardware. Also, when using
> > > > dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> > > > management for the app, removing further the restrictions on what memory
> > > > can be addressed by hardware.
> > >
> > >
> > > One issue of keeping void * is that memory can come from stack or heap .
> > > which HW can not really operate it on.
> >
> > when kernel driver is managing the IOMMU all process memory can be worked
> > on, not just hugepage memory, so using iova is wrong in these cases.
> 
> But not for stack and heap memory. Right?
> 
Yes, even stack and heap can be accessed.

> >
> > As I previously said, using iova prevents the creation of a pure software
> > dummy driver too using memcpy in the background.
> 
> Why ? the memory alloced uing rte_alloc/rte_memzone etc can be touched by CPU.
> 
Yes, but it can't be accessed using a physical address, so again only VA
mode, where iovas are "void *", makes sense.
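
To make the stub-driver point concrete, here is a minimal sketch of a
software "copy" op. The names stub_iova_t and stub_dma_copy() are
hypothetical and not part of the RFC; the only assumption is that the iova
value handed in is really a virtual address (VA mode), since that is the
only case in which the CPU can dereference it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an IOVA type (DPDK uses rte_iova_t for this role). */
typedef uint64_t stub_iova_t;

/*
 * Software "DMA" copy done by the CPU.
 * Only valid when src/dst hold virtual addresses (iova == va). If they held
 * physical addresses, the casts below would produce pointers the CPU cannot
 * legally dereference, which is why a memcpy stub needs VA mode.
 */
static inline int
stub_dma_copy(stub_iova_t src, stub_iova_t dst, size_t length)
{
	memcpy((void *)(uintptr_t)dst, (const void *)(uintptr_t)src, length);
	return 0;
}
```

With a "void *" based API the casts disappear entirely and the stub is a
plain memcpy wrapper.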

> Thinking more, Since anyway, we need a separate function for knowing
> the completion status,
> I think, it can be an opaque object as the completion code. Exposing
> directly the status may not help
> . As the driver needs a "context" or "call" to change the
> driver-specific completion code to DPDK completion code.
>
I'm sorry, I didn't follow this. By completion code, you mean the status of
whether a copy job succeeded/failed?
Bruce Richardson June 17, 2021, 9:15 a.m. UTC | #20
On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:
> On Wed, Jun 16, 2021 at 3:47 PM fengchengwen <fengchengwen@huawei.com> wrote:
> >
> > On 2021/6/16 15:09, Morten Brørup wrote:
> > >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > >> Sent: Tuesday, 15 June 2021 18.39
> > >>
> > >> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > >>> This patch introduces 'dmadevice' which is a generic type of DMA
> > >>> device.
> > >>>
> > >>> The APIs of dmadev library exposes some generic operations which can
> > >>> enable configuration and I/O with the DMA devices.
> > >>>
> > >>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > >>> ---
> > >> Thanks for sending this.
> > >>
> > >> Of most interest to me right now are the key data-plane APIs. While we
> > >> are
> > >> still in the prototyping phase, below is a draft of what we are
> > >> thinking
> > >> for the key enqueue/perform_ops/completed_ops APIs.
> > >>
> > >> Some key differences I note in below vs your original RFC:
> > >> * Use of void pointers rather than iova addresses. While using iova's
> > >> makes
> > >>   sense in the general case when using hardware, in that it can work
> > >> with
> > >>   both physical addresses and virtual addresses, if we change the APIs
> > >> to use
> > >>   void pointers instead it will still work for DPDK in VA mode, while
> > >> at the
> > >>   same time allow use of software fallbacks in error cases, and also a
> > >> stub
> > >>   driver than uses memcpy in the background. Finally, using iova's
> > >> makes the
> > >>   APIs a lot more awkward to use with anything but mbufs or similar
> > >> buffers
> > >>   where we already have a pre-computed physical address.
> > >> * Use of id values rather than user-provided handles. Allowing the
> > >> user/app
> > >>   to manage the amount of data stored per operation is a better
> > >> solution, I
> > >>   feel than proscribing a certain about of in-driver tracking. Some
> > >> apps may
> > >>   not care about anything other than a job being completed, while other
> > >> apps
> > >>   may have significant metadata to be tracked. Taking the user-context
> > >>   handles out of the API also makes the driver code simpler.
> > >> * I've kept a single combined API for completions, which differs from
> > >> the
> > >>   separate error handling completion API you propose. I need to give
> > >> the
> > >>   two function approach a bit of thought, but likely both could work.
> > >> If we
> > >>   (likely) never expect failed ops, then the specifics of error
> > >> handling
> > >>   should not matter that much.
> > >>
> > >> For the rest, the control / setup APIs are likely to be rather
> > >> uncontroversial, I suspect. However, I think that rather than xstats
> > >> APIs,
> > >> the library should first provide a set of standardized stats like
> > >> ethdev
> > >> does. If driver-specific stats are needed, we can add xstats later to
> > >> the
> > >> API.
> > >>
> > >> Appreciate your further thoughts on this, thanks.
> > >>
> > >> Regards,
> > >> /Bruce
> > >
> > > I generally agree with Bruce's points above.
> > >
> > > I would like to share a couple of ideas for further discussion:
> 
> 
> I believe some of the other requirements and comments for generic DMA will be
> 
> 1) Support for the _channel_, Each channel may have different
> capabilities and functionalities.
> Typical cases are, each channel have separate source and destination
> devices like
> DMA between PCIe EP to Host memory, Host memory to Host memory, PCIe
> EP to PCIe EP.
> So we need some notion of the channel in the specification.
>

Can you share a bit more detail on what constitutes a channel in this case?
Is it equivalent to a device queue (which we are flattening to individual
devices in this API), or to a specific configuration on a queue?
 
> 2) I assume current data plane APIs are not thread-safe. Is it right?
> 
Yes.

> 
> 3) Cookie scheme outlined earlier looks good to me. Instead of having
> generic dequeue() API
> 
> 4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
> void * dst, unsigned int length);
> to two stage API like, Where one will be used in fastpath and other
> one will use used in slowpath.
> 
> - slowpath API will for take channel and take other attributes for transfer
> 
> Example syantx will be:
> 
> struct rte_dmadev_desc {
>            channel id;
>            ops ; // copy, xor, fill etc
>           other arguments specific to dma transfer // it can be set
> based on capability.
> 
> };
> 
> rte_dmadev_desc_t rte_dmadev_preprare(uint16_t dev_id,  struct
> rte_dmadev_desc *dec);
> 
> - Fastpath takes arguments that need to change per transfer along with
> slow-path handle.
> 
> rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst, unsigned
> int length,  rte_dmadev_desc_t desc)
> 
> This will help to driver to
> -Former API form the device-specific descriptors in slow path  for a
> given channel and fixed attributes per transfer
> -Later API blend "variable" arguments such as src, dest address with
> slow-path created descriptors
> 

This seems like an API for a context-aware device, where the channel is the
config data/context that is preserved across operations - is that correct?
At least from the Intel DMA accelerators side, we have no concept of this
context, and each operation is completely self-described. The location or
type of memory for copies is irrelevant, you just pass the src/dst
addresses to reference.

> The above will give better performance and is the best trade-off
> between performance and per transfer variables.

We may need to have different APIs for context-aware and context-unaware
processing, with the choice of which to use determined by capability
discovery. Given that for these DMA devices the offload cost is critical,
more so than for any other dev class I've looked at before, I'd like to
avoid having APIs with extra parameters that need to be passed about, since
that just adds extra CPU cycles to the offload.
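
As a rough sketch of the two shapes being compared (all sketch_* names are
hypothetical, and memcpy stands in for the hardware), the context-aware
variant carries a handle prepared in the slow path, while the
context-unaware variant takes only the per-transfer arguments:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum sketch_dma_op { SKETCH_DMA_OP_COPY, SKETCH_DMA_OP_FILL };

/* Fixed, per-channel attributes set up once in the slow path. */
struct sketch_dma_desc {
	uint16_t channel_id;
	enum sketch_dma_op op;
};

/* Opaque handle returned by the slow-path prepare call. */
typedef const struct sketch_dma_desc *sketch_dma_desc_t;

#define SKETCH_MAX_DESC 8
static struct sketch_dma_desc sketch_desc_table[SKETCH_MAX_DESC];
static unsigned int sketch_desc_count;

/* Slow path: store the fixed attributes and hand back a handle. */
static sketch_dma_desc_t
sketch_dma_prepare(uint16_t dev_id, const struct sketch_dma_desc *desc)
{
	(void)dev_id;
	if (desc == NULL || sketch_desc_count == SKETCH_MAX_DESC)
		return NULL;
	sketch_desc_table[sketch_desc_count] = *desc;
	return &sketch_desc_table[sketch_desc_count++];
}

/* Context-aware fast path: per-transfer arguments plus the prepared handle. */
static int
sketch_dma_enqueue(uint16_t dev_id, void *src, void *dst,
		unsigned int length, sketch_dma_desc_t desc)
{
	(void)dev_id;
	if (desc == NULL || desc->op != SKETCH_DMA_OP_COPY)
		return -1;
	memcpy(dst, src, length); /* stand-in for the hardware copy */
	return 0;
}

/* Context-unaware fast path: the op is fully self-described. */
static int
sketch_dma_enqueue_copy(uint16_t dev_id, void *src, void *dst,
		unsigned int length)
{
	(void)dev_id;
	memcpy(dst, src, length); /* stand-in for the hardware copy */
	return 0;
}
```

The extra handle argument and the check on it are the per-op cost I would
like devices without a context concept to be able to skip.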

/Bruce
fengchengwen June 17, 2021, 9:48 a.m. UTC | #21
On 2021/6/17 1:31, Bruce Richardson wrote:
> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
>> On 2021/6/16 0:38, Bruce Richardson wrote:
>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
>>>> This patch introduces 'dmadevice' which is a generic type of DMA
>>>> device.
>>>>
>>>> The APIs of dmadev library exposes some generic operations which can
>>>> enable configuration and I/O with the DMA devices.
>>>>
>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>>>> ---
>>> Thanks for sending this.
>>>
>>> Of most interest to me right now are the key data-plane APIs. While we are
>>> still in the prototyping phase, below is a draft of what we are thinking
>>> for the key enqueue/perform_ops/completed_ops APIs.
>>>
>>> Some key differences I note in below vs your original RFC:
>>> * Use of void pointers rather than iova addresses. While using iova's makes
>>>   sense in the general case when using hardware, in that it can work with
>>>   both physical addresses and virtual addresses, if we change the APIs to use
>>>   void pointers instead it will still work for DPDK in VA mode, while at the
>>>   same time allow use of software fallbacks in error cases, and also a stub
>>>   driver than uses memcpy in the background. Finally, using iova's makes the
>>>   APIs a lot more awkward to use with anything but mbufs or similar buffers
>>>   where we already have a pre-computed physical address.
>>
>> The iova is an hint to application, and widely used in DPDK.
>> If switch to void, how to pass the address (iova or just va ?)
>> this may introduce implementation dependencies here.
>>
>> Or always pass the va, and the driver performs address translation, and this
>> translation may cost too much cpu I think.
>>
> 
> On the latter point, about driver doing address translation I would agree.
> However, we probably need more discussion about the use of iova vs just
> virtual addresses. My thinking on this is that if we specify the API using
> iovas it will severely hurt usability of the API, since it forces the user
> to take more inefficient codepaths in a large number of cases. Given a
> pointer to the middle of an mbuf, one cannot just pass that straight as an
> iova but must instead do a translation into offset from mbuf pointer and
> then readd the offset to the mbuf base address.
> 
> My preference therefore is to require the use of an IOMMU when using a
> dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> present, DPDK will run in VA mode, allowing virtual addresses to our
> hugepage memory to be sent directly to hardware. Also, when using
> dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> management for the app, removing further the restrictions on what memory
> can be addressed by hardware.

Some DMA devices may not support an IOMMU, or may bypass the IOMMU by
default, so the driver may have to call rte_mem_virt2phy() to do the address
translation, but rte_mem_virt2phy() costs too many CPU cycles.

If the API is defined in terms of iova, it will work fine in both cases:
1) If the DMA device doesn't support an IOMMU or bypasses it, start the
   application with --iova-mode=pa
2) If the DMA device supports an IOMMU, both --iova-mode=pa and
   --iova-mode=va work fine

> 
>>> * Use of id values rather than user-provided handles. Allowing the user/app
>>>   to manage the amount of data stored per operation is a better solution, I
>>>   feel than proscribing a certain about of in-driver tracking. Some apps may
>>>   not care about anything other than a job being completed, while other apps
>>>   may have significant metadata to be tracked. Taking the user-context
>>>   handles out of the API also makes the driver code simpler.
>>
>> The user-provided handle was mainly used to simply application implementation,
>> It provides the ability to quickly locate contexts.
>>
>> The "use of id values" seem like the dma_cookie of Linux DMA engine framework,
>> user will get a unique dma_cookie after calling dmaengine_submit(), and then
>> could use it to call dma_async_is_tx_complete() to get completion status.
>>
> 
> Yes, the idea of the id is the same - to locate contexts. The main
> difference is that if we have the driver manage contexts or pointer to
> contexts, as well as giving more work to the driver, it complicates the APIs
> for measuring completions. If we use an ID-based approach, where the app
> maintains its own ring of contexts (if any), it avoids the need to have an
> "out" parameter array for returning those contexts, which needs to be
> appropriately sized. Instead we can just report that all ids up to N are
> completed. [This would be similar to your suggestion that N jobs be
> reported as done, in that no contexts are provided, it's just that knowing
> the ID of what is completed is generally more useful than the number (which
> can be obviously got by subtracting the old value)]
> 
> We are still working on prototyping all this, but would hope to have a
> functional example of all this soon.
> 
>> How about define the copy prototype as following:
>>   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
>> while the dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
>> enqueue successful else fail.
>> when complete the dmadev will return latest completed dma_cookie, and the
>> application could use the dma_cookie to quick locate contexts.
>>
> 
> If I understand this correctly, I believe this is largely what I was
> suggesting - just with the typedef for the type? In which case it obviously
> looks good to me.
> 
>>> * I've kept a single combined API for completions, which differs from the
>>>   separate error handling completion API you propose. I need to give the
>>>   two function approach a bit of thought, but likely both could work. If we
>>>   (likely) never expect failed ops, then the specifics of error handling
>>>   should not matter that much.
>>
>> The rte_ioat_completed_ops API is too complex, and consider some applications
>> may never copy fail, so split them as two API.
>> It's indeed not friendly to other scenarios that always require error handling.
>>
>> I prefer use completed operations number as return value other than the ID so
>> that application could simple judge whether have new completed operations, and
>> the new prototype:
>>  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
>>
>> 1) for normal case which never expect failed ops:
>>    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
>> 2) for other case:
>>    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
>>    at this point the fails <= ret <= max_status
>>
> Completely agree that we need to plan for the happy-day case where all is
> passing. Looking at the prototypes you have above, I am ok with returning
> number of completed ops as the return value with the final completed cookie
> as an "out" parameter.
> For handling errors, I'm ok with what you propose above, just with one
> small adjustment - I would remove the restriction that ret <= max_status.
> 
> In case of zero-failures, we can report as many ops succeeding as we like,
> and even in case of failure, we can still report as many successful ops as
> we like before we start filling in the status field. For example, if 32 ops
> are completed, and the last one fails, we can just fill in one entry into
> status, and return 32. Alternatively if the 4th last one fails we fill in 4
> entries and return 32. The only requirements would be:
> * fails <= max_status
> * fails <= ret
> * cookie holds the id of the last entry in status.

I think we have the same understanding.

The constraint fails <= ret <= max_status covers the following situations:
1) If max_status is 32, and there are 32 completed ops, then ret will be 32
no matter which op failed
2) If max_status is 33, and there are 32 completed ops, then ret will be 32
3) If max_status is 16, and there are 32 completed ops, then ret will be 16

and the cookie always holds the id of the last returned completed op, no
matter whether it completed successfully or failed.
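
A minimal sketch of these semantics against a mock device, to check the
three situations above. struct mock_dmadev and mock_dmadev_completed() are
hypothetical; only the parameter list follows the prototype proposed
earlier. Returning every completed op when status is NULL is an assumption
for the "never expect failed ops" case.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t dma_cookie_t;

#define MOCK_RING_SIZE 64

/* Hypothetical mock device: completed ops waiting to be reported to the app. */
struct mock_dmadev {
	dma_cookie_t next_done;          /* cookie of the first unreported op */
	uint16_t nb_done;                /* number of completed, unreported ops */
	uint32_t result[MOCK_RING_SIZE]; /* per-op status by cookie, 0 = success */
};

static struct mock_dmadev mock_dev;

static uint16_t
mock_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status,
		uint16_t max_status, uint16_t *num_fails)
{
	struct mock_dmadev *dev = &mock_dev;
	uint16_t n = dev->nb_done;
	uint16_t fails = 0;
	uint16_t i;

	(void)dev_id;
	/* With a status array, report at most max_status ops (ret <= max_status). */
	if (status != NULL && n > max_status)
		n = max_status;

	for (i = 0; i < n; i++) {
		uint32_t st = dev->result[(uint32_t)(dev->next_done + i) % MOCK_RING_SIZE];

		if (status != NULL)
			status[i] = st;
		if (st != 0)
			fails++;
	}

	/* The cookie is the id of the last returned op, whether it passed or failed. */
	if (n > 0)
		*cookie = dev->next_done + n - 1;
	dev->next_done += n;
	dev->nb_done -= n;
	if (num_fails != NULL)
		*num_fails = fails;
	return n;
}
```

With 32 completed ops and max_status of 16, two calls each return 16 and
the cookie moves from 15 to 31. The application can use the cookie (for
example cookie modulo its own ring size) to locate its context.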

> 
> A further possible variation is to have separate "completed" and
> "completed_status" APIs, where "completed_status" is as above, but
> "completed" skips the final 3 parameters and returns -1 on error. In that
> case the user can fall back to the completed_status call.
> 
>>>
>>> For the rest, the control / setup APIs are likely to be rather
>>> uncontroversial, I suspect. However, I think that rather than xstats APIs,
>>> the library should first provide a set of standardized stats like ethdev
>>> does. If driver-specific stats are needed, we can add xstats later to the
>>> API.
>>
>> Agree, will fix in v2
>>
> Thanks. In parallel, we will be working on our prototype implementation
> too, taking in the feedback here, and hopefully send it as an RFC soon.
> Then we can look to compare and contrast and arrive at an agreed API. It
> might also be worthwhile to set up a community call for all interested
> parties in this API to discuss things with a more rapid turnaround. That
> was done in the past for other new device class APIs that were developed,
> e.g. eventdev.

+1

> 
> Regards,
> /Bruce
> 
> .
>
Bruce Richardson June 17, 2021, 11:02 a.m. UTC | #22
On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
> On 2021/6/17 1:31, Bruce Richardson wrote:
> > On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> >> On 2021/6/16 0:38, Bruce Richardson wrote:
> >>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>>> This patch introduces 'dmadevice' which is a generic type of DMA
> >>>> device.
> >>>>
> >>>> The APIs of dmadev library exposes some generic operations which can
> >>>> enable configuration and I/O with the DMA devices.
> >>>>
> >>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> >>>> ---
> >>> Thanks for sending this.
> >>>
> >>> Of most interest to me right now are the key data-plane APIs. While we are
> >>> still in the prototyping phase, below is a draft of what we are thinking
> >>> for the key enqueue/perform_ops/completed_ops APIs.
> >>>
> >>> Some key differences I note in below vs your original RFC:
> >>> * Use of void pointers rather than iova addresses. While using iova's makes
> >>>   sense in the general case when using hardware, in that it can work with
> >>>   both physical addresses and virtual addresses, if we change the APIs to use
> >>>   void pointers instead it will still work for DPDK in VA mode, while at the
> >>>   same time allow use of software fallbacks in error cases, and also a stub
> >>>   driver than uses memcpy in the background. Finally, using iova's makes the
> >>>   APIs a lot more awkward to use with anything but mbufs or similar buffers
> >>>   where we already have a pre-computed physical address.
> >>
> >> The iova is an hint to application, and widely used in DPDK.
> >> If switch to void, how to pass the address (iova or just va ?)
> >> this may introduce implementation dependencies here.
> >>
> >> Or always pass the va, and the driver performs address translation, and this
> >> translation may cost too much cpu I think.
> >>
> > 
> > On the latter point, about driver doing address translation I would agree.
> > However, we probably need more discussion about the use of iova vs just
> > virtual addresses. My thinking on this is that if we specify the API using
> > iovas it will severely hurt usability of the API, since it forces the user
> > to take more inefficient codepaths in a large number of cases. Given a
> > pointer to the middle of an mbuf, one cannot just pass that straight as an
> > iova but must instead do a translation into offset from mbuf pointer and
> > then readd the offset to the mbuf base address.
> > 
> > My preference therefore is to require the use of an IOMMU when using a
> > dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> > present, DPDK will run in VA mode, allowing virtual addresses to our
> > hugepage memory to be sent directly to hardware. Also, when using
> > dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> > management for the app, removing further the restrictions on what memory
> > can be addressed by hardware.
> 
> Some DMA devices many don't support IOMMU or IOMMU bypass default, so driver may
> should call rte_mem_virt2phy() do the address translate, but the rte_mem_virt2phy()
> cost too many CPU cycles.
> 
> If the API defined as iova, it will work fine in:
> 1) If DMA don't support IOMMU or IOMMU bypass, then start application with
>    --iova-mode=pa
> 2) If DMA support IOMMU, --iova-mode=pa/va work both fine
> 

I suppose if we keep the iova as the datatype, we can just cast "void *"
pointers to that in the case that virtual addresses can be used directly. I
believe your RFC included a capability query API - "uses void * as iova" 
should probably be one of those capabilities, and that would resolve this.
If DPDK is in iova=va mode because of the presence of an iommu, all drivers
could report this capability too.
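
Something along these lines is what I have in mind, as a sketch only. The
flag name SKETCH_DMA_CAPA_VA_AS_IOVA, the info struct and the helper are all
hypothetical and would need to be fitted to whatever capability query API
the RFC ends up with.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_iova_t. */
typedef uint64_t sketch_iova_t;

/* Hypothetical capability bit: "void *" values may be passed directly as iova. */
#define SKETCH_DMA_CAPA_VA_AS_IOVA (1ULL << 0)

/* Hypothetical result of a device capability query. */
struct sketch_dma_info {
	uint64_t capa;
};

/*
 * Convert a pointer to an iova when the device allows it.
 * Returns 0 and fills *iova on success, or -1 when the caller must supply a
 * precomputed iova instead (e.g. the one stored in an mbuf).
 */
static inline int
sketch_ptr_to_iova(const struct sketch_dma_info *info, const void *ptr,
		sketch_iova_t *iova)
{
	if ((info->capa & SKETCH_DMA_CAPA_VA_AS_IOVA) == 0)
		return -1;
	*iova = (sketch_iova_t)(uintptr_t)ptr;
	return 0;
}
```

An app would query the capability once at setup and then pick the cast or
the precomputed-iova path for its fast path accordingly.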

> > 
> >>> * Use of id values rather than user-provided handles. Allowing the user/app
> >>>   to manage the amount of data stored per operation is a better solution, I
> >>>   feel than proscribing a certain about of in-driver tracking. Some apps may
> >>>   not care about anything other than a job being completed, while other apps
> >>>   may have significant metadata to be tracked. Taking the user-context
> >>>   handles out of the API also makes the driver code simpler.
> >>
> >> The user-provided handle was mainly used to simply application implementation,
> >> It provides the ability to quickly locate contexts.
> >>
> >> The "use of id values" seem like the dma_cookie of Linux DMA engine framework,
> >> user will get a unique dma_cookie after calling dmaengine_submit(), and then
> >> could use it to call dma_async_is_tx_complete() to get completion status.
> >>
> > 
> > Yes, the idea of the id is the same - to locate contexts. The main
> > difference is that if we have the driver manage contexts or pointer to
> > contexts, as well as giving more work to the driver, it complicates the APIs
> > for measuring completions. If we use an ID-based approach, where the app
> > maintains its own ring of contexts (if any), it avoids the need to have an
> > "out" parameter array for returning those contexts, which needs to be
> > appropriately sized. Instead we can just report that all ids up to N are
> > completed. [This would be similar to your suggestion that N jobs be
> > reported as done, in that no contexts are provided, it's just that knowing
> > the ID of what is completed is generally more useful than the number (which
> > can be obviously got by subtracting the old value)]
> > 
> > We are still working on prototyping all this, but would hope to have a
> > functional example of all this soon.
> > 
> >> How about define the copy prototype as following:
> >>   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
> >> while the dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
> >> enqueue successful else fail.
> >> when complete the dmadev will return latest completed dma_cookie, and the
> >> application could use the dma_cookie to quick locate contexts.
> >>
> > 
> > If I understand this correctly, I believe this is largely what I was
> > suggesting - just with the typedef for the type? In which case it obviously
> > looks good to me.
> > 
> >>> * I've kept a single combined API for completions, which differs from the
> >>>   separate error handling completion API you propose. I need to give the
> >>>   two function approach a bit of thought, but likely both could work. If we
> >>>   (likely) never expect failed ops, then the specifics of error handling
> >>>   should not matter that much.
> >>
> >> The rte_ioat_completed_ops API is too complex, and consider some applications
> >> may never copy fail, so split them as two API.
> >> It's indeed not friendly to other scenarios that always require error handling.
> >>
> >> I prefer use completed operations number as return value other than the ID so
> >> that application could simple judge whether have new completed operations, and
> >> the new prototype:
> >>  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> >>
> >> 1) for normal case which never expect failed ops:
> >>    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
> >> 2) for other case:
> >>    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
> >>    at this point the fails <= ret <= max_status
> >>
> > Completely agree that we need to plan for the happy-day case where all is
> > passing. Looking at the prototypes you have above, I am ok with returning
> > number of completed ops as the return value with the final completed cookie
> > as an "out" parameter.
> > For handling errors, I'm ok with what you propose above, just with one
> > small adjustment - I would remove the restriction that ret <= max_status.
> > 
> > In case of zero-failures, we can report as many ops succeeding as we like,
> > and even in case of failure, we can still report as many successful ops as
> > we like before we start filling in the status field. For example, if 32 ops
> > are completed, and the last one fails, we can just fill in one entry into
> > status, and return 32. Alternatively if the 4th last one fails we fill in 4
> > entries and return 32. The only requirements would be:
> > * fails <= max_status
> > * fails <= ret
> > * cookie holds the id of the last entry in status.
> 
> I think we understand the same:
> 
> The fails <= ret <= max_status include following situation:
> 1) If max_status is 32, and there are 32 completed ops, then the ret will be 32
> no matter which ops is failed
> 2) If max_status is 33, and there are 32 completed ops, then the ret will be 32
> 3) If max_status is 16, and there are 32 completed ops, then the ret will be 16
> 
> and the cookie always hold the id of the last returned completed ops, no matter
> it's completed successful or failed
> 

I actually disagree on the #3. If max_status is 16, there are 32 completed
ops, and *no failures* the ret will be 32, not 16, because we are not
returning any status entries so max_status need not apply. Keeping that
same scenario #3, depending on the number of failures and the point of
them, the return value may similarly vary, for example:
* if job #28 fails, then ret could still be 32, cookie would be the cookie
  for that job, "fails" parameter would return as 4, with status holding the
  failure of 28 plus the succeeded status of jobs 29-31, i.e. 4 elements.
* if job #5 fails, then we can't fit the status list from 5 through 31 in an
  array of 16, so "fails" == 16(max_status) and status contains the 16
  statuses starting from #5, which means that cookie contains the value for
  job #20 and ret is 21.

In other words, ignore max_status and status parameters *unless we have an
error to return*, meaning the fast-path/happy-day case works as fast as
possible. You don't need to worry about sizing your status array to be big,
and you always get back a large number of completions when available. Your
fastpath code only need check the "fails" parameter to see if status needs
to ever be consulted, and in normal case it doesn't.
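
To make the arithmetic in the scenarios above concrete, here is a minimal
software model of the combined completion call. It is only a sketch: the names
(model_completed, struct poll_result) are hypothetical and not part of any
DPDK API, job ids are assumed to start at 0, a status of 0 means success and 1
means failure, and the cookie is assumed to be the id of the last op covered
by the return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result of one poll; not part of any real DPDK API. */
struct poll_result {
	uint16_t ret;    /* number of ops reported as completed */
	uint16_t fails;  /* number of entries written to status[] */
	int32_t cookie;  /* id of the last op covered by ret, -1 if none */
};

/*
 * Model of the combined completion call. ok[i] says whether finished job i
 * succeeded. status[] is only written from the first failed job onwards,
 * so max_status limits the result only when there is an error to report.
 */
static struct poll_result
model_completed(const bool *ok, uint16_t n_done,
		uint32_t *status, uint16_t max_status)
{
	struct poll_result r = { n_done, 0, (int32_t)n_done - 1 };
	uint16_t first_fail = 0;

	while (first_fail < n_done && ok[first_fail])
		first_fail++;
	if (first_fail == n_done)
		return r; /* happy path: status and max_status are ignored */

	assert(status != NULL && max_status > 0);
	uint16_t avail = (uint16_t)(n_done - first_fail);
	r.fails = avail < max_status ? avail : max_status;
	for (uint16_t i = 0; i < r.fails; i++)
		status[i] = ok[first_fail + i] ? 0 : 1;
	r.ret = (uint16_t)(first_fail + r.fails);
	r.cookie = (int32_t)r.ret - 1;
	return r;
}
```

With 32 finished jobs and max_status of 16, this model gives ret 32 and fails 0
when nothing fails, ret 32 and fails 4 when job #28 fails, and ret 21 with
fails 16 when job #5 fails.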

If this is too complicated, maybe we can simplify a little by returning just
one failure at a time, though at the cost of making error handling slower?

rte_dmadev_completed(dev_id, &cookie, &failure_status)

In this case, we always return the number of completed ops on success,
while on failure, we return the first error code. For a single error, this
works fine, but if we get a burst of errors together, things will work
slower - which may be acceptable if errors are very rare. However, for idxd
at least, if a fence occurs after a failure, all jobs in the batch after the
fence would be skipped, which would lead to the "burst of errors" case.
Therefore, I'd prefer to have the original suggestion allowing multiple
errors to be reported at a time.

/Bruce
Bruce Richardson June 17, 2021, 2:18 p.m. UTC | #23
On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
> > On 2021/6/17 1:31, Bruce Richardson wrote:
> > > On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > >> On 2021/6/16 0:38, Bruce Richardson wrote:
> > >>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > >>>> This patch introduces 'dmadevice' which is a generic type of DMA
> > >>>> device.
> > >>>>
> > >>>> The APIs of dmadev library exposes some generic operations which can
> > >>>> enable configuration and I/O with the DMA devices.
> > >>>>
> > >>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > >>>> ---
> > >>> Thanks for sending this.
> > >>>
> > >>> Of most interest to me right now are the key data-plane APIs. While we are
> > >>> still in the prototyping phase, below is a draft of what we are thinking
> > >>> for the key enqueue/perform_ops/completed_ops APIs.
> > >>>
> > >>> Some key differences I note in below vs your original RFC:
> > >>> * Use of void pointers rather than iova addresses. While using iova's makes
> > >>>   sense in the general case when using hardware, in that it can work with
> > >>>   both physical addresses and virtual addresses, if we change the APIs to use
> > >>>   void pointers instead it will still work for DPDK in VA mode, while at the
> > >>>   same time allow use of software fallbacks in error cases, and also a stub
> > >>>   driver than uses memcpy in the background. Finally, using iova's makes the
> > >>>   APIs a lot more awkward to use with anything but mbufs or similar buffers
> > >>>   where we already have a pre-computed physical address.
> > >>
> > >> The iova is an hint to application, and widely used in DPDK.
> > >> If switch to void, how to pass the address (iova or just va ?)
> > >> this may introduce implementation dependencies here.
> > >>
> > >> Or always pass the va, and the driver performs address translation, and this
> > >> translation may cost too much cpu I think.
> > >>
> > > 
> > > On the latter point, about driver doing address translation I would agree.
> > > However, we probably need more discussion about the use of iova vs just
> > > virtual addresses. My thinking on this is that if we specify the API using
> > > iovas it will severely hurt usability of the API, since it forces the user
> > > to take more inefficient codepaths in a large number of cases. Given a
> > > pointer to the middle of an mbuf, one cannot just pass that straight as an
> > > iova but must instead do a translation into offset from mbuf pointer and
> > > then readd the offset to the mbuf base address.
> > > 
> > > My preference therefore is to require the use of an IOMMU when using a
> > > dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> > > present, DPDK will run in VA mode, allowing virtual addresses to our
> > > hugepage memory to be sent directly to hardware. Also, when using
> > > dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> > > management for the app, removing further the restrictions on what memory
> > > can be addressed by hardware.
> > 
> > Some DMA devices many don't support IOMMU or IOMMU bypass default, so driver may
> > should call rte_mem_virt2phy() do the address translate, but the rte_mem_virt2phy()
> > cost too many CPU cycles.
> > 
> > If the API defined as iova, it will work fine in:
> > 1) If DMA don't support IOMMU or IOMMU bypass, then start application with
> >    --iova-mode=pa
> > 2) If DMA support IOMMU, --iova-mode=pa/va work both fine
> > 
> 
> I suppose if we keep the iova as the datatype, we can just cast "void *"
> pointers to that in the case that virtual addresses can be used directly. I
> believe your RFC included a capability query API - "uses void * as iova" 
> should probably be one of those capabilities, and that would resolve this.
> If DPDK is in iova=va mode because of the presence of an iommu, all drivers
> could report this capability too.
> 
> > > 
> > >>> * Use of id values rather than user-provided handles. Allowing the user/app
> > >>>   to manage the amount of data stored per operation is a better solution, I
> > >>>   feel than proscribing a certain about of in-driver tracking. Some apps may
> > >>>   not care about anything other than a job being completed, while other apps
> > >>>   may have significant metadata to be tracked. Taking the user-context
> > >>>   handles out of the API also makes the driver code simpler.
> > >>
> > >> The user-provided handle was mainly used to simply application implementation,
> > >> It provides the ability to quickly locate contexts.
> > >>
> > >> The "use of id values" seem like the dma_cookie of Linux DMA engine framework,
> > >> user will get a unique dma_cookie after calling dmaengine_submit(), and then
> > >> could use it to call dma_async_is_tx_complete() to get completion status.
> > >>
> > > 
> > > Yes, the idea of the id is the same - to locate contexts. The main
> > > difference is that if we have the driver manage contexts or pointer to
> > > contexts, as well as giving more work to the driver, it complicates the APIs
> > > for measuring completions. If we use an ID-based approach, where the app
> > > maintains its own ring of contexts (if any), it avoids the need to have an
> > > "out" parameter array for returning those contexts, which needs to be
> > > appropriately sized. Instead we can just report that all ids up to N are
> > > completed. [This would be similar to your suggestion that N jobs be
> > > reported as done, in that no contexts are provided, it's just that knowing
> > > the ID of what is completed is generally more useful than the number (which
> > > can be obviously got by subtracting the old value)]
> > > 
> > > We are still working on prototyping all this, but would hope to have a
> > > functional example of all this soon.
> > > 
> > >> How about define the copy prototype as following:
> > >>   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
> > >> while the dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
> > >> enqueue successful else fail.
> > >> when complete the dmadev will return latest completed dma_cookie, and the
> > >> application could use the dma_cookie to quick locate contexts.
> > >>
> > > 
> > > If I understand this correctly, I believe this is largely what I was
> > > suggesting - just with the typedef for the type? In which case it obviously
> > > looks good to me.
> > > 
> > >>> * I've kept a single combined API for completions, which differs from the
> > >>>   separate error handling completion API you propose. I need to give the
> > >>>   two function approach a bit of thought, but likely both could work. If we
> > >>>   (likely) never expect failed ops, then the specifics of error handling
> > >>>   should not matter that much.
> > >>
> > >> The rte_ioat_completed_ops API is too complex, and consider some applications
> > >> may never copy fail, so split them as two API.
> > >> It's indeed not friendly to other scenarios that always require error handling.
> > >>
> > >> I prefer use completed operations number as return value other than the ID so
> > >> that application could simple judge whether have new completed operations, and
> > >> the new prototype:
> > >>  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> > >>
> > >> 1) for normal case which never expect failed ops:
> > >>    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
> > >> 2) for other case:
> > >>    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
> > >>    at this point the fails <= ret <= max_status
> > >>
> > > Completely agree that we need to plan for the happy-day case where all is
> > > passing. Looking at the prototypes you have above, I am ok with returning
> > > number of completed ops as the return value with the final completed cookie
> > > as an "out" parameter.
> > > For handling errors, I'm ok with what you propose above, just with one
> > > small adjustment - I would remove the restriction that ret <= max_status.
> > > 
> > > In case of zero-failures, we can report as many ops succeeding as we like,
> > > and even in case of failure, we can still report as many successful ops as
> > > we like before we start filling in the status field. For example, if 32 ops
> > > are completed, and the last one fails, we can just fill in one entry into
> > > status, and return 32. Alternatively if the 4th last one fails we fill in 4
> > > entries and return 32. The only requirements would be:
> > > * fails <= max_status
> > > * fails <= ret
> > > * cookie holds the id of the last entry in status.
> > 
> > I think we understand the same:
> > 
> > The fails <= ret <= max_status include following situation:
> > 1) If max_status is 32, and there are 32 completed ops, then the ret will be 32
> > no matter which ops is failed
> > 2) If max_status is 33, and there are 32 completed ops, then the ret will be 32
> > 3) If max_status is 16, and there are 32 completed ops, then the ret will be 16
> > 
> > and the cookie always hold the id of the last returned completed ops, no matter
> > it's completed successful or failed
> > 
> 
> I actually disagree on the #3. If max_status is 16, there are 32 completed
> ops, and *no failures* the ret will be 32, not 16, because we are not
> returning any status entries so max_status need not apply. Keeping that
> same scenario #3, depending on the number of failures and the point of
> them, the return value may similarly vary, for example:
> * if job #28 fails, then ret could still be 32, cookie would be the cookie
>   for that job, "fails" parameter would return as 4, with status holding the
>   failure of 28 plus the succeeded status of jobs 29-31, i.e. 4 elements.
> * if job #5 fails, then we can't fit the status list from 5 through 31 in an
>   array of 16, so "fails" == 16(max_status) and status contains the 16
>   statuses starting from #5, which means that cookie contains the value for
>   job #20 and ret is 21.
> 
> In other words, ignore max_status and status parameters *unless we have an
> error to return*, meaning the fast-path/happy-day case works as fast as
> possible. You don't need to worry about sizing your status array to be big,
> and you always get back a large number of completions when available. Your
> fastpath code only need check the "fails" parameter to see if status needs
> to ever be consulted, and in normal case it doesn't.
> 
> If this is too complicated, maybe we can simplify a little by returning just
> one failure at a time, though at the cost of making error handling slower?
> 
> rte_dmadev_completed(dev_id, &cookie, &failure_status)
> 
> In this case, we always return the number of completed ops on success,
> while on failure, we return the first error code. For a single error, this
> works fine, but if we get a burst of errors together, things will work
> slower - which may be acceptable if errors are very rare. However, for idxd 
> at least if a fence occurs after a failure all jobs in the batch after the
> fence would be skipped, which would lead to the "burst of errors" case.
> Therefore, I'd prefer to have the original suggestion allowing multiple
> errors to be reported at a time.
> 
> /Bruce

Apologies for self-reply, but thinking about it more, a combination of
normal-case and error-case APIs may be just simpler:

int rte_dmadev_completed(dev_id, &cookie)

returns number of items completed and cookie of last item. If there is an
error, returns all successful values up to the error entry and returns -1
on subsequent call.

int rte_dmadev_completed_status(dev_id, &cookie, max_status, status_array,
	&error_count)

this is a slower completion API which behaves like you originally said
above, returning number of completions x, 0 <= x <= max_status, with x
status values filled into array, and the number of unsuccessful values in
the error_count value.

This would allow code to be written in the application to use
rte_dmadev_completed() in the normal case, and on getting a "-1" value, use
rte_dmadev_completed_status() to get the error details. If strings of
errors might be expected, the app can continually use the
completed_status() function until error_count returns 0, and then switch
back to the faster/simpler version.

This two-function approach also allows future support for other DMA
functions such as comparison, where a status value is always required. Any
apps using that functionality would just always use the "_status" function
for completions.
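
The two-call flow described above could look roughly like the following
sketch. Everything here is an assumption for illustration: struct mock_dev and
the mock_* functions are a software stand-in, not the proposed driver
implementation, and a status of 0 is taken to mean success and 1 failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a device's list of finished jobs. */
struct mock_dev {
	const bool *ok;  /* ok[i]: did finished job i succeed? */
	uint16_t n_done; /* jobs finished by the hardware */
	uint16_t next;   /* first job not yet reported to the app */
};

/* Fast path: report successes up to the next failure; -1 if a failure is next. */
static int
mock_completed(struct mock_dev *d, int32_t *cookie)
{
	uint16_t start = d->next;

	if (start < d->n_done && !d->ok[start])
		return -1;
	while (d->next < d->n_done && d->ok[d->next])
		d->next++;
	if (d->next > start)
		*cookie = d->next - 1;
	return d->next - start;
}

/* Slow path: report up to max_status jobs, one status entry per job. */
static int
mock_completed_status(struct mock_dev *d, int32_t *cookie,
		uint16_t max_status, uint32_t *status, uint16_t *error_count)
{
	uint16_t n = 0;

	*error_count = 0;
	while (n < max_status && d->next < d->n_done) {
		bool ok = d->ok[d->next++];

		status[n++] = ok ? 0 : 1;
		if (!ok)
			(*error_count)++;
	}
	if (n > 0)
		*cookie = d->next - 1;
	return n;
}

/* Application loop: drain all finished jobs and count the failures. */
static uint16_t
drain(struct mock_dev *d, int32_t *cookie)
{
	uint32_t status[8];
	uint16_t errs, total_errs = 0;

	while (d->next < d->n_done) {
		if (mock_completed(d, cookie) >= 0)
			continue;
		/* stay on the slow path until a call reports no errors */
		do {
			mock_completed_status(d, cookie, 8, status, &errs);
			total_errs = (uint16_t)(total_errs + errs);
		} while (errs != 0);
	}
	return total_errs;
}
```

The fast path never touches a status array, so an app that sees no errors
pays nothing for the error-reporting machinery.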

/Bruce
Jerin Jacob June 18, 2021, 5:16 a.m. UTC | #24
On Thu, Jun 17, 2021 at 1:30 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Thu, Jun 17, 2021 at 01:12:22PM +0530, Jerin Jacob wrote:
> > On Thu, Jun 17, 2021 at 12:43 AM Bruce Richardson
> > <bruce.richardson@intel.com> wrote:
> > >
> > > On Wed, Jun 16, 2021 at 11:38:08PM +0530, Jerin Jacob wrote:
> > > > On Wed, Jun 16, 2021 at 11:01 PM Bruce Richardson
> > > > <bruce.richardson@intel.com> wrote:
> > > > >
> > > > > On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > > > > > On 2021/6/16 0:38, Bruce Richardson wrote:
> > > > > > > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > > > > >> This patch introduces 'dmadevice' which is a generic type of DMA
> > > > > > >> device.
> > > > > > >>
> > > > > > >> The APIs of dmadev library exposes some generic operations which can
> > > > > > >> enable configuration and I/O with the DMA devices.
> > > > > > >>
> > > > > > >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > > > > >> ---
> > > > > > > Thanks for sending this.
> > > > > > >
> > > > > > > Of most interest to me right now are the key data-plane APIs. While we are
> > > > > > > still in the prototyping phase, below is a draft of what we are thinking
> > > > > > > for the key enqueue/perform_ops/completed_ops APIs.
> > > > > > >
> > > > > > > Some key differences I note in below vs your original RFC:
> > > > > > > * Use of void pointers rather than iova addresses. While using iova's makes
> > > > > > >   sense in the general case when using hardware, in that it can work with
> > > > > > >   both physical addresses and virtual addresses, if we change the APIs to use
> > > > > > >   void pointers instead it will still work for DPDK in VA mode, while at the
> > > > > > >   same time allow use of software fallbacks in error cases, and also a stub
> > > > > > >   driver than uses memcpy in the background. Finally, using iova's makes the
> > > > > > >   APIs a lot more awkward to use with anything but mbufs or similar buffers
> > > > > > >   where we already have a pre-computed physical address.
> > > > > >
> > > > > > The iova is an hint to application, and widely used in DPDK.
> > > > > > If switch to void, how to pass the address (iova or just va ?)
> > > > > > this may introduce implementation dependencies here.
> > > > > >
> > > > > > Or always pass the va, and the driver performs address translation, and this
> > > > > > translation may cost too much cpu I think.
> > > > > >
> > > > >
> > > > > On the latter point, about driver doing address translation I would agree.
> > > > > However, we probably need more discussion about the use of iova vs just
> > > > > virtual addresses. My thinking on this is that if we specify the API using
> > > > > iovas it will severely hurt usability of the API, since it forces the user
> > > > > to take more inefficient codepaths in a large number of cases. Given a
> > > > > pointer to the middle of an mbuf, one cannot just pass that straight as an
> > > > > iova but must instead do a translation into offset from mbuf pointer and
> > > > > then readd the offset to the mbuf base address.
> > > > >
> > > > > My preference therefore is to require the use of an IOMMU when using a
> > > > > dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> > > > > present, DPDK will run in VA mode, allowing virtual addresses to our
> > > > > hugepage memory to be sent directly to hardware. Also, when using
> > > > > dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> > > > > management for the app, removing further the restrictions on what memory
> > > > > can be addressed by hardware.
> > > >
> > > >
> > > > One issue of keeping void * is that memory can come from stack or heap .
> > > > which HW can not really operate it on.
> > >
> > > when kernel driver is managing the IOMMU all process memory can be worked
> > > on, not just hugepage memory, so using iova is wrong in these cases.
> >
> > But not for stack and heap memory. Right?
> >
> Yes, even stack and heap can be accessed.

The HW device cannot, as that memory is NOT mapped to the IOMMU. It will
result in a transaction fault.

At least in octeon, the DMA HW job descriptor will have a pointer (IOVA)
which will be updated by _HW_ upon copy job completion. That memory cannot
be from the heap (malloc()) or stack, as those are not mapped by the IOMMU.


>
> > >
> > > As I previously said, using iova prevents the creation of a pure software
> > > dummy driver too using memcpy in the background.
> >
> > Why ? the memory alloced uing rte_alloc/rte_memzone etc can be touched by CPU.
> >
> Yes, but it can't be accessed using physical address, so again only VA mode
> where iova's are "void *" make sense.

I agree that it should be a physical address. My only concern is that
void * does not express that it cannot be from stack/heap. If the API says
the memory needs to be allocated by rte_alloc() or rte_memzone() etc., that
is fine with me.

Or it may be better to have a separate API to alloc the handle, so that,
based on the driver, it can be rte_alloc() or malloc(). It can be a burst
API in the slow path to get a number of status pointers.

>
> > Thinking more, Since anyway, we need a separate function for knowing
> > the completion status,
> > I think, it can be an opaque object as the completion code. Exposing
> > directly the status may not help
> > . As the driver needs a "context" or "call" to change the
> > driver-specific completion code to DPDK completion code.
> >
> I'm sorry, I didn't follow this. By completion code, you mean the status of
> whether a copy job succeeded/failed?

Yes, the status of job completion.
Jerin Jacob June 18, 2021, 5:52 a.m. UTC | #25
On Thu, Jun 17, 2021 at 2:46 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:
> > On Wed, Jun 16, 2021 at 3:47 PM fengchengwen <fengchengwen@huawei.com> wrote:
> > >
> > > On 2021/6/16 15:09, Morten Brørup wrote:
> > > >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > >> Sent: Tuesday, 15 June 2021 18.39
> > > >>
> > > >> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > >>> This patch introduces 'dmadevice' which is a generic type of DMA
> > > >>> device.
> > > >>>
> > > >>> The APIs of dmadev library exposes some generic operations which can
> > > >>> enable configuration and I/O with the DMA devices.
> > > >>>
> > > >>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > >>> ---
> > > >> Thanks for sending this.
> > > >>
> > > >> Of most interest to me right now are the key data-plane APIs. While we
> > > >> are
> > > >> still in the prototyping phase, below is a draft of what we are
> > > >> thinking
> > > >> for the key enqueue/perform_ops/completed_ops APIs.
> > > >>
> > > >> Some key differences I note in below vs your original RFC:
> > > >> * Use of void pointers rather than iova addresses. While using iova's
> > > >> makes
> > > >>   sense in the general case when using hardware, in that it can work
> > > >> with
> > > >>   both physical addresses and virtual addresses, if we change the APIs
> > > >> to use
> > > >>   void pointers instead it will still work for DPDK in VA mode, while
> > > >> at the
> > > >>   same time allow use of software fallbacks in error cases, and also a
> > > >> stub
> > > >>   driver than uses memcpy in the background. Finally, using iova's
> > > >> makes the
> > > >>   APIs a lot more awkward to use with anything but mbufs or similar
> > > >> buffers
> > > >>   where we already have a pre-computed physical address.
> > > >> * Use of id values rather than user-provided handles. Allowing the
> > > >> user/app
> > > >>   to manage the amount of data stored per operation is a better
> > > >> solution, I
> > > >>   feel than proscribing a certain about of in-driver tracking. Some
> > > >> apps may
> > > >>   not care about anything other than a job being completed, while other
> > > >> apps
> > > >>   may have significant metadata to be tracked. Taking the user-context
> > > >>   handles out of the API also makes the driver code simpler.
> > > >> * I've kept a single combined API for completions, which differs from
> > > >> the
> > > >>   separate error handling completion API you propose. I need to give
> > > >> the
> > > >>   two function approach a bit of thought, but likely both could work.
> > > >> If we
> > > >>   (likely) never expect failed ops, then the specifics of error
> > > >> handling
> > > >>   should not matter that much.
> > > >>
> > > >> For the rest, the control / setup APIs are likely to be rather
> > > >> uncontroversial, I suspect. However, I think that rather than xstats
> > > >> APIs,
> > > >> the library should first provide a set of standardized stats like
> > > >> ethdev
> > > >> does. If driver-specific stats are needed, we can add xstats later to
> > > >> the
> > > >> API.
> > > >>
> > > >> Appreciate your further thoughts on this, thanks.
> > > >>
> > > >> Regards,
> > > >> /Bruce
> > > >
> > > > I generally agree with Bruce's points above.
> > > >
> > > > I would like to share a couple of ideas for further discussion:
> >
> >
> > I believe some of the other requirements and comments for generic DMA will be
> >
> > 1) Support for the _channel_, Each channel may have different
> > capabilities and functionalities.
> > Typical cases are, each channel have separate source and destination
> > devices like
> > DMA between PCIe EP to Host memory, Host memory to Host memory, PCIe
> > EP to PCIe EP.
> > So we need some notion of the channel in the specification.
> >
>
> Can you share a bit more detail on what constitutes a channel in this case?
> Is it equivalent to a device queue (which we are flattening to individual
> devices in this API), or to a specific configuration on a queue?

It is not a queue. It is one of the attributes of a transfer.
I.e., in the same queue, a given transfer can specify a different
"source" and "destination" device,
like CPU to sound card, CPU to network card, etc.


>
> > 2) I assume current data plane APIs are not thread-safe. Is it right?
> >
> Yes.
>
> >
> > 3) Cookie scheme outlined earlier looks good to me. Instead of having
> > generic dequeue() API
> >
> > 4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
> > void * dst, unsigned int length);
> > to two stage API like, Where one will be used in fastpath and other
> > one will use used in slowpath.
> >
> > - slowpath API will for take channel and take other attributes for transfer
> >
> > Example syntax will be:
> >
> > struct rte_dmadev_desc {
> >            channel id;
> >            ops ; // copy, xor, fill etc
> >           other arguments specific to dma transfer // it can be set
> > based on capability.
> >
> > };
> >
> > rte_dmadev_desc_t rte_dmadev_preprare(uint16_t dev_id,  struct
> > rte_dmadev_desc *dec);
> >
> > - Fastpath takes arguments that need to change per transfer along with
> > slow-path handle.
> >
> > rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst, unsigned
> > int length,  rte_dmadev_desc_t desc)
> >
> > This will help to driver to
> > -Former API form the device-specific descriptors in slow path  for a
> > given channel and fixed attributes per transfer
> > -Later API blend "variable" arguments such as src, dest address with
> > slow-path created descriptors
> >
>
> This seems like an API for a context-aware device, where the channel is the
> config data/context that is preserved across operations - is that correct?
> At least from the Intel DMA accelerators side, we have no concept of this
> context, and each operation is completely self-described. The location or
> type of memory for copies is irrelevant, you just pass the src/dst
> addresses to reference.

It is not a context-aware device. Each HW job is self-described.
You can view it as different attributes of a transfer.


>
> > The above will give better performance and is the best trade-off
> > between performance and per transfer variables.
>
> We may need to have different APIs for context-aware and context-unaware
> processing, with which to use determined by the capabilities discovery.
> Given that for these DMA devices the offload cost is critical, more so than
> any other dev class I've looked at before, I'd like to avoid having APIs
> with extra parameters than need to be passed about since that just adds
> extra CPU cycles to the offload.

If the driver does not support additional attributes and/or the
application does not need them, rte_dmadev_desc_t can be NULL,
so that it won't have any cost in the datapath. I think we can go to
different API cases if we cannot abstract the problem without
performance impact. Otherwise, it will be too much pain for applications.

Just to understand, I think we need to look at the HW capabilities and how
to have a common API.
I assume HW will have some HW job descriptors which will be filled in
SW and submitted to HW.
In our HW, the job descriptor has the following main elements:

- Channel // We don't expect the application to change per transfer
- Source address - It can be scatter-gather too - Will be changed per transfer
- Destination address - It can be scatter-gather too - Will be changed
  per transfer
- Transfer length - It can be scatter-gather too - Will be changed
  per transfer
- IOVA address where HW posts job completion status PER job descriptor
  - Will be changed per transfer
- Other sideband information related to the channel // We don't expect
  the application to change per transfer
- As an option, job completion can be posted as an event to
  rte_event_queue too // We don't expect the application to change per
  transfer
@Richardson, Bruce @fengchengwen @Hemant Agrawal

Could you share the options for your HW descriptors which you are
planning to expose through the API, like above, so that we can easily
converge on the fastpath API?



>
> /Bruce
fengchengwen June 18, 2021, 8:52 a.m. UTC | #26
On 2021/6/17 22:18, Bruce Richardson wrote:
> On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
>> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
>>> On 2021/6/17 1:31, Bruce Richardson wrote:
>>>> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
>>>>> On 2021/6/16 0:38, Bruce Richardson wrote:
>>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
>>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
>>>>>>> device.
>>>>>>>
>>>>>>> The APIs of dmadev library exposes some generic operations which can
>>>>>>> enable configuration and I/O with the DMA devices.
>>>>>>>
>>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>>>>>>> ---
>>>>>> Thanks for sending this.
>>>>>>
>>>>>> Of most interest to me right now are the key data-plane APIs. While we are
>>>>>> still in the prototyping phase, below is a draft of what we are thinking
>>>>>> for the key enqueue/perform_ops/completed_ops APIs.
>>>>>>
>>>>>> Some key differences I note in below vs your original RFC:
>>>>>> * Use of void pointers rather than iova addresses. While using iova's makes
>>>>>>   sense in the general case when using hardware, in that it can work with
>>>>>>   both physical addresses and virtual addresses, if we change the APIs to use
>>>>>>   void pointers instead it will still work for DPDK in VA mode, while at the
>>>>>>   same time allow use of software fallbacks in error cases, and also a stub
>>>>>>   driver than uses memcpy in the background. Finally, using iova's makes the
>>>>>>   APIs a lot more awkward to use with anything but mbufs or similar buffers
>>>>>>   where we already have a pre-computed physical address.
>>>>>
>>>>> The iova is an hint to application, and widely used in DPDK.
>>>>> If switch to void, how to pass the address (iova or just va ?)
>>>>> this may introduce implementation dependencies here.
>>>>>
>>>>> Or always pass the va, and the driver performs address translation, and this
>>>>> translation may cost too much cpu I think.
>>>>>
>>>>
>>>> On the latter point, about driver doing address translation I would agree.
>>>> However, we probably need more discussion about the use of iova vs just
>>>> virtual addresses. My thinking on this is that if we specify the API using
>>>> iovas it will severely hurt usability of the API, since it forces the user
>>>> to take more inefficient codepaths in a large number of cases. Given a
>>>> pointer to the middle of an mbuf, one cannot just pass that straight as an
>>>> iova but must instead do a translation into offset from mbuf pointer and
>>>> then readd the offset to the mbuf base address.
>>>>
>>>> My preference therefore is to require the use of an IOMMU when using a
>>>> dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
>>>> present, DPDK will run in VA mode, allowing virtual addresses to our
>>>> hugepage memory to be sent directly to hardware. Also, when using
>>>> dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
>>>> management for the app, removing further the restrictions on what memory
>>>> can be addressed by hardware.
>>>
>>> Some DMA devices many don't support IOMMU or IOMMU bypass default, so driver may
>>> should call rte_mem_virt2phy() do the address translate, but the rte_mem_virt2phy()
>>> cost too many CPU cycles.
>>>
>>> If the API defined as iova, it will work fine in:
>>> 1) If DMA don't support IOMMU or IOMMU bypass, then start application with
>>>    --iova-mode=pa
>>> 2) If DMA support IOMMU, --iova-mode=pa/va work both fine
>>>
>>
>> I suppose if we keep the iova as the datatype, we can just cast "void *"
>> pointers to that in the case that virtual addresses can be used directly. I
>> believe your RFC included a capability query API - "uses void * as iova" 
>> should probably be one of those capabilities, and that would resolve this.
>> If DPDK is in iova=va mode because of the presence of an iommu, all drivers
>> could report this capability too.
>>
>>>>
>>>>>> * Use of id values rather than user-provided handles. Allowing the user/app
>>>>>>   to manage the amount of data stored per operation is a better solution, I
>>>>>>   feel than proscribing a certain about of in-driver tracking. Some apps may
>>>>>>   not care about anything other than a job being completed, while other apps
>>>>>>   may have significant metadata to be tracked. Taking the user-context
>>>>>>   handles out of the API also makes the driver code simpler.
>>>>>
>>>>> The user-provided handle was mainly used to simply application implementation,
>>>>> It provides the ability to quickly locate contexts.
>>>>>
>>>>> The "use of id values" seem like the dma_cookie of Linux DMA engine framework,
>>>>> user will get a unique dma_cookie after calling dmaengine_submit(), and then
>>>>> could use it to call dma_async_is_tx_complete() to get completion status.
>>>>>
>>>>
>>>> Yes, the idea of the id is the same - to locate contexts. The main
>>>> difference is that if we have the driver manage contexts or pointer to
>>>> contexts, as well as giving more work to the driver, it complicates the APIs
>>>> for measuring completions. If we use an ID-based approach, where the app
>>>> maintains its own ring of contexts (if any), it avoids the need to have an
>>>> "out" parameter array for returning those contexts, which needs to be
>>>> appropriately sized. Instead we can just report that all ids up to N are
>>>> completed. [This would be similar to your suggestion that N jobs be
>>>> reported as done, in that no contexts are provided, it's just that knowing
>>>> the ID of what is completed is generally more useful than the number (which
>>>> can be obviously got by subtracting the old value)]
>>>>
>>>> We are still working on prototyping all this, but would hope to have a
>>>> functional example of all this soon.
>>>>
>>>>> How about define the copy prototype as following:
>>>>>   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
>>>>> while the dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
>>>>> enqueue successful else fail.
>>>>> when complete the dmadev will return latest completed dma_cookie, and the
>>>>> application could use the dma_cookie to quick locate contexts.
>>>>>
>>>>
>>>> If I understand this correctly, I believe this is largely what I was
>>>> suggesting - just with the typedef for the type? In which case it obviously
>>>> looks good to me.
>>>>
>>>>>> * I've kept a single combined API for completions, which differs from the
>>>>>>   separate error handling completion API you propose. I need to give the
>>>>>>   two function approach a bit of thought, but likely both could work. If we
>>>>>>   (likely) never expect failed ops, then the specifics of error handling
>>>>>>   should not matter that much.
>>>>>
>>>>> The rte_ioat_completed_ops API is too complex, and consider some applications
>>>>> may never copy fail, so split them as two API.
>>>>> It's indeed not friendly to other scenarios that always require error handling.
>>>>>
>>>>> I prefer use completed operations number as return value other than the ID so
>>>>> that application could simple judge whether have new completed operations, and
>>>>> the new prototype:
>>>>>  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
>>>>>
>>>>> 1) for normal case which never expect failed ops:
>>>>>    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
>>>>> 2) for other case:
>>>>>    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
>>>>>    at this point the fails <= ret <= max_status
>>>>>
>>>> Completely agree that we need to plan for the happy-day case where all is
>>>> passing. Looking at the prototypes you have above, I am ok with returning
>>>> number of completed ops as the return value with the final completed cookie
>>>> as an "out" parameter.
>>>> For handling errors, I'm ok with what you propose above, just with one
>>>> small adjustment - I would remove the restriction that ret <= max_status.
>>>>
>>>> In case of zero-failures, we can report as many ops succeeding as we like,
>>>> and even in case of failure, we can still report as many successful ops as
>>>> we like before we start filling in the status field. For example, if 32 ops
>>>> are completed, and the last one fails, we can just fill in one entry into
>>>> status, and return 32. Alternatively if the 4th last one fails we fill in 4
>>>> entries and return 32. The only requirements would be:
>>>> * fails <= max_status
>>>> * fails <= ret
>>>> * cookie holds the id of the last entry in status.
>>>
>>> I think we understand the same:
>>>
>>> The fails <= ret <= max_status include following situation:
>>> 1) If max_status is 32, and there are 32 completed ops, then the ret will be 32
>>> no matter which ops is failed
>>> 2) If max_status is 33, and there are 32 completed ops, then the ret will be 32
>>> 3) If max_status is 16, and there are 32 completed ops, then the ret will be 16
>>>
>>> and the cookie always hold the id of the last returned completed ops, no matter
>>> it's completed successful or failed
>>>
>>
>> I actually disagree on the #3. If max_status is 16, there are 32 completed
>> ops, and *no failures* the ret will be 32, not 16, because we are not
>> returning any status entries so max_status need not apply. Keeping that
>> same scenario #3, depending on the number of failures and the point of
>> them, the return value may similarly vary, for example:
>> * if job #28 fails, then ret could still be 32, cookie would be the cookie
>>   for that job, "fails" parameter would return as 4, with status holding the
>>   failure of 28 plus the succeeded status of jobs 29-31, i.e. 4 elements.
>> * if job #5 fails, then we can't fit the status list from 5 though 31 in an
>>   array of 16, so "fails" == 16(max_status) and status contains the 16
>>   statuses starting from #5, which means that cookie contains the value for
>>   job #20 and ret is 21.
>>
>> In other words, ignore max_status and status parameters *unless we have an
>> error to return*, meaning the fast-path/happy-day case works as fast as
>> possible. You don't need to worry about sizing your status array to be big,
>> and you always get back a large number of completions when available. Your
>> fastpath code only need check the "fails" parameter to see if status needs
>> to ever be consulted, and in normal case it doesn't.
>>
>> If this is too complicated, maybe we can simplify a little by returning just
>> one failure at a time, though at the cost of making error handling slower?
>>
>> rte_dmadev_completed(dev_id, &cookie, &failure_status)
>>
>> In this case, we always return the number of completed ops on success,
>> while on failure, we return the first error code. For a single error, this
>> works fine, but if we get a burst of errors together, things will work
>> slower - which may be acceptable if errors are very rare. However, for idxd 
>> at least if a fence occurs after a failure all jobs in the batch after the
>> fence would be skipped, which would lead to the "burst of errors" case.
>> Therefore, I'd prefer to have the original suggestion allowing multiple
>> errors to be reported at a time.
>>
>> /Bruce
> 
> Apologies for self-reply, but thinking about it more, a combination of
> normal-case and error-case APIs may be just simpler:
> 
> int rte_dmadev_completed(dev_id, &cookie)
> 
> returns number of items completed and cookie of last item. If there is an
> error, returns all successful values up to the error entry and returns -1
> on subsequent call.
> 
> int rte_dmadev_completed_status(dev_id, &cookie, max_status, status_array,
> 	&error_count)
> 
> this is a slower completion API which behaves like you originally said
> above, returning number of completions x, 0 <= x <= max_status, with x
> status values filled into array, and the number of unsuccessful values in
> the error_count value.
> 
> This would allow code to be written in the application to use
> rte_dmadev_completed() in the normal case, and on getting a "-1" value, use
> rte_dmadev_completed_status() to get the error details. If strings of
> errors might be expected, the app can continually use the
> completed_status() function until error_count returns 0, and then switch
> back to the faster/simpler version.

This two-function approach simplifies maintenance of the status_array because we don't need to initialize it to zero.
I think it's a good trade-off between performance and rich error info (status code).

Here I'd like to discuss the 'burst size', which is widely used in DPDK applications (e.g.
NIC polling or ring en/dequeue).
Currently we don't define a maximum number of completed ops in the rte_dmadev_completed() API, so
the return value may be greater than the application's 'burst size'. The application would then
need to remember the return value and handle the remainder specially at the next poll.

Also, multiple calls to rte_dmadev_completed may be needed to check for failures, which may make
it difficult for the application to use.

So I prefer the following prototypes:
  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error)
    -- nb_cpls: the maximum number of operations to process
    -- has_error: indicates whether an error occurred
    -- return value: the number of successfully completed operations.
    -- example:
       1) If there are already 32 completed ops, the 4th failed, and nb_cpls is 32, then
          the ret will be 3 (because the 1st/2nd/3rd are OK), and has_error will be true.
       2) If there are already 32 completed ops, and all completed successfully, then the ret
          will be min(32, nb_cpls), and has_error will be false.
       3) If there are already 32 completed ops, and all of them failed, then the ret will
          be 0, and has_error will be true.
  uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_status, uint32_t *status)
    -- return value: the number of failed completed operations.

The application uses the following invocation order when polling:
  has_error = false; // could be initialized to false by the dmadev API; this needs discussion
  ret = rte_dmadev_completed(dev_id, &cookie, burst_size, &has_error);
  // process successfully completed ops:
  for (int i = 0; i < ret; i++) {
    // ...
  }
  if (unlikely(has_error)) {
    // process failed ops
    ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret, status_array);
    for (int i = 0; i < ret; i++) {
      // ...
    }
  }
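
To check that the three examples and the polling order are consistent
with each other, the semantics can be mocked in plain C without any
hardware. This is only a sketch: the mock_* names, the ring layout and
the poll_result struct are made up for illustration, and one point the
proposal leaves open is filled in by assumption, namely that the
_status call reports consecutive failed ops and stops at the first
successful one.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int32_t dma_cookie_t;

#define MOCK_RING 64

/* Hypothetical software stand-in for a device's completion ring. */
struct mock_dmadev {
	uint32_t status[MOCK_RING]; /* 0 = success, non-zero = error code */
	dma_cookie_t head;          /* next completion not yet reported */
	dma_cookie_t tail;          /* one past the last finished op */
};

/* Fast path: report successful ops only, stop at the first failure. */
static uint16_t
mock_completed(struct mock_dmadev *dev, dma_cookie_t *cookie,
	       uint16_t nb_cpls, bool *has_error)
{
	uint16_t n = 0;

	*has_error = false;
	while (n < nb_cpls && dev->head < dev->tail) {
		if (dev->status[dev->head % MOCK_RING] != 0) {
			*has_error = true;
			break;
		}
		*cookie = dev->head++;
		n++;
	}
	return n;
}

/* Slow path (assumed behaviour): report consecutive failed ops. */
static uint16_t
mock_completed_status(struct mock_dmadev *dev, dma_cookie_t *cookie,
		      uint16_t nb_status, uint32_t *status)
{
	uint16_t n = 0;

	while (n < nb_status && dev->head < dev->tail) {
		uint32_t st = dev->status[dev->head % MOCK_RING];

		if (st == 0)
			break;
		status[n++] = st;
		*cookie = dev->head++;
	}
	return n;
}

struct poll_result {
	uint16_t ok;       /* ops reported as successful */
	uint16_t failed;   /* ops reported as failed */
	dma_cookie_t last; /* last cookie reported, -1 if none */
};

/* One poll iteration, following the invocation order above. */
static struct poll_result
mock_poll(struct mock_dmadev *dev, uint16_t burst_size, uint32_t *status_array)
{
	struct poll_result r = { 0, 0, -1 };
	bool has_error = false;

	r.ok = mock_completed(dev, &r.last, burst_size, &has_error);
	if (has_error)
		r.failed = mock_completed_status(dev, &r.last,
				(uint16_t)(burst_size - r.ok), status_array);
	return r;
}
```

With 32 finished ops of which only the 4th failed, the first poll with a
burst size of 32 reports 3 successes and 1 failure, and the next poll
picks up the remaining 28, so the application never has to handle more
than burst_size ops per poll.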


> 
> This two-function approach also allows future support for other DMA
> functions such as comparison, where a status value is always required. Any
> apps using that functionality would just always use the "_status" function
> for completions.
> 
> /Bruce
> 
> .
>
Bruce Richardson June 18, 2021, 9:30 a.m. UTC | #27
On Fri, Jun 18, 2021 at 04:52:00PM +0800, fengchengwen wrote:
> On 2021/6/17 22:18, Bruce Richardson wrote:
> > On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
> >> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
> >>> On 2021/6/17 1:31, Bruce Richardson wrote:
> >>>> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> >>>>> On 2021/6/16 0:38, Bruce Richardson wrote:
> >>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
> >>>>>>> device.
> >>>>>>>
> >>>>>>> The APIs of dmadev library exposes some generic operations which
> >>>>>>> can enable configuration and I/O with the DMA devices.
> >>>>>>>
> >>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com> ---
> >>>>>> Thanks for sending this.
> >>>>>>
> >>>>>> Of most interest to me right now are the key data-plane APIs.
> >>>>>> While we are still in the prototyping phase, below is a draft of
> >>>>>> what we are thinking for the key enqueue/perform_ops/completed_ops
> >>>>>> APIs.
> >>>>>>
> >>>>>> Some key differences I note in below vs your original RFC: * Use
> >>>>>> of void pointers rather than iova addresses. While using iova's
> >>>>>> makes sense in the general case when using hardware, in that it
> >>>>>> can work with both physical addresses and virtual addresses, if we
> >>>>>> change the APIs to use void pointers instead it will still work
> >>>>>> for DPDK in VA mode, while at the same time allow use of software
> >>>>>> fallbacks in error cases, and also a stub driver than uses memcpy
> >>>>>> in the background. Finally, using iova's makes the APIs a lot more
> >>>>>> awkward to use with anything but mbufs or similar buffers where we
> >>>>>> already have a pre-computed physical address.
> >>>>>
> >>>>> The iova is an hint to application, and widely used in DPDK.  If
> >>>>> switch to void, how to pass the address (iova or just va ?) this
> >>>>> may introduce implementation dependencies here.
> >>>>>
> >>>>> Or always pass the va, and the driver performs address translation,
> >>>>> and this translation may cost too much cpu I think.
> >>>>>
> >>>>
> >>>> On the latter point, about driver doing address translation I would
> >>>> agree.  However, we probably need more discussion about the use of
> >>>> iova vs just virtual addresses. My thinking on this is that if we
> >>>> specify the API using iovas it will severely hurt usability of the
> >>>> API, since it forces the user to take more inefficient codepaths in
> >>>> a large number of cases. Given a pointer to the middle of an mbuf,
> >>>> one cannot just pass that straight as an iova but must instead do a
> >>>> translation into offset from mbuf pointer and then readd the offset
> >>>> to the mbuf base address.
> >>>>
> >>>> My preference therefore is to require the use of an IOMMU when using
> >>>> a dmadev, so that it can be a much closer analog of memcpy. Once an
> >>>> iommu is present, DPDK will run in VA mode, allowing virtual
> >>>> addresses to our hugepage memory to be sent directly to hardware.
> >>>> Also, when using dmadevs on top of an in-kernel driver, that kernel
> >>>> driver may do all iommu management for the app, removing further the
> >>>> restrictions on what memory can be addressed by hardware.
> >>>
> >>> Some DMA devices many don't support IOMMU or IOMMU bypass default, so
> >>> driver may should call rte_mem_virt2phy() do the address translate,
> >>> but the rte_mem_virt2phy() cost too many CPU cycles.
> >>>
> >>> If the API defined as iova, it will work fine in: 1) If DMA don't
> >>> support IOMMU or IOMMU bypass, then start application with
> >>> --iova-mode=pa 2) If DMA support IOMMU, --iova-mode=pa/va work both
> >>> fine
> >>>
> >>
> >> I suppose if we keep the iova as the datatype, we can just cast "void
> >> *" pointers to that in the case that virtual addresses can be used
> >> directly. I believe your RFC included a capability query API - "uses
> >> void * as iova" should probably be one of those capabilities, and that
> >> would resolve this.  If DPDK is in iova=va mode because of the
> >> presence of an iommu, all drivers could report this capability too.
> >>
> >>>>
> >>>>>> * Use of id values rather than user-provided handles. Allowing the
> >>>>>> user/app to manage the amount of data stored per operation is a
> >>>>>> better solution, I feel than proscribing a certain about of
> >>>>>> in-driver tracking. Some apps may not care about anything other
> >>>>>> than a job being completed, while other apps may have significant
> >>>>>> metadata to be tracked. Taking the user-context handles out of the
> >>>>>> API also makes the driver code simpler.
> >>>>>
> >>>>> The user-provided handle was mainly used to simply application
> >>>>> implementation, It provides the ability to quickly locate contexts.
> >>>>>
> >>>>> The "use of id values" seem like the dma_cookie of Linux DMA engine
> >>>>> framework, user will get a unique dma_cookie after calling
> >>>>> dmaengine_submit(), and then could use it to call
> >>>>> dma_async_is_tx_complete() to get completion status.
> >>>>>
> >>>>
> >>>> Yes, the idea of the id is the same - to locate contexts. The main
> >>>> difference is that if we have the driver manage contexts or pointer
> >>>> to contexts, as well as giving more work to the driver, it
> >>>> complicates the APIs for measuring completions. If we use an
> >>>> ID-based approach, where the app maintains its own ring of contexts
> >>>> (if any), it avoids the need to have an "out" parameter array for
> >>>> returning those contexts, which needs to be appropriately sized.
> >>>> Instead we can just report that all ids up to N are completed. [This
> >>>> would be similar to your suggestion that N jobs be reported as done,
> >>>> in that no contexts are provided, it's just that knowing the ID of
> >>>> what is completed is generally more useful than the number (which
> >>>> can be obviously got by subtracting the old value)]
> >>>>
> >>>> We are still working on prototyping all this, but would hope to have
> >>>> a functional example of all this soon.
> >>>>
> >>>>> How about define the copy prototype as following: dma_cookie_t
> >>>>> rte_dmadev_copy(uint16_t dev_id, xxx) while the dma_cookie_t is
> >>>>> int32 and is monotonically increasing, when >=0 mean enqueue
> >>>>> successful else fail.  when complete the dmadev will return latest
> >>>>> completed dma_cookie, and the application could use the dma_cookie
> >>>>> to quick locate contexts.
> >>>>>
> >>>>
> >>>> If I understand this correctly, I believe this is largely what I was
> >>>> suggesting - just with the typedef for the type? In which case it
> >>>> obviously looks good to me.
> >>>>
> >>>>>> * I've kept a single combined API for completions, which differs
> >>>>>> from the separate error handling completion API you propose. I
> >>>>>> need to give the two function approach a bit of thought, but
> >>>>>> likely both could work. If we (likely) never expect failed ops,
> >>>>>> then the specifics of error handling should not matter that much.
> >>>>>
> >>>>> The rte_ioat_completed_ops API is too complex, and consider some
> >>>>> applications may never copy fail, so split them as two API.  It's
> >>>>> indeed not friendly to other scenarios that always require error
> >>>>> handling.
> >>>>>
> >>>>> I prefer use completed operations number as return value other than
> >>>>> the ID so that application could simple judge whether have new
> >>>>> completed operations, and the new prototype: uint16_t
> >>>>> rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie,
> >>>>> uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> >>>>>
> >>>>> 1) for normal case which never expect failed ops: just call: ret =
> >>>>> rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL); 2) for other
> >>>>> case: ret = rte_dmadev_completed(dev_id, &cookie, &status,
> >>>>> max_status, &fails); at this point the fails <= ret <= max_status
> >>>>>
> >>>> Completely agree that we need to plan for the happy-day case where
> >>>> all is passing. Looking at the prototypes you have above, I am ok
> >>>> with returning number of completed ops as the return value with the
> >>>> final completed cookie as an "out" parameter.  For handling errors,
> >>>> I'm ok with what you propose above, just with one small adjustment -
> >>>> I would remove the restriction that ret <= max_status.
> >>>>
> >>>> In case of zero-failures, we can report as many ops succeeding as we
> >>>> like, and even in case of failure, we can still report as many
> >>>> successful ops as we like before we start filling in the status
> >>>> field. For example, if 32 ops are completed, and the last one fails,
> >>>> we can just fill in one entry into status, and return 32.
> >>>> Alternatively if the 4th last one fails we fill in 4 entries and
> >>>> return 32. The only requirements would be: * fails <= max_status *
> >>>> fails <= ret * cookie holds the id of the last entry in status.
> >>>
> >>> I think we understand the same:
> >>>
> >>> The fails <= ret <= max_status include following situation: 1) If
> >>> max_status is 32, and there are 32 completed ops, then the ret will
> >>> be 32 no matter which ops is failed 2) If max_status is 33, and there
> >>> are 32 completed ops, then the ret will be 32 3) If max_status is 16,
> >>> and there are 32 completed ops, then the ret will be 16
> >>>
> >>> and the cookie always hold the id of the last returned completed ops,
> >>> no matter it's completed successful or failed
> >>>
> >>
> >> I actually disagree on the #3. If max_status is 16, there are 32
> >> completed ops, and *no failures* the ret will be 32, not 16, because
> >> we are not returning any status entries so max_status need not apply.
> >> Keeping that same scenario #3, depending on the number of failures and
> >> the point of them, the return value may similarly vary, for example: *
> >> if job #28 fails, then ret could still be 32, cookie would be the
> >> cookie for that job, "fails" parameter would return as 4, with status
> >> holding the failure of 28 plus the succeeded status of jobs 29-31,
> >> i.e. 4 elements.  * if job #5 fails, then we can't fit the status list
> >> from 5 though 31 in an array of 16, so "fails" == 16(max_status) and
> >> status contains the 16 statuses starting from #5, which means that
> >> cookie contains the value for job #20 and ret is 21.
> >>
> >> In other words, ignore max_status and status parameters *unless we
> >> have an error to return*, meaning the fast-path/happy-day case works
> >> as fast as possible. You don't need to worry about sizing your status
> >> array to be big, and you always get back a large number of completions
> >> when available. Your fastpath code only need check the "fails"
> >> parameter to see if status needs to ever be consulted, and in normal
> >> case it doesn't.
> >>
> >> If this is too complicated, maybe we can simplify a little by
> >> returning just one failure at a time, though at the cost of making
> >> error handling slower?
> >>
> >> rte_dmadev_completed(dev_id, &cookie, &failure_status)
> >>
> >> In this case, we always return the number of completed ops on success,
> >> while on failure, we return the first error code. For a single error,
> >> this works fine, but if we get a burst of errors together, things will
> >> work slower - which may be acceptable if errors are very rare.
> >> However, for idxd at least if a fence occurs after a failure all jobs
> >> in the batch after the fence would be skipped, which would lead to the
> >> "burst of errors" case.  Therefore, I'd prefer to have the original
> >> suggestion allowing multiple errors to be reported at a time.
> >>
> >> /Bruce
> > 
> > Apologies for self-reply, but thinking about it more, a combination of
> > normal-case and error-case APIs may be just simpler:
> > 
> > int rte_dmadev_completed(dev_id, &cookie)
> > 
> > returns number of items completed and cookie of last item. If there is
> > an error, returns all successful values up to the error entry and
> > returns -1 on subsequent call.
> > 
> > int rte_dmadev_completed_status(dev_id, &cookie, max_status,
> > status_array, &error_count)
> > 
> > this is a slower completion API which behaves like you originally said
> > above, returning number of completions x, 0 <= x <= max_status, with x
> > status values filled into array, and the number of unsuccessful values
> > in the error_count value.
> > 
> > This would allow code to be written in the application to use
> > rte_dmadev_completed() in the normal case, and on getting a "-1" value,
> > use rte_dmadev_completed_status() to get the error details. If strings
> > of errors might be expected, the app can continually use the
> > completed_status() function until error_count returns 0, and then
> > switch back to the faster/simpler version.
> 
> This two-function approach simplifies maintenance of the status_array
> because we don't need to initialize it to zero.  I think it's a good
> trade-off between performance and rich error info (status code).
> 
> Here I'd like to discuss the 'burst size', which is widely used in DPDK
> applications (e.g. NIC polling or ring en/dequeue).  Currently we don't
> define a maximum number of completed ops in the rte_dmadev_completed()
> API, so the return value may be greater than the application's 'burst
> size'. The application would then need to remember the return value and
> handle the remainder specially at the next poll.
> 
> Also, multiple calls to rte_dmadev_completed may be needed to check for
> failures, which may make it difficult for the application to use.
> 
> So I prefer the following prototypes:
>   uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error)
>     -- nb_cpls: the maximum number of operations to process
>     -- has_error: indicates whether an error occurred
>     -- return value: the number of successfully completed operations.
>     -- example:
>        1) If there are already 32 completed ops, the 4th failed, and nb_cpls is 32, then
>           the ret will be 3 (because the 1st/2nd/3rd are OK), and has_error will be true.
>        2) If there are already 32 completed ops, and all completed successfully, then the ret
>           will be min(32, nb_cpls), and has_error will be false.
>        3) If there are already 32 completed ops, and all of them failed, then the ret will
>           be 0, and has_error will be true.
>   uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_status, uint32_t *status)
>     -- return value: the number of failed completed operations.
> 
> The application uses the following invocation order when polling:
>   has_error = false; // could be initialized to false by the dmadev API; this needs discussion
>   ret = rte_dmadev_completed(dev_id, &cookie, burst_size, &has_error);
>   // process successfully completed ops:
>   for (int i = 0; i < ret; i++) {
>     // ...
>   }
>   if (unlikely(has_error)) {
>     // process failed ops
>     ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret, status_array);
>     for (int i = 0; i < ret; i++) {
>       // ...
>     }
>   }
>
Seems reasonable. Let's go with this as an option for now - I just want to
check the perf impact of all of these on the offload cost. Once I get our
prototype working with some hardware (hopefully very soon) I can check this
out directly.
fengchengwen June 18, 2021, 9:41 a.m. UTC | #28
On 2021/6/18 13:52, Jerin Jacob wrote:
> On Thu, Jun 17, 2021 at 2:46 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
>>
>> On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:
>>> On Wed, Jun 16, 2021 at 3:47 PM fengchengwen <fengchengwen@huawei.com> wrote:
>>>>
>>>> On 2021/6/16 15:09, Morten Brørup wrote:
>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
>>>>>> Sent: Tuesday, 15 June 2021 18.39
>>>>>>
>>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
>>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
>>>>>>> device.
>>>>>>>
>>>>>>> The APIs of dmadev library exposes some generic operations which can
>>>>>>> enable configuration and I/O with the DMA devices.
>>>>>>>
>>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>>>>>>> ---
>>>>>> Thanks for sending this.
>>>>>>
>>>>>> Of most interest to me right now are the key data-plane APIs. While we are
>>>>>> still in the prototyping phase, below is a draft of what we are thinking
>>>>>> for the key enqueue/perform_ops/completed_ops APIs.
>>>>>>
>>>>>> Some key differences I note in below vs your original RFC:
>>>>>> * Use of void pointers rather than iova addresses. While using iova's makes
>>>>>>   sense in the general case when using hardware, in that it can work with
>>>>>>   both physical addresses and virtual addresses, if we change the APIs to use
>>>>>>   void pointers instead it will still work for DPDK in VA mode, while at the
>>>>>>   same time allow use of software fallbacks in error cases, and also a stub
>>>>>>   driver than uses memcpy in the background. Finally, using iova's makes the
>>>>>>   APIs a lot more awkward to use with anything but mbufs or similar buffers
>>>>>>   where we already have a pre-computed physical address.
>>>>>> * Use of id values rather than user-provided handles. Allowing the user/app
>>>>>>   to manage the amount of data stored per operation is a better solution, I
>>>>>>   feel than proscribing a certain about of in-driver tracking. Some apps may
>>>>>>   not care about anything other than a job being completed, while other apps
>>>>>>   may have significant metadata to be tracked. Taking the user-context
>>>>>>   handles out of the API also makes the driver code simpler.
>>>>>> * I've kept a single combined API for completions, which differs from the
>>>>>>   separate error handling completion API you propose. I need to give the
>>>>>>   two function approach a bit of thought, but likely both could work. If we
>>>>>>   (likely) never expect failed ops, then the specifics of error handling
>>>>>>   should not matter that much.
>>>>>>
>>>>>> For the rest, the control / setup APIs are likely to be rather
>>>>>> uncontroversial, I suspect. However, I think that rather than xstats APIs,
>>>>>> the library should first provide a set of standardized stats like ethdev
>>>>>> does. If driver-specific stats are needed, we can add xstats later to the
>>>>>> API.
>>>>>>
>>>>>> Appreciate your further thoughts on this, thanks.
>>>>>>
>>>>>> Regards,
>>>>>> /Bruce
>>>>>
>>>>> I generally agree with Bruce's points above.
>>>>>
>>>>> I would like to share a couple of ideas for further discussion:
>>>
>>>
>>> I believe some of the other requirements and comments for generic DMA will be
>>>
>>> 1) Support for the _channel_, Each channel may have different
>>> capabilities and functionalities.
>>> Typical cases are, each channel have separate source and destination
>>> devices like
>>> DMA between PCIe EP to Host memory, Host memory to Host memory, PCIe
>>> EP to PCIe EP.
>>> So we need some notion of the channel in the specification.
>>>
>>
>> Can you share a bit more detail on what constitutes a channel in this case?
>> Is it equivalent to a device queue (which we are flattening to individual
>> devices in this API), or to a specific configuration on a queue?
> 
> It not a queue. It is one of the attributes for transfer.
> I.e in the same queue, for a given transfer it can specify the
> different "source" and "destination" device.
> Like CPU to Sound card, CPU to network card etc.
> 
> 
>>
>>> 2) I assume current data plane APIs are not thread-safe. Is it right?
>>>
>> Yes.
>>
>>>
>>> 3) Cookie scheme outlined earlier looks good to me. Instead of having
>>> generic dequeue() API
>>>
>>> 4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
>>> void * dst, unsigned int length);
>>> to two stage API like, Where one will be used in fastpath and other
>>> one will use used in slowpath.
>>>
>>> - slowpath API will for take channel and take other attributes for transfer
>>>
>>> Example syantx will be:
>>>
>>> struct rte_dmadev_desc {
>>>            channel id;
>>>            ops ; // copy, xor, fill etc
>>>           other arguments specific to dma transfer // it can be set
>>> based on capability.
>>>
>>> };
>>>
>>> rte_dmadev_desc_t rte_dmadev_preprare(uint16_t dev_id,  struct
>>> rte_dmadev_desc *dec);
>>>
>>> - Fastpath takes arguments that need to change per transfer along with
>>> slow-path handle.
>>>
>>> rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst, unsigned
>>> int length,  rte_dmadev_desc_t desc)
>>>
>>> This will help to driver to
>>> -Former API form the device-specific descriptors in slow path  for a
>>> given channel and fixed attributes per transfer
>>> -Later API blend "variable" arguments such as src, dest address with
>>> slow-path created descriptors
>>>
>>
>> This seems like an API for a context-aware device, where the channel is the
>> config data/context that is preserved across operations - is that correct?
>> At least from the Intel DMA accelerators side, we have no concept of this
>> context, and each operation is completely self-described. The location or
>> type of memory for copies is irrelevant, you just pass the src/dst
>> addresses to reference.
> 
> it is not context-aware device. Each HW JOB is self-described.
> You can view it different attributes of transfer.
> 
> 
>>
>>> The above will give better performance and is the best trade-off c
>>> between performance and per transfer variables.
>>
>> We may need to have different APIs for context-aware and context-unaware
>> processing, with which to use determined by the capabilities discovery.
>> Given that for these DMA devices the offload cost is critical, more so than
>> any other dev class I've looked at before, I'd like to avoid having APIs
>> with extra parameters than need to be passed about since that just adds
>> extra CPU cycles to the offload.
> 
> If driver does not support additional attributes and/or the
> application does not need it, rte_dmadev_desc_t can be NULL.
> So that it won't have any cost in the datapath. I think, we can go to
> different API
> cases if we can not abstract problems without performance impact.
> Otherwise, it will be too much
> pain for applications.

Yes, currently we plan to use a different API for each case, e.g.
  rte_dmadev_memcpy()  -- local-to-local memory copy
  rte_dmadev_memset()  -- fill local memory with a pattern
and maybe:
  rte_dmadev_imm_data()  -- copy a very small amount of data
  rte_dmadev_p2pcopy()   -- peer-to-peer copy between different PCIe addresses

These API capabilities will be reflected in the device capability set, so that
the application can discover them through a standard API.
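
As a rough illustration of how such a capability set might be queried, here is
a minimal sketch. The DMADEV_CAPA_* bits, struct dmadev_info and
dmadev_supports() are hypothetical names invented for this example; they are
not taken from the RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits, one per operation type listed above. */
#define DMADEV_CAPA_MEMCPY   (UINT64_C(1) << 0) /* local-to-local copy */
#define DMADEV_CAPA_MEMSET   (UINT64_C(1) << 1) /* fill with a pattern */
#define DMADEV_CAPA_IMM_DATA (UINT64_C(1) << 2) /* copy of very small data */
#define DMADEV_CAPA_P2PCOPY  (UINT64_C(1) << 3) /* PCIe peer-to-peer copy */

/* Hypothetical per-device info, as a driver might report it. */
struct dmadev_info {
	uint64_t capa; /* bitwise OR of DMADEV_CAPA_* bits */
};

/* Return 1 if the device advertises every capability in 'wanted', else 0. */
static inline int
dmadev_supports(const struct dmadev_info *info, uint64_t wanted)
{
	assert(info != NULL);
	return (info->capa & wanted) == wanted;
}
```

An application would check dmadev_supports() once at setup time and only call
the per-operation functions that the device reports as available.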

> 
> Just to understand, I think, we need to HW capabilities and how to
> have a common API.
> I assume HW will have some HW JOB descriptors which will be filled in
> SW and submitted to HW.
> In our HW,  Job descriptor has the following main elements
> 
> - Channel   // We don't expect the application to change per transfer
> - Source address - It can be scatter-gather too - Will be changed per transfer
> - Destination address - It can be scatter-gather too - Will be changed
> per transfer
> - Transfer Length - - It can be scatter-gather too - Will be changed
> per transfer
> - IOVA address where HW post Job completion status PER Job descriptor
> - Will be changed per transfer
> - Another sideband information related to channel  // We don't expect
> the application to change per transfer
> - As an option, Job completion can be posted as an event to
> rte_event_queue  too // We don't expect the application to change per
> transfer

The 'option' field looks like a software interface field rather than part of
the HW descriptor.

> 
> @Richardson, Bruce @fengchengwen @Hemant Agrawal
> 
> Could you share the options for your HW descriptors  which you are
> planning to expose through API like above so that we can easily
> converge on fastpath API
> 

The Kunpeng HW descriptor is self-describing and does not need to refer to
context info.

Maybe the fields that are fixed for a given transfer type could be set up by
the driver and not exposed to the application.

That way we could define the API in a more generic way.

> 
> 
>>
>> /Bruce
> 
> .
>
Bruce Richardson June 18, 2021, 9:55 a.m. UTC | #29
On Fri, Jun 18, 2021 at 11:22:28AM +0530, Jerin Jacob wrote:
> On Thu, Jun 17, 2021 at 2:46 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:
> > > On Wed, Jun 16, 2021 at 3:47 PM fengchengwen <fengchengwen@huawei.com> wrote:
> > > >
> > > > On 2021/6/16 15:09, Morten Brørup wrote:
> > > > >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > > >> Sent: Tuesday, 15 June 2021 18.39
> > > > >>
> > > > >> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > > >>> This patch introduces 'dmadevice' which is a generic type of DMA
> > > > >>> device.
> > > > >>>
> > > > >>> The APIs of dmadev library exposes some generic operations which can
> > > > >>> enable configuration and I/O with the DMA devices.
> > > > >>>
> > > > >>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > > >>> ---
> > > > >> Thanks for sending this.
> > > > >>
> > > > >> Of most interest to me right now are the key data-plane APIs. While we
> > > > >> are
> > > > >> still in the prototyping phase, below is a draft of what we are
> > > > >> thinking
> > > > >> for the key enqueue/perform_ops/completed_ops APIs.
> > > > >>
> > > > >> Some key differences I note in below vs your original RFC:
> > > > >> * Use of void pointers rather than iova addresses. While using iova's
> > > > >> makes
> > > > >>   sense in the general case when using hardware, in that it can work
> > > > >> with
> > > > >>   both physical addresses and virtual addresses, if we change the APIs
> > > > >> to use
> > > > >>   void pointers instead it will still work for DPDK in VA mode, while
> > > > >> at the
> > > > >>   same time allow use of software fallbacks in error cases, and also a
> > > > >> stub
> > > > >>   driver than uses memcpy in the background. Finally, using iova's
> > > > >> makes the
> > > > >>   APIs a lot more awkward to use with anything but mbufs or similar
> > > > >> buffers
> > > > >>   where we already have a pre-computed physical address.
> > > > >> * Use of id values rather than user-provided handles. Allowing the
> > > > >> user/app
> > > > >>   to manage the amount of data stored per operation is a better
> > > > >> solution, I
> > > > >>   feel than proscribing a certain about of in-driver tracking. Some
> > > > >> apps may
> > > > >>   not care about anything other than a job being completed, while other
> > > > >> apps
> > > > >>   may have significant metadata to be tracked. Taking the user-context
> > > > >>   handles out of the API also makes the driver code simpler.
> > > > >> * I've kept a single combined API for completions, which differs from
> > > > >> the
> > > > >>   separate error handling completion API you propose. I need to give
> > > > >> the
> > > > >>   two function approach a bit of thought, but likely both could work.
> > > > >> If we
> > > > >>   (likely) never expect failed ops, then the specifics of error
> > > > >> handling
> > > > >>   should not matter that much.
> > > > >>
> > > > >> For the rest, the control / setup APIs are likely to be rather
> > > > >> uncontroversial, I suspect. However, I think that rather than xstats
> > > > >> APIs,
> > > > >> the library should first provide a set of standardized stats like
> > > > >> ethdev
> > > > >> does. If driver-specific stats are needed, we can add xstats later to
> > > > >> the
> > > > >> API.
> > > > >>
> > > > >> Appreciate your further thoughts on this, thanks.
> > > > >>
> > > > >> Regards,
> > > > >> /Bruce
> > > > >
> > > > > I generally agree with Bruce's points above.
> > > > >
> > > > > I would like to share a couple of ideas for further discussion:
> > >
> > >
> > > I believe some of the other requirements and comments for generic DMA will be
> > >
> > > 1) Support for the _channel_, Each channel may have different
> > > capabilities and functionalities.
> > > Typical cases are, each channel have separate source and destination
> > > devices like
> > > DMA between PCIe EP to Host memory, Host memory to Host memory, PCIe
> > > EP to PCIe EP.
> > > So we need some notion of the channel in the specification.
> > >
> >
> > Can you share a bit more detail on what constitutes a channel in this case?
> > Is it equivalent to a device queue (which we are flattening to individual
> > devices in this API), or to a specific configuration on a queue?
> 
> It not a queue. It is one of the attributes for transfer.
> I.e in the same queue, for a given transfer it can specify the
> different "source" and "destination" device.
> Like CPU to Sound card, CPU to network card etc.
>
Ok. Thanks for clarifying. Do you think it's best given as a
device-specific parameter to the various functions, and NULL for hardware
that doesn't need it?
 
> 
> >
> > > 2) I assume current data plane APIs are not thread-safe. Is it right?
> > >
> > Yes.
> >
> > >
> > > 3) Cookie scheme outlined earlier looks good to me. Instead of having
> > > generic dequeue() API
> > >
> > > 4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
> > > void * dst, unsigned int length);
> > > to two stage API like, Where one will be used in fastpath and other
> > > one will use used in slowpath.
> > >
> > > - slowpath API will for take channel and take other attributes for transfer
> > >
> > > Example syantx will be:
> > >
> > > struct rte_dmadev_desc {
> > >            channel id;
> > >            ops ; // copy, xor, fill etc
> > >           other arguments specific to dma transfer // it can be set
> > > based on capability.
> > >
> > > };
> > >
> > > rte_dmadev_desc_t rte_dmadev_preprare(uint16_t dev_id,  struct
> > > rte_dmadev_desc *dec);
> > >
> > > - Fastpath takes arguments that need to change per transfer along with
> > > slow-path handle.
> > >
> > > rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst, unsigned
> > > int length,  rte_dmadev_desc_t desc)
> > >
> > > This will help to driver to
> > > -Former API form the device-specific descriptors in slow path  for a
> > > given channel and fixed attributes per transfer
> > > -Later API blend "variable" arguments such as src, dest address with
> > > slow-path created descriptors
> > >
> >
> > This seems like an API for a context-aware device, where the channel is the
> > config data/context that is preserved across operations - is that correct?
> > At least from the Intel DMA accelerators side, we have no concept of this
> > context, and each operation is completely self-described. The location or
> > type of memory for copies is irrelevant, you just pass the src/dst
> > addresses to reference.
> 
> it is not context-aware device. Each HW JOB is self-described.
> You can view it different attributes of transfer.
> 
> 
> >
> > > The above will give better performance and is the best trade-off c
> > > between performance and per transfer variables.
> >
> > We may need to have different APIs for context-aware and context-unaware
> > processing, with which to use determined by the capabilities discovery.
> > Given that for these DMA devices the offload cost is critical, more so than
> > any other dev class I've looked at before, I'd like to avoid having APIs
> > with extra parameters than need to be passed about since that just adds
> > extra CPU cycles to the offload.
> 
> If driver does not support additional attributes and/or the
> application does not need it, rte_dmadev_desc_t can be NULL.
> So that it won't have any cost in the datapath. I think, we can go to
> different API
> cases if we can not abstract problems without performance impact.
> Otherwise, it will be too much
> pain for applications.

Ok. Having one extra parameter ignored by some drivers should not be that
big of a deal. [With all of these, we'll only really know for sure once they
are implemented and the offload cost is measured.]

> 
> Just to understand, I think, we need to HW capabilities and how to
> have a common API.
> I assume HW will have some HW JOB descriptors which will be filled in
> SW and submitted to HW.
> In our HW,  Job descriptor has the following main elements
> 
> - Channel   // We don't expect the application to change per transfer
> - Source address - It can be scatter-gather too - Will be changed per transfer
> - Destination address - It can be scatter-gather too - Will be changed
> per transfer
> - Transfer Length - - It can be scatter-gather too - Will be changed
> per transfer
> - IOVA address where HW post Job completion status PER Job descriptor
> - Will be changed per transfer
> - Another sideband information related to channel  // We don't expect
> the application to change per transfer
> - As an option, Job completion can be posted as an event to
> rte_event_queue  too // We don't expect the application to change per
> transfer
> 
> @Richardson, Bruce @fengchengwen @Hemant Agrawal
> 
> Could you share the options for your HW descriptors  which you are
> planning to expose through API like above so that we can easily
> converge on fastpath API
> 
Taking the case of a simple copy op, the parameters we need are:

* src
* dst
* length

Depending on the specific hardware, a completion address will also be passed
in the descriptor, but we plan for these cases to always have the completions
written back to a set location, so that we essentially have ring-writeback,
as with the hardware which doesn't explicitly have a separate completion
address. Beyond that, I believe the only descriptor fields we will use are
just the flags field indicating the op type etc.
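
To make the shape of such a descriptor concrete, here is a minimal software
model, assuming a ring of copy descriptors whose completions are written back
to fixed per-slot locations. All names (copy_desc, sw_ring, ring_enqueue_copy,
ring_perform, ring_is_done, OP_COPY) are hypothetical, and memcpy stands in
for the hardware engine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 8u /* power of two, so masking wraps the slot index */
#define OP_COPY   1u /* hypothetical op-type flag */

/* Hypothetical descriptor: src, dst, length, plus a flags word. */
struct copy_desc {
	const void *src;
	void *dst;
	uint32_t length;
	uint32_t flags;
};

struct sw_ring {
	struct copy_desc desc[RING_SIZE];
	uint8_t done[RING_SIZE]; /* fixed completion slots (ring-writeback) */
	uint16_t next_id;        /* id handed out by the next enqueue */
	uint16_t next_run;       /* first enqueued op not yet executed */
};

/* Queue a copy; returns the op id (0..65535) or -1 if the ring is full. */
static int
ring_enqueue_copy(struct sw_ring *r, const void *src, void *dst, uint32_t length)
{
	if ((uint16_t)(r->next_id - r->next_run) == RING_SIZE)
		return -1;
	uint16_t id = r->next_id++;
	struct copy_desc *d = &r->desc[id & (RING_SIZE - 1)];
	d->src = src;
	d->dst = dst;
	d->length = length;
	d->flags = OP_COPY;
	r->done[id & (RING_SIZE - 1)] = 0;
	return id;
}

/* Stand-in for the device: run all queued ops and write back completions. */
static void
ring_perform(struct sw_ring *r)
{
	while (r->next_run != r->next_id) {
		uint16_t slot = r->next_run & (RING_SIZE - 1);
		memcpy(r->desc[slot].dst, r->desc[slot].src, r->desc[slot].length);
		r->done[slot] = 1;
		r->next_run++;
	}
}

/* Poll the fixed completion slot that belongs to op 'id'. */
static int
ring_is_done(const struct sw_ring *r, uint16_t id)
{
	return r->done[id & (RING_SIZE - 1)];
}
```

Because the completion location is derived from the op id, the application
does not have to pass a completion address with each transfer.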

/Bruce
Bruce Richardson June 18, 2021, 10:03 a.m. UTC | #30
On Fri, Jun 18, 2021 at 10:46:08AM +0530, Jerin Jacob wrote:
> On Thu, Jun 17, 2021 at 1:30 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Thu, Jun 17, 2021 at 01:12:22PM +0530, Jerin Jacob wrote:
> > > On Thu, Jun 17, 2021 at 12:43 AM Bruce Richardson
> > > <bruce.richardson@intel.com> wrote:
> > > >
> > > > On Wed, Jun 16, 2021 at 11:38:08PM +0530, Jerin Jacob wrote:
> > > > > On Wed, Jun 16, 2021 at 11:01 PM Bruce Richardson
> > > > > <bruce.richardson@intel.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > > > > > > On 2021/6/16 0:38, Bruce Richardson wrote:
> > > > > > > > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > > > > > >> This patch introduces 'dmadevice' which is a generic type of DMA
> > > > > > > >> device.
> > > > > > > >>
> > > > > > > >> The APIs of dmadev library exposes some generic operations which can
> > > > > > > >> enable configuration and I/O with the DMA devices.
> > > > > > > >>
> > > > > > > >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > > > > > >> ---
> > > > > > > > Thanks for sending this.
> > > > > > > >
> > > > > > > > Of most interest to me right now are the key data-plane APIs. While we are
> > > > > > > > still in the prototyping phase, below is a draft of what we are thinking
> > > > > > > > for the key enqueue/perform_ops/completed_ops APIs.
> > > > > > > >
> > > > > > > > Some key differences I note in below vs your original RFC:
> > > > > > > > * Use of void pointers rather than iova addresses. While using iova's makes
> > > > > > > >   sense in the general case when using hardware, in that it can work with
> > > > > > > >   both physical addresses and virtual addresses, if we change the APIs to use
> > > > > > > >   void pointers instead it will still work for DPDK in VA mode, while at the
> > > > > > > >   same time allow use of software fallbacks in error cases, and also a stub
> > > > > > > >   driver than uses memcpy in the background. Finally, using iova's makes the
> > > > > > > >   APIs a lot more awkward to use with anything but mbufs or similar buffers
> > > > > > > >   where we already have a pre-computed physical address.
> > > > > > >
> > > > > > > The iova is an hint to application, and widely used in DPDK.
> > > > > > > If switch to void, how to pass the address (iova or just va ?)
> > > > > > > this may introduce implementation dependencies here.
> > > > > > >
> > > > > > > Or always pass the va, and the driver performs address translation, and this
> > > > > > > translation may cost too much cpu I think.
> > > > > > >
> > > > > >
> > > > > > On the latter point, about driver doing address translation I would agree.
> > > > > > However, we probably need more discussion about the use of iova vs just
> > > > > > virtual addresses. My thinking on this is that if we specify the API using
> > > > > > iovas it will severely hurt usability of the API, since it forces the user
> > > > > > to take more inefficient codepaths in a large number of cases. Given a
> > > > > > pointer to the middle of an mbuf, one cannot just pass that straight as an
> > > > > > iova but must instead do a translation into offset from mbuf pointer and
> > > > > > then readd the offset to the mbuf base address.
> > > > > >
> > > > > > My preference therefore is to require the use of an IOMMU when using a
> > > > > > dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> > > > > > present, DPDK will run in VA mode, allowing virtual addresses to our
> > > > > > hugepage memory to be sent directly to hardware. Also, when using
> > > > > > dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> > > > > > management for the app, removing further the restrictions on what memory
> > > > > > can be addressed by hardware.
> > > > >
> > > > >
> > > > > One issue of keeping void * is that memory can come from stack or heap .
> > > > > which HW can not really operate it on.
> > > >
> > > > when kernel driver is managing the IOMMU all process memory can be worked
> > > > on, not just hugepage memory, so using iova is wrong in these cases.
> > >
> > > But not for stack and heap memory. Right?
> > >
> > Yes, even stack and heap can be accessed.
> 
> The HW device cannot as that memory is NOT mapped to IOMMU. It will
> result in the transaction
> fault.
>

Not if the kernel driver rather than DPDK is managing the IOMMU:
https://www.kernel.org/doc/html/latest/x86/sva.html
"Shared Virtual Addressing (SVA) allows the processor and device to use the
same virtual addresses avoiding the need for software to translate virtual
addresses to physical addresses. SVA is what PCIe calls Shared Virtual
Memory (SVM)."
 
> At least, In octeon, DMA HW job descriptor will have a pointer (IOVA)
> which will be updated by _HW_
> upon copy job completion. That memory can not be from the
> heap(malloc()) or stack as those are not
> mapped by IOMMU.
> 
> 
> >
> > > >
> > > > As I previously said, using iova prevents the creation of a pure software
> > > > dummy driver too using memcpy in the background.
> > >
> > > Why ? the memory alloced uing rte_alloc/rte_memzone etc can be touched by CPU.
> > >
> > Yes, but it can't be accessed using physical address, so again only VA mode
> > where iova's are "void *" make sense.
> 
> I agree that it should be a physical address. My only concern that
> void * does not express
> it can not be from stack/heap. If API tells the memory need to
> allotted by rte_alloc() or rte_memzone() etc
> is fine with me.
> 
That could be a capability field too. Hardware supporting SVA/SVM does not
have this limitation so can specify that any virtual address may be used.

I suppose it really doesn't matter whether the APIs are written to take
pointers or iova's so long as the restrictions are clear. Since iova is the
default for other HW ops, I'm ok for functions to take params as iovas and
have the capability definitions provide the info to the user that in some
cases virtual addresses can be used.
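
A minimal sketch of that idea, assuming a hypothetical DMADEV_CAPA_SVA
capability bit and a hypothetical app_buf bookkeeping structure: the function
parameter stays iova-typed, and the capability tells the application when a
plain virtual address (even from stack or heap) may be passed instead.

```c
#include <assert.h>
#include <stdint.h>

#define DMADEV_CAPA_SVA (UINT64_C(1) << 0) /* hypothetical: any VA is usable */

typedef uint64_t sketch_iova_t; /* stand-in for an iova-typed parameter */

/* Hypothetical description of a buffer the application wants to use. */
struct app_buf {
	void *va;           /* CPU virtual address */
	sketch_iova_t iova; /* meaningful only when is_mapped is set */
	int is_mapped;      /* 1 if the memory is mapped for device access */
};

/*
 * Choose the address to hand to the device.
 * Returns 0 and stores the address in *out, or -1 if the buffer cannot be
 * used with this device (no SVA and no IOMMU mapping).
 */
static int
dma_addr_for(uint64_t dev_capa, const struct app_buf *buf, sketch_iova_t *out)
{
	if (dev_capa & DMADEV_CAPA_SVA) {
		/* SVA/SVM: the device shares the process address space. */
		*out = (sketch_iova_t)(uintptr_t)buf->va;
		return 0;
	}
	if (buf->is_mapped) {
		*out = buf->iova;
		return 0;
	}
	return -1;
}
```

With this arrangement a stack buffer is rejected up front on hardware without
SVA, rather than causing a transaction fault at run time.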
Jerin Jacob June 22, 2021, 5:25 p.m. UTC | #31
On Fri, Jun 18, 2021 at 3:11 PM fengchengwen <fengchengwen@huawei.com> wrote:
>
> On 2021/6/18 13:52, Jerin Jacob wrote:
> > On Thu, Jun 17, 2021 at 2:46 PM Bruce Richardson
> > <bruce.richardson@intel.com> wrote:
> >>
> >> On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:
> >>> On Wed, Jun 16, 2021 at 3:47 PM fengchengwen <fengchengwen@huawei.com> wrote:
> >>>>
> >>>> On 2021/6/16 15:09, Morten Brørup wrote:
> >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> >>>>>> Sent: Tuesday, 15 June 2021 18.39
> >>>>>>
> >>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
> >>>>>>> device.
> >>>>>>>
> >>>>>>> The APIs of dmadev library exposes some generic operations which can
> >>>>>>> enable configuration and I/O with the DMA devices.
> >>>>>>>
> >>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> >>>>>>> ---
> >>>>>> Thanks for sending this.
> >>>>>>
> >>>>>> Of most interest to me right now are the key data-plane APIs. While we
> >>>>>> are
> >>>>>> still in the prototyping phase, below is a draft of what we are
> >>>>>> thinking
> >>>>>> for the key enqueue/perform_ops/completed_ops APIs.
> >>>>>>
> >>>>>> Some key differences I note in below vs your original RFC:
> >>>>>> * Use of void pointers rather than iova addresses. While using iova's
> >>>>>> makes
> >>>>>>   sense in the general case when using hardware, in that it can work
> >>>>>> with
> >>>>>>   both physical addresses and virtual addresses, if we change the APIs
> >>>>>> to use
> >>>>>>   void pointers instead it will still work for DPDK in VA mode, while
> >>>>>> at the
> >>>>>>   same time allow use of software fallbacks in error cases, and also a
> >>>>>> stub
> >>>>>>   driver than uses memcpy in the background. Finally, using iova's
> >>>>>> makes the
> >>>>>>   APIs a lot more awkward to use with anything but mbufs or similar
> >>>>>> buffers
> >>>>>>   where we already have a pre-computed physical address.
> >>>>>> * Use of id values rather than user-provided handles. Allowing the
> >>>>>> user/app
> >>>>>>   to manage the amount of data stored per operation is a better
> >>>>>> solution, I
> >>>>>>   feel than proscribing a certain about of in-driver tracking. Some
> >>>>>> apps may
> >>>>>>   not care about anything other than a job being completed, while other
> >>>>>> apps
> >>>>>>   may have significant metadata to be tracked. Taking the user-context
> >>>>>>   handles out of the API also makes the driver code simpler.
> >>>>>> * I've kept a single combined API for completions, which differs from
> >>>>>> the
> >>>>>>   separate error handling completion API you propose. I need to give
> >>>>>> the
> >>>>>>   two function approach a bit of thought, but likely both could work.
> >>>>>> If we
> >>>>>>   (likely) never expect failed ops, then the specifics of error
> >>>>>> handling
> >>>>>>   should not matter that much.
> >>>>>>
> >>>>>> For the rest, the control / setup APIs are likely to be rather
> >>>>>> uncontroversial, I suspect. However, I think that rather than xstats
> >>>>>> APIs,
> >>>>>> the library should first provide a set of standardized stats like
> >>>>>> ethdev
> >>>>>> does. If driver-specific stats are needed, we can add xstats later to
> >>>>>> the
> >>>>>> API.
> >>>>>>
> >>>>>> Appreciate your further thoughts on this, thanks.
> >>>>>>
> >>>>>> Regards,
> >>>>>> /Bruce
> >>>>>
> >>>>> I generally agree with Bruce's points above.
> >>>>>
> >>>>> I would like to share a couple of ideas for further discussion:
> >>>
> >>>
> >>> I believe some of the other requirements and comments for generic DMA will be
> >>>
> >>> 1) Support for the _channel_, Each channel may have different
> >>> capabilities and functionalities.
> >>> Typical cases are, each channel have separate source and destination
> >>> devices like
> >>> DMA between PCIe EP to Host memory, Host memory to Host memory, PCIe
> >>> EP to PCIe EP.
> >>> So we need some notion of the channel in the specification.
> >>>
> >>
> >> Can you share a bit more detail on what constitutes a channel in this case?
> >> Is it equivalent to a device queue (which we are flattening to individual
> >> devices in this API), or to a specific configuration on a queue?
> >
> > It not a queue. It is one of the attributes for transfer.
> > I.e in the same queue, for a given transfer it can specify the
> > different "source" and "destination" device.
> > Like CPU to Sound card, CPU to network card etc.
> >
> >
> >>
> >>> 2) I assume current data plane APIs are not thread-safe. Is it right?
> >>>
> >> Yes.
> >>
> >>>
> >>> 3) Cookie scheme outlined earlier looks good to me. Instead of having
> >>> generic dequeue() API
> >>>
> >>> 4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
> >>> void * dst, unsigned int length);
> >>> to two stage API like, Where one will be used in fastpath and other
> >>> one will use used in slowpath.
> >>>
> >>> - slowpath API will for take channel and take other attributes for transfer
> >>>
> >>> Example syantx will be:
> >>>
> >>> struct rte_dmadev_desc {
> >>>            channel id;
> >>>            ops ; // copy, xor, fill etc
> >>>           other arguments specific to dma transfer // it can be set
> >>> based on capability.
> >>>
> >>> };
> >>>
> >>> rte_dmadev_desc_t rte_dmadev_preprare(uint16_t dev_id,  struct
> >>> rte_dmadev_desc *dec);
> >>>
> >>> - Fastpath takes arguments that need to change per transfer along with
> >>> slow-path handle.
> >>>
> >>> rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst, unsigned
> >>> int length,  rte_dmadev_desc_t desc)
> >>>
> >>> This will help to driver to
> >>> -Former API form the device-specific descriptors in slow path  for a
> >>> given channel and fixed attributes per transfer
> >>> -Later API blend "variable" arguments such as src, dest address with
> >>> slow-path created descriptors
> >>>
> >>
> >> This seems like an API for a context-aware device, where the channel is the
> >> config data/context that is preserved across operations - is that correct?
> >> At least from the Intel DMA accelerators side, we have no concept of this
> >> context, and each operation is completely self-described. The location or
> >> type of memory for copies is irrelevant, you just pass the src/dst
> >> addresses to reference.
> >
> > it is not context-aware device. Each HW JOB is self-described.
> > You can view it different attributes of transfer.
> >
> >
> >>
> >>> The above will give better performance and is the best trade-off c
> >>> between performance and per transfer variables.
> >>
> >> We may need to have different APIs for context-aware and context-unaware
> >> processing, with which to use determined by the capabilities discovery.
> >> Given that for these DMA devices the offload cost is critical, more so than
> >> any other dev class I've looked at before, I'd like to avoid having APIs
> >> with extra parameters than need to be passed about since that just adds
> >> extra CPU cycles to the offload.
> >
> > If driver does not support additional attributes and/or the
> > application does not need it, rte_dmadev_desc_t can be NULL.
> > So that it won't have any cost in the datapath. I think, we can go to
> > different API
> > cases if we can not abstract problems without performance impact.
> > Otherwise, it will be too much
> > pain for applications.
>
> Yes, currently we plan to use different API for different case, e.g.
>   rte_dmadev_memcpy()  -- deal with local to local memcopy
>   rte_dmadev_memset()  -- deal with fill with local memory with pattern
> maybe:
>   rte_dmadev_imm_data()  --deal with copy very little data
>   rte_dmadev_p2pcopy()   --deal with peer-to-peer copy of diffenet PCIE addr
>
> These API capabilities will be reflected in the device capability set so that
> application could know by standard API.


There will be a lot of combinations of these; it becomes an M x N
cross-product of cases. It won't scale.

>
> >
> > Just to understand, I think, we need to HW capabilities and how to
> > have a common API.
> > I assume HW will have some HW JOB descriptors which will be filled in
> > SW and submitted to HW.
> > In our HW,  Job descriptor has the following main elements
> >
> > - Channel   // We don't expect the application to change per transfer
> > - Source address - It can be scatter-gather too - Will be changed per transfer
> > - Destination address - It can be scatter-gather too - Will be changed
> > per transfer
> > - Transfer Length - - It can be scatter-gather too - Will be changed
> > per transfer
> > - IOVA address where HW post Job completion status PER Job descriptor
> > - Will be changed per transfer
> > - Another sideband information related to channel  // We don't expect
> > the application to change per transfer
> > - As an option, Job completion can be posted as an event to
> > rte_event_queue  too // We don't expect the application to change per
> > transfer
>
> The 'option' field looks like a software interface field, but not HW descriptor.

It is in the HW descriptor.

>
> >
> > @Richardson, Bruce @fengchengwen @Hemant Agrawal
> >
> > Could you share the options for your HW descriptors  which you are
> > planning to expose through API like above so that we can easily
> > converge on fastpath API
> >
>
> Kunpeng HW descriptor is self-describing, and don't need refer context info.
>
> Maybe the fields which was fix with some transfer type could setup by driver, and
> don't expose to application.

Yes, I agree. That is the reason why I thought of having an
rte_dmadev_prep() call to convert the DPDK DMA transfer attributes into
HW-specific descriptors, and a single enq() operation that takes the
variable arguments (through the enq parameters) and the fixed arguments
through the object returned by the rte_dmadev_prep() call.
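
The prep/enqueue split could look roughly like the sketch below. It is only a
software stand-in under stated assumptions: every sketch_dma_* name is
hypothetical, the "descriptor" is just a saved copy of the attributes, and
memcpy/memset play the role of the hardware. It does show that a NULL handle
leaves the fast path as a plain copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum sketch_dma_op { SKETCH_DMA_OP_COPY, SKETCH_DMA_OP_FILL };

/* Fixed attributes, supplied once on the slow path (hypothetical layout). */
struct sketch_dma_attr {
	uint16_t channel;
	enum sketch_dma_op op;
	uint8_t fill_pattern; /* used only by SKETCH_DMA_OP_FILL */
};

/* Handle returned by prep; a real driver might keep a HW descriptor here. */
struct sketch_dma_handle {
	struct sketch_dma_attr attr;
	int valid;
};

/* Slow path: validate the attributes and build the reusable handle. */
static int
sketch_dma_prep(const struct sketch_dma_attr *attr, struct sketch_dma_handle *out)
{
	if (attr == NULL || out == NULL)
		return -1;
	out->attr = *attr;
	out->valid = 1;
	return 0;
}

/* Fast path: per-transfer arguments plus the optional slow-path handle. */
static int
sketch_dma_enqueue(const void *src, void *dst, unsigned int length,
		   const struct sketch_dma_handle *h)
{
	if (h != NULL && !h->valid)
		return -1;
	if (h != NULL && h->attr.op == SKETCH_DMA_OP_FILL)
		memset(dst, h->attr.fill_pattern, length);
	else
		memcpy(dst, src, length); /* NULL handle: default copy */
	return 0;
}
```

The handle is built once per channel and attribute set, so the per-transfer
call carries only src, dst, length and one pointer.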

>
> So that we could use more generic way to define the API.
>
> >
> >
> >>
> >> /Bruce
> >
> > .
> >
>
Jerin Jacob June 22, 2021, 5:31 p.m. UTC | #32
On Fri, Jun 18, 2021 at 3:25 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Fri, Jun 18, 2021 at 11:22:28AM +0530, Jerin Jacob wrote:
> > On Thu, Jun 17, 2021 at 2:46 PM Bruce Richardson
> > <bruce.richardson@intel.com> wrote:
> > >
> > > On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:
> > > > On Wed, Jun 16, 2021 at 3:47 PM fengchengwen <fengchengwen@huawei.com> wrote:
> > > > >
> > > > > On 2021/6/16 15:09, Morten Brørup wrote:
> > > > > >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > > > >> Sent: Tuesday, 15 June 2021 18.39
> > > > > >>
> > > > > >> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > > > >>> This patch introduces 'dmadevice' which is a generic type of DMA
> > > > > >>> device.
> > > > > >>>
> > > > > >>> The APIs of dmadev library exposes some generic operations which can
> > > > > >>> enable configuration and I/O with the DMA devices.
> > > > > >>>
> > > > > >>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > > > >>> ---
> > > > > >> Thanks for sending this.
> > > > > >>
> > > > > >> Of most interest to me right now are the key data-plane APIs. While we
> > > > > >> are
> > > > > >> still in the prototyping phase, below is a draft of what we are
> > > > > >> thinking
> > > > > >> for the key enqueue/perform_ops/completed_ops APIs.
> > > > > >>
> > > > > >> Some key differences I note in below vs your original RFC:
> > > > > >> * Use of void pointers rather than iova addresses. While using iova's
> > > > > >> makes
> > > > > >>   sense in the general case when using hardware, in that it can work
> > > > > >> with
> > > > > >>   both physical addresses and virtual addresses, if we change the APIs
> > > > > >> to use
> > > > > >>   void pointers instead it will still work for DPDK in VA mode, while
> > > > > >> at the
> > > > > >>   same time allow use of software fallbacks in error cases, and also a
> > > > > >> stub
> > > > > >>   driver than uses memcpy in the background. Finally, using iova's
> > > > > >> makes the
> > > > > >>   APIs a lot more awkward to use with anything but mbufs or similar
> > > > > >> buffers
> > > > > >>   where we already have a pre-computed physical address.
> > > > > >> * Use of id values rather than user-provided handles. Allowing the
> > > > > >> user/app
> > > > > >>   to manage the amount of data stored per operation is a better
> > > > > >> solution, I
> > > > > >>   feel than proscribing a certain about of in-driver tracking. Some
> > > > > >> apps may
> > > > > >>   not care about anything other than a job being completed, while other
> > > > > >> apps
> > > > > >>   may have significant metadata to be tracked. Taking the user-context
> > > > > >>   handles out of the API also makes the driver code simpler.
> > > > > >> * I've kept a single combined API for completions, which differs from
> > > > > >> the
> > > > > >>   separate error handling completion API you propose. I need to give
> > > > > >> the
> > > > > >>   two function approach a bit of thought, but likely both could work.
> > > > > >> If we
> > > > > >>   (likely) never expect failed ops, then the specifics of error
> > > > > >> handling
> > > > > >>   should not matter that much.
> > > > > >>
> > > > > >> For the rest, the control / setup APIs are likely to be rather
> > > > > >> uncontroversial, I suspect. However, I think that rather than xstats
> > > > > >> APIs,
> > > > > >> the library should first provide a set of standardized stats like
> > > > > >> ethdev
> > > > > >> does. If driver-specific stats are needed, we can add xstats later to
> > > > > >> the
> > > > > >> API.
> > > > > >>
> > > > > >> Appreciate your further thoughts on this, thanks.
> > > > > >>
> > > > > >> Regards,
> > > > > >> /Bruce
> > > > > >
> > > > > > I generally agree with Bruce's points above.
> > > > > >
> > > > > > I would like to share a couple of ideas for further discussion:
> > > >
> > > >
> > > > I believe some of the other requirements and comments for generic DMA will be
> > > >
> > > > 1) Support for the _channel_, Each channel may have different
> > > > capabilities and functionalities.
> > > > Typical cases are, each channel have separate source and destination
> > > > devices like
> > > > DMA between PCIe EP to Host memory, Host memory to Host memory, PCIe
> > > > EP to PCIe EP.
> > > > So we need some notion of the channel in the specification.
> > > >
> > >
> > > Can you share a bit more detail on what constitutes a channel in this case?
> > > Is it equivalent to a device queue (which we are flattening to individual
> > > devices in this API), or to a specific configuration on a queue?
> >
> > It not a queue. It is one of the attributes for transfer.
> > I.e in the same queue, for a given transfer it can specify the
> > different "source" and "destination" device.
> > Like CPU to Sound card, CPU to network card etc.
> >
> Ok. Thanks for clarifying. Do you think it's best given as a
> device-specific parameter to the various functions, and NULL for hardware
> that doesn't need it?

Separate functions won't scale, as we will have N channels and M ops.
Things could blow up easily if we have separate functions, and the space
for fast-path function pointers in the dev structure will run out
quickly.

>
> >
> > >
> > > > 2) I assume current data plane APIs are not thread-safe. Is it right?
> > > >
> > > Yes.
> > >
> > > >
> > > > 3) Cookie scheme outlined earlier looks good to me. Instead of having
> > > > generic dequeue() API
> > > >
> > > > 4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
> > > > void * dst, unsigned int length);
> > > > to two stage API like, Where one will be used in fastpath and other
> > > > one will use used in slowpath.
> > > >
> > > > - slowpath API will for take channel and take other attributes for transfer
> > > >
> > > > Example syantx will be:
> > > >
> > > > struct rte_dmadev_desc {
> > > >            channel id;
> > > >            ops ; // copy, xor, fill etc
> > > >           other arguments specific to dma transfer // it can be set
> > > > based on capability.
> > > >
> > > > };
> > > >
> > > > rte_dmadev_desc_t rte_dmadev_preprare(uint16_t dev_id,  struct
> > > > rte_dmadev_desc *dec);
> > > >
> > > > - Fastpath takes arguments that need to change per transfer along with
> > > > slow-path handle.
> > > >
> > > > rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst, unsigned
> > > > int length,  rte_dmadev_desc_t desc)
> > > >
> > > > This will help to driver to
> > > > -Former API form the device-specific descriptors in slow path  for a
> > > > given channel and fixed attributes per transfer
> > > > -Later API blend "variable" arguments such as src, dest address with
> > > > slow-path created descriptors
> > > >
> > >
> > > This seems like an API for a context-aware device, where the channel is the
> > > config data/context that is preserved across operations - is that correct?
> > > At least from the Intel DMA accelerators side, we have no concept of this
> > > context, and each operation is completely self-described. The location or
> > > type of memory for copies is irrelevant, you just pass the src/dst
> > > addresses to reference.
> >
> > it is not context-aware device. Each HW JOB is self-described.
> > You can view it different attributes of transfer.
> >
> >
> > >
> > > > The above will give better performance and is the best trade-off c
> > > > between performance and per transfer variables.
> > >
> > > We may need to have different APIs for context-aware and context-unaware
> > > processing, with which to use determined by the capabilities discovery.
> > > Given that for these DMA devices the offload cost is critical, more so than
> > > any other dev class I've looked at before, I'd like to avoid having APIs
> > > with extra parameters than need to be passed about since that just adds
> > > extra CPU cycles to the offload.
> >
> > If driver does not support additional attributes and/or the
> > application does not need it, rte_dmadev_desc_t can be NULL.
> > So that it won't have any cost in the datapath. I think, we can go to
> > different API
> > cases if we can not abstract problems without performance impact.
> > Otherwise, it will be too much
> > pain for applications.
>
> Ok. Having one extra parameter ignored by some drivers should not be that
> big of a deal. [With all these, we'll only really know for sure when
> implemented and offload cost measured]
>
> >
> > Just to understand, I think, we need to HW capabilities and how to
> > have a common API.
> > I assume HW will have some HW JOB descriptors which will be filled in
> > SW and submitted to HW.
> > In our HW,  Job descriptor has the following main elements
> >
> > - Channel   // We don't expect the application to change per transfer
> > - Source address - It can be scatter-gather too - Will be changed per transfer
> > - Destination address - It can be scatter-gather too - Will be changed
> > per transfer
> > - Transfer Length - - It can be scatter-gather too - Will be changed
> > per transfer
> > - IOVA address where HW post Job completion status PER Job descriptor
> > - Will be changed per transfer
> > - Another sideband information related to channel  // We don't expect
> > the application to change per transfer
> > - As an option, Job completion can be posted as an event to
> > rte_event_queue  too // We don't expect the application to change per
> > transfer
> >
> > @Richardson, Bruce @fengchengwen @Hemant Agrawal
> >
> > Could you share the options for your HW descriptors  which you are
> > planning to expose through API like above so that we can easily
> > converge on fastpath API
> >
> Taking the case of a simple copy op, the parameters we need are:
>
> * src
> * dst
> * length

OK. Is that because no other attributes are supported in your HW, or
because you are not planning to expose them through the DPDK generic DMA
API?


>
> Depending on the specific hardware there will also be passed in the
> descriptor a completion address, but we plan for these cases to always have
> the completions written back to a set location so that we have essentially
> ring-writeback, as with the hardware which doesn't explicitly have a
> separate completion address. Beyond that, I believe the only descriptor
> fields we will use are just the flags field indicating the op type etc.

OK. In our HW, the completion address needs to be an IOVA; that is the
only constraint. The rest looks good to me.


>
> /Bruce
Jerin Jacob June 22, 2021, 5:36 p.m. UTC | #33
On Fri, Jun 18, 2021 at 3:34 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Fri, Jun 18, 2021 at 10:46:08AM +0530, Jerin Jacob wrote:
> > On Thu, Jun 17, 2021 at 1:30 PM Bruce Richardson
> > <bruce.richardson@intel.com> wrote:
> > >
> > > On Thu, Jun 17, 2021 at 01:12:22PM +0530, Jerin Jacob wrote:
> > > > On Thu, Jun 17, 2021 at 12:43 AM Bruce Richardson
> > > > <bruce.richardson@intel.com> wrote:
> > > > >
> > > > > On Wed, Jun 16, 2021 at 11:38:08PM +0530, Jerin Jacob wrote:
> > > > > > On Wed, Jun 16, 2021 at 11:01 PM Bruce Richardson
> > > > > > <bruce.richardson@intel.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > > > > > > > On 2021/6/16 0:38, Bruce Richardson wrote:
> > > > > > > > > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > > > > > > >> This patch introduces 'dmadevice' which is a generic type of DMA
> > > > > > > > >> device.
> > > > > > > > >>
> > > > > > > > >> The APIs of dmadev library exposes some generic operations which can
> > > > > > > > >> enable configuration and I/O with the DMA devices.
> > > > > > > > >>
> > > > > > > > >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > > > > > > >> ---
> > > > > > > > > Thanks for sending this.
> > > > > > > > >
> > > > > > > > > Of most interest to me right now are the key data-plane APIs. While we are
> > > > > > > > > still in the prototyping phase, below is a draft of what we are thinking
> > > > > > > > > for the key enqueue/perform_ops/completed_ops APIs.
> > > > > > > > >
> > > > > > > > > Some key differences I note in below vs your original RFC:
> > > > > > > > > * Use of void pointers rather than iova addresses. While using iova's makes
> > > > > > > > >   sense in the general case when using hardware, in that it can work with
> > > > > > > > >   both physical addresses and virtual addresses, if we change the APIs to use
> > > > > > > > >   void pointers instead it will still work for DPDK in VA mode, while at the
> > > > > > > > >   same time allow use of software fallbacks in error cases, and also a stub
> > > > > > > > >   driver than uses memcpy in the background. Finally, using iova's makes the
> > > > > > > > >   APIs a lot more awkward to use with anything but mbufs or similar buffers
> > > > > > > > >   where we already have a pre-computed physical address.
> > > > > > > >
> > > > > > > > The iova is an hint to application, and widely used in DPDK.
> > > > > > > > If switch to void, how to pass the address (iova or just va ?)
> > > > > > > > this may introduce implementation dependencies here.
> > > > > > > >
> > > > > > > > Or always pass the va, and the driver performs address translation, and this
> > > > > > > > translation may cost too much cpu I think.
> > > > > > > >
> > > > > > >
> > > > > > > On the latter point, about driver doing address translation I would agree.
> > > > > > > However, we probably need more discussion about the use of iova vs just
> > > > > > > virtual addresses. My thinking on this is that if we specify the API using
> > > > > > > iovas it will severely hurt usability of the API, since it forces the user
> > > > > > > to take more inefficient codepaths in a large number of cases. Given a
> > > > > > > pointer to the middle of an mbuf, one cannot just pass that straight as an
> > > > > > > iova but must instead do a translation into offset from mbuf pointer and
> > > > > > > then readd the offset to the mbuf base address.
> > > > > > >
> > > > > > > My preference therefore is to require the use of an IOMMU when using a
> > > > > > > dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> > > > > > > present, DPDK will run in VA mode, allowing virtual addresses to our
> > > > > > > hugepage memory to be sent directly to hardware. Also, when using
> > > > > > > dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> > > > > > > management for the app, removing further the restrictions on what memory
> > > > > > > can be addressed by hardware.
> > > > > >
> > > > > >
> > > > > > One issue of keeping void * is that memory can come from stack or heap .
> > > > > > which HW can not really operate it on.
> > > > >
> > > > > when kernel driver is managing the IOMMU all process memory can be worked
> > > > > on, not just hugepage memory, so using iova is wrong in these cases.
> > > >
> > > > But not for stack and heap memory. Right?
> > > >
> > > Yes, even stack and heap can be accessed.
> >
> > The HW device cannot as that memory is NOT mapped to IOMMU. It will
> > result in the transaction
> > fault.
> >
>
> Not if the kernel driver rather than DPDK is managing the IOMMU:
> https://www.kernel.org/doc/html/latest/x86/sva.html
> "Shared Virtual Addressing (SVA) allows the processor and device to use the
> same virtual addresses avoiding the need for software to translate virtual
> addresses to physical addresses. SVA is what PCIe calls Shared Virtual
> Memory (SVM)."

Thanks for the info. Looks like a cool x86 arch feature. However, we
don't have this feature.

>
> > At least, In octeon, DMA HW job descriptor will have a pointer (IOVA)
> > which will be updated by _HW_
> > upon copy job completion. That memory can not be from the
> > heap(malloc()) or stack as those are not
> > mapped by IOMMU.
> >
> >
> > >
> > > > >
> > > > > As I previously said, using iova prevents the creation of a pure software
> > > > > dummy driver too using memcpy in the background.
> > > >
> > > > Why ? the memory alloced uing rte_alloc/rte_memzone etc can be touched by CPU.
> > > >
> > > Yes, but it can't be accessed using physical address, so again only VA mode
> > > where iova's are "void *" make sense.
> >
> > I agree that it should be a physical address. My only concern that
> > void * does not express
> > it can not be from stack/heap. If API tells the memory need to
> > allotted by rte_alloc() or rte_memzone() etc
> > is fine with me.
> >
> That could be a capability field too. Hardware supporting SVA/SVM does not
> have this limitation so can specify that any virtual address may be used.
>
> I suppose it really doesn't matter whether the APIs are written to take
> pointers or iova's so long as the restrictions are clear. Since iova is the
> default for other HW ops, I'm ok for functions to take params as iovas and
> have the capability definitons provide the info to the user that in some
> cases virtual addresses can be used.

OK.
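
For reference, the agreed direction (parameters typed as iova, plus a
capability telling the application when a plain virtual address may be
passed) could look roughly like the sketch below. Everything here is an
assumption for illustration: DMA_CAP_VA_AS_IOVA, struct dma_info,
dma_addr() and demo_xlate() are made-up names, and dma_iova_t only stands
in for rte_iova_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_iova_t; /* stand-in for rte_iova_t */

/* Hypothetical capability bit: the device accepts any virtual address. */
#define DMA_CAP_VA_AS_IOVA (1u << 0)

/* Hypothetical result of a capability query. */
struct dma_info {
	uint32_t caps;
};

/* Translation callback used when virtual addresses are not usable. */
typedef dma_iova_t (*dma_virt2iova_fn)(const void *va);

/* Demo-only translator: adds a fixed offset, not a real page-table walk. */
static dma_iova_t demo_xlate(const void *va)
{
	return (dma_iova_t)(uintptr_t)va + 0x1000;
}

/* Pick the address to hand to the device based on the capability. */
static dma_iova_t dma_addr(const struct dma_info *info, const void *va,
			   dma_virt2iova_fn xlate)
{
	if (info->caps & DMA_CAP_VA_AS_IOVA)
		return (dma_iova_t)(uintptr_t)va; /* plain cast, no lookup */
	assert(xlate != NULL);
	return xlate(va);
}
```

An application would check the capability once at setup and keep the
translation off the fast path whenever the cast is allowed.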
Jerin Jacob June 22, 2021, 5:51 p.m. UTC | #34
On Fri, Jun 18, 2021 at 2:22 PM fengchengwen <fengchengwen@huawei.com> wrote:
>
> On 2021/6/17 22:18, Bruce Richardson wrote:
> > On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
> >> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
> >>> On 2021/6/17 1:31, Bruce Richardson wrote:
> >>>> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> >>>>> On 2021/6/16 0:38, Bruce Richardson wrote:
> >>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
> >>>>>>> device.
> >>>>>>>
> >>>>>>> The APIs of dmadev library exposes some generic operations which can
> >>>>>>> enable configuration and I/O with the DMA devices.
> >>>>>>>
> >>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> >>>>>>> ---
> >>>>>> Thanks for sending this.
> >>>>>>
> >>>>>> Of most interest to me right now are the key data-plane APIs. While we are
> >>>>>> still in the prototyping phase, below is a draft of what we are thinking
> >>>>>> for the key enqueue/perform_ops/completed_ops APIs.
> >>>>>>
> >>>>>> Some key differences I note in below vs your original RFC:
> >>>>>> * Use of void pointers rather than iova addresses. While using iova's makes
> >>>>>>   sense in the general case when using hardware, in that it can work with
> >>>>>>   both physical addresses and virtual addresses, if we change the APIs to use
> >>>>>>   void pointers instead it will still work for DPDK in VA mode, while at the
> >>>>>>   same time allow use of software fallbacks in error cases, and also a stub
> >>>>>>   driver than uses memcpy in the background. Finally, using iova's makes the
> >>>>>>   APIs a lot more awkward to use with anything but mbufs or similar buffers
> >>>>>>   where we already have a pre-computed physical address.
> >>>>>
> >>>>> The iova is an hint to application, and widely used in DPDK.
> >>>>> If switch to void, how to pass the address (iova or just va ?)
> >>>>> this may introduce implementation dependencies here.
> >>>>>
> >>>>> Or always pass the va, and the driver performs address translation, and this
> >>>>> translation may cost too much cpu I think.
> >>>>>
> >>>>
> >>>> On the latter point, about driver doing address translation I would agree.
> >>>> However, we probably need more discussion about the use of iova vs just
> >>>> virtual addresses. My thinking on this is that if we specify the API using
> >>>> iovas it will severely hurt usability of the API, since it forces the user
> >>>> to take more inefficient codepaths in a large number of cases. Given a
> >>>> pointer to the middle of an mbuf, one cannot just pass that straight as an
> >>>> iova but must instead do a translation into offset from mbuf pointer and
> >>>> then readd the offset to the mbuf base address.
> >>>>
> >>>> My preference therefore is to require the use of an IOMMU when using a
> >>>> dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> >>>> present, DPDK will run in VA mode, allowing virtual addresses to our
> >>>> hugepage memory to be sent directly to hardware. Also, when using
> >>>> dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> >>>> management for the app, removing further the restrictions on what memory
> >>>> can be addressed by hardware.
> >>>
> >>> Some DMA devices many don't support IOMMU or IOMMU bypass default, so driver may
> >>> should call rte_mem_virt2phy() do the address translate, but the rte_mem_virt2phy()
> >>> cost too many CPU cycles.
> >>>
> >>> If the API defined as iova, it will work fine in:
> >>> 1) If DMA don't support IOMMU or IOMMU bypass, then start application with
> >>>    --iova-mode=pa
> >>> 2) If DMA support IOMMU, --iova-mode=pa/va work both fine
> >>>
> >>
> >> I suppose if we keep the iova as the datatype, we can just cast "void *"
> >> pointers to that in the case that virtual addresses can be used directly. I
> >> believe your RFC included a capability query API - "uses void * as iova"
> >> should probably be one of those capabilities, and that would resolve this.
> >> If DPDK is in iova=va mode because of the presence of an iommu, all drivers
> >> could report this capability too.
> >>
> >>>>
> >>>>>> * Use of id values rather than user-provided handles. Allowing the user/app
> >>>>>>   to manage the amount of data stored per operation is a better solution, I
> >>>>>>   feel than proscribing a certain about of in-driver tracking. Some apps may
> >>>>>>   not care about anything other than a job being completed, while other apps
> >>>>>>   may have significant metadata to be tracked. Taking the user-context
> >>>>>>   handles out of the API also makes the driver code simpler.
> >>>>>
> >>>>> The user-provided handle was mainly used to simply application implementation,
> >>>>> It provides the ability to quickly locate contexts.
> >>>>>
> >>>>> The "use of id values" seem like the dma_cookie of Linux DMA engine framework,
> >>>>> user will get a unique dma_cookie after calling dmaengine_submit(), and then
> >>>>> could use it to call dma_async_is_tx_complete() to get completion status.
> >>>>>
> >>>>
> >>>> Yes, the idea of the id is the same - to locate contexts. The main
> >>>> difference is that if we have the driver manage contexts or pointer to
> >>>> contexts, as well as giving more work to the driver, it complicates the APIs
> >>>> for measuring completions. If we use an ID-based approach, where the app
> >>>> maintains its own ring of contexts (if any), it avoids the need to have an
> >>>> "out" parameter array for returning those contexts, which needs to be
> >>>> appropriately sized. Instead we can just report that all ids up to N are
> >>>> completed. [This would be similar to your suggestion that N jobs be
> >>>> reported as done, in that no contexts are provided, it's just that knowing
> >>>> the ID of what is completed is generally more useful than the number (which
> >>>> can be obviously got by subtracting the old value)]
> >>>>
> >>>> We are still working on prototyping all this, but would hope to have a
> >>>> functional example of all this soon.
> >>>>
> >>>>> How about define the copy prototype as following:
> >>>>>   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
> >>>>> while the dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
> >>>>> enqueue successful else fail.
> >>>>> when complete the dmadev will return latest completed dma_cookie, and the
> >>>>> application could use the dma_cookie to quick locate contexts.
> >>>>>
> >>>>
> >>>> If I understand this correctly, I believe this is largely what I was
> >>>> suggesting - just with the typedef for the type? In which case it obviously
> >>>> looks good to me.
> >>>>
> >>>>>> * I've kept a single combined API for completions, which differs from the
> >>>>>>   separate error handling completion API you propose. I need to give the
> >>>>>>   two function approach a bit of thought, but likely both could work. If we
> >>>>>>   (likely) never expect failed ops, then the specifics of error handling
> >>>>>>   should not matter that much.
> >>>>>
> >>>>> The rte_ioat_completed_ops API is too complex, and consider some applications
> >>>>> may never copy fail, so split them as two API.
> >>>>> It's indeed not friendly to other scenarios that always require error handling.
> >>>>>
> >>>>> I prefer use completed operations number as return value other than the ID so
> >>>>> that application could simple judge whether have new completed operations, and
> >>>>> the new prototype:
> >>>>>  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> >>>>>
> >>>>> 1) for normal case which never expect failed ops:
> >>>>>    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
> >>>>> 2) for other case:
> >>>>>    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
> >>>>>    at this point the fails <= ret <= max_status
> >>>>>
> >>>> Completely agree that we need to plan for the happy-day case where all is
> >>>> passing. Looking at the prototypes you have above, I am ok with returning
> >>>> number of completed ops as the return value with the final completed cookie
> >>>> as an "out" parameter.
> >>>> For handling errors, I'm ok with what you propose above, just with one
> >>>> small adjustment - I would remove the restriction that ret <= max_status.
> >>>>
> >>>> In case of zero-failures, we can report as many ops succeeding as we like,
> >>>> and even in case of failure, we can still report as many successful ops as
> >>>> we like before we start filling in the status field. For example, if 32 ops
> >>>> are completed, and the last one fails, we can just fill in one entry into
> >>>> status, and return 32. Alternatively if the 4th last one fails we fill in 4
> >>>> entries and return 32. The only requirements would be:
> >>>> * fails <= max_status
> >>>> * fails <= ret
> >>>> * cookie holds the id of the last entry in status.
> >>>
> >>> I think we understand the same:
> >>>
> >>> The fails <= ret <= max_status include following situation:
> >>> 1) If max_status is 32, and there are 32 completed ops, then the ret will be 32
> >>> no matter which ops is failed
> >>> 2) If max_status is 33, and there are 32 completed ops, then the ret will be 32
> >>> 3) If max_status is 16, and there are 32 completed ops, then the ret will be 16
> >>>
> >>> and the cookie always hold the id of the last returned completed ops, no matter
> >>> it's completed successful or failed
> >>>
> >>
> >> I actually disagree on the #3. If max_status is 16, there are 32 completed
> >> ops, and *no failures* the ret will be 32, not 16, because we are not
> >> returning any status entries so max_status need not apply. Keeping that
> >> same scenario #3, depending on the number of failures and the point of
> >> them, the return value may similarly vary, for example:
> >> * if job #28 fails, then ret could still be 32, cookie would be the cookie
> >>   for that job, "fails" parameter would return as 4, with status holding the
> >>   failure of 28 plus the succeeded status of jobs 29-31, i.e. 4 elements.
> >> * if job #5 fails, then we can't fit the status list from 5 though 31 in an
> >>   array of 16, so "fails" == 16(max_status) and status contains the 16
> >>   statuses starting from #5, which means that cookie contains the value for
> >>   job #20 and ret is 21.
> >>
> >> In other words, ignore max_status and status parameters *unless we have an
> >> error to return*, meaning the fast-path/happy-day case works as fast as
> >> possible. You don't need to worry about sizing your status array to be big,
> >> and you always get back a large number of completions when available. Your
> >> fastpath code only need check the "fails" parameter to see if status needs
> >> to ever be consulted, and in normal case it doesn't.
> >>
> >> If this is too complicated, maybe we can simplify a little by returning just
> >> one failure at a time, though at the cost of making error handling slower?
> >>
> >> rte_dmadev_completed(dev_id, &cookie, &failure_status)
> >>
> >> In this case, we always return the number of completed ops on success,
> >> while on failure, we return the first error code. For a single error, this
> >> works fine, but if we get a burst of errors together, things will work
> >> slower - which may be acceptable if errors are very rare. However, for idxd
> >> at least if a fence occurs after a failure all jobs in the batch after the
> >> fence would be skipped, which would lead to the "burst of errors" case.
> >> Therefore, I'd prefer to have the original suggestion allowing multiple
> >> errors to be reported at a time.
> >>
> >> /Bruce
> >
> > Apologies for self-reply, but thinking about it more, a combination of
> > normal-case and error-case APIs may be just simpler:
> >
> > int rte_dmadev_completed(dev_id, &cookie)
> >
> > returns number of items completed and cookie of last item. If there is an
> > error, returns all successfull values up to the error entry and returns -1
> > on subsequent call.
> >
> > int rte_dmadev_completed_status(dev_id, &cookie, max_status, status_array,
> >       &error_count)
> >
> > this is a slower completion API which behaves like you originally said
> > above, returning number of completions x, 0 <= x <= max_status, with x
> > status values filled into array, and the number of unsuccessful values in
> > the error_count value.
> >
> > This would allow code to be written in the application to use
> > rte_dmadev_completed() in the normal case, and on getting a "-1" value, use
> > rte_dmadev_completed_status() to get the error details. If strings of
> > errors might be expected, the app can continually use the
> > completed_status() function until error_count returns 0, and then switch
> > back to the faster/simpler version.
>
> This two-function simplify the status_array's maintenance because we don't need init it to zero.
> I think it's a good trade-off between performance and rich error info (status code).
>
> Here I'd like to discuss the 'burst size', which is widely used in DPDK application (e.g.
> nic polling or ring en/dequeue).
> Currently we don't define a max completed ops in rte_dmadev_completed() API, the return
> value may greater than 'burst size' of application, this may result in the application need to
> maintain (or remember) the return value of the function and special handling at the next poll.
>
> Also consider there may multiple calls rte_dmadev_completed to check fail, it may make it
> difficult for the application to use.
>
> So I prefer following prototype:
>   uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error)
>     -- nb_cpls: indicate max process operations number
>     -- has_error: indicate if there is an error
>     -- return value: the number of successful completed operations.
>     -- example:
>        1) If there are already 32 completed ops, and 4th is error, and nb_cpls is 32, then
>           the ret will be 3(because 1/2/3th is OK), and has_error will be true.
>        2) If there are already 32 completed ops, and all successful completed, then the ret
>           will be min(32, nb_cpls), and has_error will be false.
>        3) If there are already 32 completed ops, and all failed completed, then the ret will
>           be 0, and has_error will be true.
>   uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_status, uint32_t *status)
>     -- return value: the number of failed completed operations.



In typical storage use cases, the application sometimes needs to
provide a scatter-gather list.
At least in our hardware, an sg list gives a "single completion result",
and it stops on the first failure so that the application can restart
the transfer. Have you thought about the scatter-gather use
case and how it works in other HW?

A prototype like the following works for us:
rte_dmadev_enq_sg(void **src, void **dest, unsigned int **length, int
nb_segments, cookie, ...)


>
> The application use the following invocation order when polling:
>   has_error = false; // could be init to false by dmadev API, we need discuss
>   ret = rte_dmadev_completed(dev_id, &cookie, burst_size, &has_error);
>   // process successful completed case:
>   for (int i = 0; i < ret; i++) {
>   }
>   if (unlikely(has_error)) {
>     // process failed completed case
>     ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret, status_array);
>     for (int i = 0; i < ret; i++) {
>       // ...
>     }
>   }
>
>
> >
> > This two-function approach also allows future support for other DMA
> > functions such as comparison, where a status value is always required. Any
> > apps using that functionality would just always use the "_status" function
> > for completions.
> >
> > /Bruce
> >
> > .
> >
>
Bruce Richardson June 22, 2021, 7:17 p.m. UTC | #35
On Tue, Jun 22, 2021 at 11:01:47PM +0530, Jerin Jacob wrote:
> On Fri, Jun 18, 2021 at 3:25 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > >
> > Taking the case of a simple copy op, the parameters we need are:
> >
> > * src
> > * dst
> > * length
> 
> OK. Is it the case where no other attribute that supported in HW or
> you are not planning to
> expose that through DPDK generic DMA API.
> 
The only other parameters that might be needed can all be specified as flags,
so all we need for a copy op is a general flags field for future expansion.

> >
> > Depending on the specific hardware there will also be passed in the
> > descriptor a completion address, but we plan for these cases to always have
> > the completions written back to a set location so that we have essentially
> > ring-writeback, as with the hardware which doesn't explicitly have a
> > separate completion address. Beyond that, I believe the only descriptor
> > fields we will use are just the flags field indicating the op type etc.
> 
> OK. In HW, we need to have IOVA for completion address that's only the
> constraint. rest looks good to me.
> 
That's like what we have, but I was not planning on exposing the completion
address through the API at all. Instead it would be internal to the driver,
with the "completion" APIs just informing the app what is done or not. If we
expose completion addresses, the app is left having to parse different
completion formats, so it needs to be internal IMHO.
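
A minimal sketch of that idea, assuming a software-only stand-in for the
hardware: completion records live in a driver-private ring at a fixed
write-back location, and the app only ever sees a count plus the last
completed cookie. Every name here (sketch_dev, sketch_enqueue,
sketch_hw_complete, sketch_completed) is hypothetical and not part of any
proposed API; cookie wraparound is ignored for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_RING_SIZE 64	/* hypothetical ring depth */

/* Driver-private completion record; its format never leaves the driver. */
struct sketch_cpl {
	uint8_t done;	/* set by the "hardware" write-back */
};

struct sketch_dev {
	struct sketch_cpl ring[SKETCH_RING_SIZE];	/* fixed write-back area */
	int32_t next_submit;	/* cookie the next enqueue will receive */
	int32_t next_cpl;	/* first cookie not yet reported to the app */
};

/* Returns a monotonically increasing cookie (>= 0), or -1 if the ring is full. */
int32_t sketch_enqueue(struct sketch_dev *dev)
{
	if (dev->next_submit - dev->next_cpl >= SKETCH_RING_SIZE)
		return -1;
	dev->ring[dev->next_submit % SKETCH_RING_SIZE].done = 0;
	return dev->next_submit++;
}

/* Stand-in for the hardware writing back one completion. */
void sketch_hw_complete(struct sketch_dev *dev, int32_t cookie)
{
	assert(cookie >= dev->next_cpl && cookie < dev->next_submit);
	dev->ring[cookie % SKETCH_RING_SIZE].done = 1;
}

/*
 * Reports how many ops completed in order since the last call and stores
 * the cookie of the last one. *last_cookie is untouched when 0 is returned.
 */
uint16_t sketch_completed(struct sketch_dev *dev, int32_t *last_cookie)
{
	uint16_t n = 0;

	while (dev->next_cpl < dev->next_submit &&
	       dev->ring[dev->next_cpl % SKETCH_RING_SIZE].done) {
		*last_cookie = dev->next_cpl++;
		n++;
	}
	return n;
}
```

The app never touches struct sketch_cpl, so a driver whose hardware uses a
different completion format could change it without affecting callers.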

/Bruce
fengchengwen June 23, 2021, 3:30 a.m. UTC | #36
On 2021/6/23 1:25, Jerin Jacob wrote:
> On Fri, Jun 18, 2021 at 3:11 PM fengchengwen <fengchengwen@huawei.com> wrote:
>>
>> On 2021/6/18 13:52, Jerin Jacob wrote:
>>> On Thu, Jun 17, 2021 at 2:46 PM Bruce Richardson
>>> <bruce.richardson@intel.com> wrote:
>>>>
>>>> On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:
>>>>> On Wed, Jun 16, 2021 at 3:47 PM fengchengwen <fengchengwen@huawei.com> wrote:
>>>>>>
>>>>>> On 2021/6/16 15:09, Morten Brørup wrote:
>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
>>>>>>>> Sent: Tuesday, 15 June 2021 18.39
>>>>>>>>
>>>>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
>>>>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
>>>>>>>>> device.
>>>>>>>>>
>>>>>>>>> The APIs of dmadev library exposes some generic operations which can
>>>>>>>>> enable configuration and I/O with the DMA devices.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>>>>>>>>> ---
>>>>>>>> Thanks for sending this.
>>>>>>>>
>>>>>>>> Of most interest to me right now are the key data-plane APIs. While we
>>>>>>>> are
>>>>>>>> still in the prototyping phase, below is a draft of what we are
>>>>>>>> thinking
>>>>>>>> for the key enqueue/perform_ops/completed_ops APIs.
>>>>>>>>
>>>>>>>> Some key differences I note in below vs your original RFC:
>>>>>>>> * Use of void pointers rather than iova addresses. While using iova's
>>>>>>>> makes
>>>>>>>>   sense in the general case when using hardware, in that it can work
>>>>>>>> with
>>>>>>>>   both physical addresses and virtual addresses, if we change the APIs
>>>>>>>> to use
>>>>>>>>   void pointers instead it will still work for DPDK in VA mode, while
>>>>>>>> at the
>>>>>>>>   same time allow use of software fallbacks in error cases, and also a
>>>>>>>> stub
>>>>>>>>   driver than uses memcpy in the background. Finally, using iova's
>>>>>>>> makes the
>>>>>>>>   APIs a lot more awkward to use with anything but mbufs or similar
>>>>>>>> buffers
>>>>>>>>   where we already have a pre-computed physical address.
>>>>>>>> * Use of id values rather than user-provided handles. Allowing the
>>>>>>>> user/app
>>>>>>>>   to manage the amount of data stored per operation is a better
>>>>>>>> solution, I
>>>>>>>>   feel than proscribing a certain about of in-driver tracking. Some
>>>>>>>> apps may
>>>>>>>>   not care about anything other than a job being completed, while other
>>>>>>>> apps
>>>>>>>>   may have significant metadata to be tracked. Taking the user-context
>>>>>>>>   handles out of the API also makes the driver code simpler.
>>>>>>>> * I've kept a single combined API for completions, which differs from
>>>>>>>> the
>>>>>>>>   separate error handling completion API you propose. I need to give
>>>>>>>> the
>>>>>>>>   two function approach a bit of thought, but likely both could work.
>>>>>>>> If we
>>>>>>>>   (likely) never expect failed ops, then the specifics of error
>>>>>>>> handling
>>>>>>>>   should not matter that much.
>>>>>>>>
>>>>>>>> For the rest, the control / setup APIs are likely to be rather
>>>>>>>> uncontroversial, I suspect. However, I think that rather than xstats
>>>>>>>> APIs,
>>>>>>>> the library should first provide a set of standardized stats like
>>>>>>>> ethdev
>>>>>>>> does. If driver-specific stats are needed, we can add xstats later to
>>>>>>>> the
>>>>>>>> API.
>>>>>>>>
>>>>>>>> Appreciate your further thoughts on this, thanks.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> /Bruce
>>>>>>>
>>>>>>> I generally agree with Bruce's points above.
>>>>>>>
>>>>>>> I would like to share a couple of ideas for further discussion:
>>>>>
>>>>>
>>>>> I believe some of the other requirements and comments for generic DMA will be
>>>>>
>>>>> 1) Support for the _channel_, Each channel may have different
>>>>> capabilities and functionalities.
>>>>> Typical cases are, each channel have separate source and destination
>>>>> devices like
>>>>> DMA between PCIe EP to Host memory, Host memory to Host memory, PCIe
>>>>> EP to PCIe EP.
>>>>> So we need some notion of the channel in the specification.
>>>>>
>>>>
>>>> Can you share a bit more detail on what constitutes a channel in this case?
>>>> Is it equivalent to a device queue (which we are flattening to individual
>>>> devices in this API), or to a specific configuration on a queue?
>>>
>>> It not a queue. It is one of the attributes for transfer.
>>> I.e in the same queue, for a given transfer it can specify the
>>> different "source" and "destination" device.
>>> Like CPU to Sound card, CPU to network card etc.
>>>
>>>
>>>>
>>>>> 2) I assume current data plane APIs are not thread-safe. Is it right?
>>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> 3) Cookie scheme outlined earlier looks good to me. Instead of having
>>>>> generic dequeue() API
>>>>>
>>>>> 4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
>>>>> void * dst, unsigned int length);
>>>>> to two stage API like, Where one will be used in fastpath and other
>>>>> one will use used in slowpath.
>>>>>
>>>>> - slowpath API will for take channel and take other attributes for transfer
>>>>>
>>>>> Example syantx will be:
>>>>>
>>>>> struct rte_dmadev_desc {
>>>>>            channel id;
>>>>>            ops ; // copy, xor, fill etc
>>>>>           other arguments specific to dma transfer // it can be set
>>>>> based on capability.
>>>>>
>>>>> };
>>>>>
>>>>> rte_dmadev_desc_t rte_dmadev_preprare(uint16_t dev_id,  struct
>>>>> rte_dmadev_desc *dec);
>>>>>
>>>>> - Fastpath takes arguments that need to change per transfer along with
>>>>> slow-path handle.
>>>>>
>>>>> rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst, unsigned
>>>>> int length,  rte_dmadev_desc_t desc)
>>>>>
>>>>> This will help to driver to
>>>>> -Former API form the device-specific descriptors in slow path  for a
>>>>> given channel and fixed attributes per transfer
>>>>> -Later API blend "variable" arguments such as src, dest address with
>>>>> slow-path created descriptors
>>>>>
>>>>
>>>> This seems like an API for a context-aware device, where the channel is the
>>>> config data/context that is preserved across operations - is that correct?
>>>> At least from the Intel DMA accelerators side, we have no concept of this
>>>> context, and each operation is completely self-described. The location or
>>>> type of memory for copies is irrelevant, you just pass the src/dst
>>>> addresses to reference.
>>>
>>> it is not context-aware device. Each HW JOB is self-described.
>>> You can view it different attributes of transfer.
>>>
>>>
>>>>
>>>>> The above will give better performance and is the best trade-off c
>>>>> between performance and per transfer variables.
>>>>
>>>> We may need to have different APIs for context-aware and context-unaware
>>>> processing, with which to use determined by the capabilities discovery.
>>>> Given that for these DMA devices the offload cost is critical, more so than
>>>> any other dev class I've looked at before, I'd like to avoid having APIs
>>>> with extra parameters than need to be passed about since that just adds
>>>> extra CPU cycles to the offload.
>>>
>>> If driver does not support additional attributes and/or the
>>> application does not need it, rte_dmadev_desc_t can be NULL.
>>> So that it won't have any cost in the datapath. I think, we can go to
>>> different API
>>> cases if we can not abstract problems without performance impact.
>>> Otherwise, it will be too much
>>> pain for applications.
>>
>> Yes, currently we plan to use different API for different case, e.g.
>>   rte_dmadev_memcpy()  -- deal with local to local memcopy
>>   rte_dmadev_memset()  -- deal with fill with local memory with pattern
>> maybe:
>>   rte_dmadev_imm_data()  --deal with copy very little data
>>   rte_dmadev_p2pcopy()   --deal with peer-to-peer copy of diffenet PCIE addr
>>
>> These API capabilities will be reflected in the device capability set so that
>> application could know by standard API.
> 
> 
> There will be a lot of combination of that it will be like M x N cross
> base case, It won't scale.

Currently it is hard to define a generic DMA descriptor, so I think the
well-defined APIs are the feasible approach.
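
As a rough sketch of how the per-operation APIs could be paired with a
capability set, assuming one capability bit per operation: the bit names,
struct sketch_dev_info and sketch_supports() below are all hypothetical and
only mirror the four operations listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits, one per well-defined operation. */
#define SKETCH_CAP_MEMCPY	(UINT64_C(1) << 0)	/* local to local copy */
#define SKETCH_CAP_MEMSET	(UINT64_C(1) << 1)	/* fill memory with a pattern */
#define SKETCH_CAP_IMM_DATA	(UINT64_C(1) << 2)	/* copy of very small data */
#define SKETCH_CAP_P2PCOPY	(UINT64_C(1) << 3)	/* peer-to-peer PCIe copy */

/* Hypothetical device info as an app might get it from an info query. */
struct sketch_dev_info {
	uint64_t capa;
};

/* True only if the device reports every capability bit in 'caps'. */
bool sketch_supports(const struct sketch_dev_info *info, uint64_t caps)
{
	assert(info != NULL);
	return (info->capa & caps) == caps;
}
```

An app would check the bits once at setup time and then call only the
matching operation functions in its data path.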

> 
>>
>>>
>>> Just to understand, I think, we need to HW capabilities and how to
>>> have a common API.
>>> I assume HW will have some HW JOB descriptors which will be filled in
>>> SW and submitted to HW.
>>> In our HW,  Job descriptor has the following main elements
>>>
>>> - Channel   // We don't expect the application to change per transfer
>>> - Source address - It can be scatter-gather too - Will be changed per transfer
>>> - Destination address - It can be scatter-gather too - Will be changed
>>> per transfer
>>> - Transfer Length - - It can be scatter-gather too - Will be changed
>>> per transfer
>>> - IOVA address where HW post Job completion status PER Job descriptor
>>> - Will be changed per transfer
>>> - Another sideband information related to channel  // We don't expect
>>> the application to change per transfer
>>> - As an option, Job completion can be posted as an event to
>>> rte_event_queue  too // We don't expect the application to change per
>>> transfer
>>
>> The 'option' field looks like a software interface field, but not HW descriptor.
> 
> It is in HW descriptor.

The HW is interesting. It sounds like the DMA can send completions directly to
the EventHWQueue, with the DMA and EventHWQueue linked in hardware rather than by software.

Could you point to a public driver for this HW, so we can learn more about its working
mechanism and the software-hardware collaboration?

> 
>>
>>>
>>> @Richardson, Bruce @fengchengwen @Hemant Agrawal
>>>
>>> Could you share the options for your HW descriptors  which you are
>>> planning to expose through API like above so that we can easily
>>> converge on fastpath API
>>>
>>
>> Kunpeng HW descriptor is self-describing, and don't need refer context info.
>>
>> Maybe the fields which was fix with some transfer type could setup by driver, and
>> don't expose to application.
> 
> Yes. I agree.I think, that reason why I though to have
> rte_dmadev_prep() call to convert DPDK DMA transfer attributes to HW
> specific descriptors
> and have single enq() operation with variable argument(through enq
> parameter) and fix argumenents through rte_dmadev_prep() call object.
> 
>>
>> So that we could use more generic way to define the API.
>>
>>>
>>>
>>>>
>>>> /Bruce
>>>
>>> .
>>>
>>
> 
> .
>
fengchengwen June 23, 2021, 3:50 a.m. UTC | #37
On 2021/6/23 1:51, Jerin Jacob wrote:
> On Fri, Jun 18, 2021 at 2:22 PM fengchengwen <fengchengwen@huawei.com> wrote:
>>
>> On 2021/6/17 22:18, Bruce Richardson wrote:
>>> On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
>>>> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
>>>>> On 2021/6/17 1:31, Bruce Richardson wrote:
>>>>>> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
>>>>>>> On 2021/6/16 0:38, Bruce Richardson wrote:
>>>>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
>>>>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
>>>>>>>>> device.
>>>>>>>>>
>>>>>>>>> The APIs of dmadev library exposes some generic operations which can
>>>>>>>>> enable configuration and I/O with the DMA devices.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>>>>>>>>> ---
>>>>>>>> Thanks for sending this.
>>>>>>>>
>>>>>>>> Of most interest to me right now are the key data-plane APIs. While we are
>>>>>>>> still in the prototyping phase, below is a draft of what we are thinking
>>>>>>>> for the key enqueue/perform_ops/completed_ops APIs.
>>>>>>>>
>>>>>>>> Some key differences I note in below vs your original RFC:
>>>>>>>> * Use of void pointers rather than iova addresses. While using iova's makes
>>>>>>>>   sense in the general case when using hardware, in that it can work with
>>>>>>>>   both physical addresses and virtual addresses, if we change the APIs to use
>>>>>>>>   void pointers instead it will still work for DPDK in VA mode, while at the
>>>>>>>>   same time allow use of software fallbacks in error cases, and also a stub
>>>>>>>>   driver than uses memcpy in the background. Finally, using iova's makes the
>>>>>>>>   APIs a lot more awkward to use with anything but mbufs or similar buffers
>>>>>>>>   where we already have a pre-computed physical address.
>>>>>>>
>>>>>>> The iova is an hint to application, and widely used in DPDK.
>>>>>>> If switch to void, how to pass the address (iova or just va ?)
>>>>>>> this may introduce implementation dependencies here.
>>>>>>>
>>>>>>> Or always pass the va, and the driver performs address translation, and this
>>>>>>> translation may cost too much cpu I think.
>>>>>>>
>>>>>>
>>>>>> On the latter point, about driver doing address translation I would agree.
>>>>>> However, we probably need more discussion about the use of iova vs just
>>>>>> virtual addresses. My thinking on this is that if we specify the API using
>>>>>> iovas it will severely hurt usability of the API, since it forces the user
>>>>>> to take more inefficient codepaths in a large number of cases. Given a
>>>>>> pointer to the middle of an mbuf, one cannot just pass that straight as an
>>>>>> iova but must instead do a translation into offset from mbuf pointer and
>>>>>> then readd the offset to the mbuf base address.
>>>>>>
>>>>>> My preference therefore is to require the use of an IOMMU when using a
>>>>>> dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
>>>>>> present, DPDK will run in VA mode, allowing virtual addresses to our
>>>>>> hugepage memory to be sent directly to hardware. Also, when using
>>>>>> dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
>>>>>> management for the app, removing further the restrictions on what memory
>>>>>> can be addressed by hardware.
>>>>>
>>>>> Some DMA devices many don't support IOMMU or IOMMU bypass default, so driver may
>>>>> should call rte_mem_virt2phy() do the address translate, but the rte_mem_virt2phy()
>>>>> cost too many CPU cycles.
>>>>>
>>>>> If the API defined as iova, it will work fine in:
>>>>> 1) If DMA don't support IOMMU or IOMMU bypass, then start application with
>>>>>    --iova-mode=pa
>>>>> 2) If DMA support IOMMU, --iova-mode=pa/va work both fine
>>>>>
>>>>
>>>> I suppose if we keep the iova as the datatype, we can just cast "void *"
>>>> pointers to that in the case that virtual addresses can be used directly. I
>>>> believe your RFC included a capability query API - "uses void * as iova"
>>>> should probably be one of those capabilities, and that would resolve this.
>>>> If DPDK is in iova=va mode because of the presence of an iommu, all drivers
>>>> could report this capability too.
>>>>
>>>>>>
>>>>>>>> * Use of id values rather than user-provided handles. Allowing the user/app
>>>>>>>>   to manage the amount of data stored per operation is a better solution, I
>>>>>>>>   feel than proscribing a certain about of in-driver tracking. Some apps may
>>>>>>>>   not care about anything other than a job being completed, while other apps
>>>>>>>>   may have significant metadata to be tracked. Taking the user-context
>>>>>>>>   handles out of the API also makes the driver code simpler.
>>>>>>>
>>>>>>> The user-provided handle was mainly used to simply application implementation,
>>>>>>> It provides the ability to quickly locate contexts.
>>>>>>>
>>>>>>> The "use of id values" seem like the dma_cookie of Linux DMA engine framework,
>>>>>>> user will get a unique dma_cookie after calling dmaengine_submit(), and then
>>>>>>> could use it to call dma_async_is_tx_complete() to get completion status.
>>>>>>>
>>>>>>
>>>>>> Yes, the idea of the id is the same - to locate contexts. The main
>>>>>> difference is that if we have the driver manage contexts or pointer to
>>>>>> contexts, as well as giving more work to the driver, it complicates the APIs
>>>>>> for measuring completions. If we use an ID-based approach, where the app
>>>>>> maintains its own ring of contexts (if any), it avoids the need to have an
>>>>>> "out" parameter array for returning those contexts, which needs to be
>>>>>> appropriately sized. Instead we can just report that all ids up to N are
>>>>>> completed. [This would be similar to your suggestion that N jobs be
>>>>>> reported as done, in that no contexts are provided, it's just that knowing
>>>>>> the ID of what is completed is generally more useful than the number (which
>>>>>> can be obviously got by subtracting the old value)]
>>>>>>
>>>>>> We are still working on prototyping all this, but would hope to have a
>>>>>> functional example of all this soon.
>>>>>>
>>>>>>> How about define the copy prototype as following:
>>>>>>>   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
>>>>>>> while the dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
>>>>>>> enqueue successful else fail.
>>>>>>> when complete the dmadev will return latest completed dma_cookie, and the
>>>>>>> application could use the dma_cookie to quick locate contexts.
>>>>>>>
>>>>>>
>>>>>> If I understand this correctly, I believe this is largely what I was
>>>>>> suggesting - just with the typedef for the type? In which case it obviously
>>>>>> looks good to me.
>>>>>>
>>>>>>>> * I've kept a single combined API for completions, which differs from the
>>>>>>>>   separate error handling completion API you propose. I need to give the
>>>>>>>>   two function approach a bit of thought, but likely both could work. If we
>>>>>>>>   (likely) never expect failed ops, then the specifics of error handling
>>>>>>>>   should not matter that much.
>>>>>>>
>>>>>>> The rte_ioat_completed_ops API is too complex, and consider some applications
>>>>>>> may never copy fail, so split them as two API.
>>>>>>> It's indeed not friendly to other scenarios that always require error handling.
>>>>>>>
>>>>>>> I prefer use completed operations number as return value other than the ID so
>>>>>>> that application could simple judge whether have new completed operations, and
>>>>>>> the new prototype:
>>>>>>>  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
>>>>>>>
>>>>>>> 1) for normal case which never expect failed ops:
>>>>>>>    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
>>>>>>> 2) for other case:
>>>>>>>    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
>>>>>>>    at this point the fails <= ret <= max_status
>>>>>>>
>>>>>> Completely agree that we need to plan for the happy-day case where all is
>>>>>> passing. Looking at the prototypes you have above, I am ok with returning
>>>>>> number of completed ops as the return value with the final completed cookie
>>>>>> as an "out" parameter.
>>>>>> For handling errors, I'm ok with what you propose above, just with one
>>>>>> small adjustment - I would remove the restriction that ret <= max_status.
>>>>>>
>>>>>> In case of zero-failures, we can report as many ops succeeding as we like,
>>>>>> and even in case of failure, we can still report as many successful ops as
>>>>>> we like before we start filling in the status field. For example, if 32 ops
>>>>>> are completed, and the last one fails, we can just fill in one entry into
>>>>>> status, and return 32. Alternatively if the 4th last one fails we fill in 4
>>>>>> entries and return 32. The only requirements would be:
>>>>>> * fails <= max_status
>>>>>> * fails <= ret
>>>>>> * cookie holds the id of the last entry in status.
>>>>>
>>>>> I think we understand the same:
>>>>>
>>>>> The fails <= ret <= max_status include following situation:
>>>>> 1) If max_status is 32, and there are 32 completed ops, then the ret will be 32
>>>>> no matter which ops is failed
>>>>> 2) If max_status is 33, and there are 32 completed ops, then the ret will be 32
>>>>> 3) If max_status is 16, and there are 32 completed ops, then the ret will be 16
>>>>>
>>>>> and the cookie always hold the id of the last returned completed ops, no matter
>>>>> it's completed successful or failed
>>>>>
>>>>
>>>> I actually disagree on the #3. If max_status is 16, there are 32 completed
>>>> ops, and *no failures* the ret will be 32, not 16, because we are not
>>>> returning any status entries so max_status need not apply. Keeping that
>>>> same scenario #3, depending on the number of failures and the point of
>>>> them, the return value may similarly vary, for example:
>>>> * if job #28 fails, then ret could still be 32, cookie would be the cookie
>>>>   for that job, "fails" parameter would return as 4, with status holding the
>>>>   failure of 28 plus the succeeded status of jobs 29-31, i.e. 4 elements.
>>>> * if job #5 fails, then we can't fit the status list from 5 though 31 in an
>>>>   array of 16, so "fails" == 16(max_status) and status contains the 16
>>>>   statuses starting from #5, which means that cookie contains the value for
>>>>   job #20 and ret is 21.
>>>>
>>>> In other words, ignore max_status and status parameters *unless we have an
>>>> error to return*, meaning the fast-path/happy-day case works as fast as
>>>> possible. You don't need to worry about sizing your status array to be big,
>>>> and you always get back a large number of completions when available. Your
>>>> fastpath code only need check the "fails" parameter to see if status needs
>>>> to ever be consulted, and in normal case it doesn't.
>>>>
>>>> If this is too complicated, maybe we can simplify a little by returning just
>>>> one failure at a time, though at the cost of making error handling slower?
>>>>
>>>> rte_dmadev_completed(dev_id, &cookie, &failure_status)
>>>>
>>>> In this case, we always return the number of completed ops on success,
>>>> while on failure, we return the first error code. For a single error, this
>>>> works fine, but if we get a burst of errors together, things will work
>>>> slower - which may be acceptable if errors are very rare. However, for idxd
>>>> at least if a fence occurs after a failure all jobs in the batch after the
>>>> fence would be skipped, which would lead to the "burst of errors" case.
>>>> Therefore, I'd prefer to have the original suggestion allowing multiple
>>>> errors to be reported at a time.
>>>>
>>>> /Bruce
>>>
>>> Apologies for self-reply, but thinking about it more, a combination of
>>> normal-case and error-case APIs may be just simpler:
>>>
>>> int rte_dmadev_completed(dev_id, &cookie)
>>>
>>> returns number of items completed and cookie of last item. If there is an
>>> error, returns all successfull values up to the error entry and returns -1
>>> on subsequent call.
>>>
>>> int rte_dmadev_completed_status(dev_id, &cookie, max_status, status_array,
>>>       &error_count)
>>>
>>> this is a slower completion API which behaves like you originally said
>>> above, returning number of completions x, 0 <= x <= max_status, with x
>>> status values filled into array, and the number of unsuccessful values in
>>> the error_count value.
>>>
>>> This would allow code to be written in the application to use
>>> rte_dmadev_completed() in the normal case, and on getting a "-1" value, use
>>> rte_dmadev_completed_status() to get the error details. If strings of
>>> errors might be expected, the app can continually use the
>>> completed_status() function until error_count returns 0, and then switch
>>> back to the faster/simpler version.
>>
>> This two-function simplify the status_array's maintenance because we don't need init it to zero.
>> I think it's a good trade-off between performance and rich error info (status code).
>>
>> Here I'd like to discuss the 'burst size', which is widely used in DPDK application (e.g.
>> nic polling or ring en/dequeue).
>> Currently we don't define a max completed ops in rte_dmadev_completed() API, the return
>> value may greater than 'burst size' of application, this may result in the application need to
>> maintain (or remember) the return value of the function and special handling at the next poll.
>>
>> Also consider there may multiple calls rte_dmadev_completed to check fail, it may make it
>> difficult for the application to use.
>>
>> So I prefer following prototype:
>>   uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error)
>>     -- nb_cpls: indicate max process operations number
>>     -- has_error: indicate if there is an error
>>     -- return value: the number of successful completed operations.
>>     -- example:
>>        1) If there are already 32 completed ops, and 4th is error, and nb_cpls is 32, then
>>           the ret will be 3(because 1/2/3th is OK), and has_error will be true.
>>        2) If there are already 32 completed ops, and all successful completed, then the ret
>>           will be min(32, nb_cpls), and has_error will be false.
>>        3) If there are already 32 completed ops, and all failed completed, then the ret will
>>           be 0, and has_error will be true.
>>   uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_status, uint32_t *status)
>>     -- return value: the number of failed completed operations.
> 
> 
> 
> In typical storage use cases etc, Sometimes application need to
> provide scatter-gather list,
> At least in our hardware sg list gives a "single completion result"
> and it stops on the first failure to restart
> the transfer by application. Have you thought of scatter-gather use
> case and how it is in other  HW?

A cookie and a request are in one-to-one correspondence, whether the request is a single copy or an sg-list.
Kunpeng9x0 doesn't support sg-list; I'm still investigating other hardware.

Does 'restart the transfer by application' above mean re-scheduling the request (which gets a new cookie), or
just re-enabling the current failed request (which may need a new API)?
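
To illustrate the one-to-one cookie/request mapping from the application
side, here is a small sketch of an app-owned context ring indexed by cookie.
It assumes cookies are non-negative and increase by one per request, as
discussed earlier in the thread; APP_CTX_RING, app_ctx_store() and
app_ctx_lookup() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size: a power of two, at least the max number of in-flight ops. */
#define APP_CTX_RING 256u

struct app_ctx_ring {
	void *ctx[APP_CTX_RING];
};

/* Remember the context of the request that was given this cookie. */
void app_ctx_store(struct app_ctx_ring *r, int32_t cookie, void *ctx)
{
	assert(cookie >= 0);
	r->ctx[(uint32_t)cookie & (APP_CTX_RING - 1)] = ctx;
}

/* Find the context again once the cookie is reported as completed. */
void *app_ctx_lookup(const struct app_ctx_ring *r, int32_t cookie)
{
	assert(cookie >= 0);
	return r->ctx[(uint32_t)cookie & (APP_CTX_RING - 1)];
}
```

Whether the request behind a cookie is a single copy or an sg-list makes no
difference here: the app stores one context per cookie either way.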

> 
> prototype like the following works for us:
> rte_dmadev_enq_sg(void **src, void **dest, unsigned int **length, int
> nb_segments, cookie, ,,,)

OK, we could define a scatter-list struct to wrap src/dest/length.
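
One possible shape for such a struct, as a sketch only: the layout and the
names sketch_sge, sketch_sg and sketch_sg_copy() are assumptions, and the
copy routine is a software stand-in that mimics the "single completion
result, stop on first failure" behaviour described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One segment of a scatter-gather request (hypothetical layout). */
struct sketch_sge {
	void *src;
	void *dst;
	uint32_t length;
};

/* The whole list is one request, so it would map to one cookie. */
struct sketch_sg {
	struct sketch_sge *sge;
	uint16_t nb_segments;
};

/*
 * Software stand-in for an sg copy. Stops at the first bad segment and
 * returns its index, or -1 if every segment was copied, i.e. a single
 * completion result for the whole list.
 */
int sketch_sg_copy(const struct sketch_sg *sg)
{
	assert(sg != NULL);
	for (uint16_t i = 0; i < sg->nb_segments; i++) {
		const struct sketch_sge *e = &sg->sge[i];

		if (e->src == NULL || e->dst == NULL)
			return i;
		memcpy(e->dst, e->src, e->length);
	}
	return -1;
}
```

Returning the index of the failing segment is one way to let the
application decide where to restart; real hardware may report this
differently.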

> 
> 
>>
>> The application use the following invocation order when polling:
>>   has_error = false; // could be init to false by dmadev API, we need discuss
>>   ret = rte_dmadev_completed(dev_id, &cookie, burst_size, &has_error);
>>   // process successful completed case:
>>   for (int i = 0; i < ret; i++) {
>>   }
>>   if (unlikely(has_error)) {
>>     // process failed completed case
>>     ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret, status_array);
>>     for (int i = 0; i < ret; i++) {
>>       // ...
>>     }
>>   }
>>
>>
>>>
>>> This two-function approach also allows future support for other DMA
>>> functions such as comparison, where a status value is always required. Any
>>> apps using that functionality would just always use the "_status" function
>>> for completions.
>>>
>>> /Bruce
>>>
>>> .
>>>
>>
> 
> .
>
Hu, Jiayu June 23, 2021, 5:34 a.m. UTC | #38
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Bruce Richardson
> Sent: Thursday, June 17, 2021 1:31 AM
> To: fengchengwen <fengchengwen@huawei.com>
> Cc: thomas@monjalon.net; Yigit, Ferruh <ferruh.yigit@intel.com>;
> dev@dpdk.org; nipun.gupta@nxp.com; hemant.agrawal@nxp.com;
> maxime.coquelin@redhat.com; honnappa.nagarahalli@arm.com;
> jerinj@marvell.com; david.marchand@redhat.com; jerinjacobk@gmail.com
> Subject: Re: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library
> 
> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > On 2021/6/16 0:38, Bruce Richardson wrote:
> > > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > >> This patch introduces 'dmadevice' which is a generic type of DMA
> > >> device.
> > >>
> > >> The APIs of dmadev library exposes some generic operations which
> > >> can enable configuration and I/O with the DMA devices.
> > >>
> > >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > >> ---
> > > Thanks for sending this.
> > >
> > > Of most interest to me right now are the key data-plane APIs. While
> > > we are still in the prototyping phase, below is a draft of what we
> > > are thinking for the key enqueue/perform_ops/completed_ops APIs.
> > >
> > > Some key differences I note in below vs your original RFC:
> > > * Use of void pointers rather than iova addresses. While using iova's
> makes
> > >   sense in the general case when using hardware, in that it can work with
> > >   both physical addresses and virtual addresses, if we change the APIs to
> use
> > >   void pointers instead it will still work for DPDK in VA mode, while at the
> > >   same time allow use of software fallbacks in error cases, and also a stub
> > >   driver than uses memcpy in the background. Finally, using iova's makes
> the
> > >   APIs a lot more awkward to use with anything but mbufs or similar
> buffers
> > >   where we already have a pre-computed physical address.
> >
> > The iova is an hint to application, and widely used in DPDK.
> > If switch to void, how to pass the address (iova or just va ?) this
> > may introduce implementation dependencies here.
> >
> > Or always pass the va, and the driver performs address translation,
> > and this translation may cost too much cpu I think.
> >
> 
> On the latter point, about driver doing address translation I would agree.
> However, we probably need more discussion about the use of iova vs just
> virtual addresses. My thinking on this is that if we specify the API using iovas
> it will severely hurt usability of the API, since it forces the user to take more
> inefficient codepaths in a large number of cases. Given a pointer to the
> middle of an mbuf, one cannot just pass that straight as an iova but must
> instead do a translation into offset from mbuf pointer and then readd the
> offset to the mbuf base address.

Agree. Vhost is one consumer of DMA devices. To support SW fallback
in case of DMA copy errors, vhost needs to pass VA for both DPDK mbuf
and guest buffer to the callback layer (a middle layer between vhost and
dma device). If DMA devices use iova, the callback layer will have to
call rte_mem_virt2iova() to translate va to iova in the data path, even if iova
is va in some cases. But if DMA devices claim to use va, device differences
can be hidden inside the driver, which makes the DMA callback layer simpler
and more efficient.
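
To illustrate the point, here is a minimal sketch of what such a callback
layer could look like when the DMA API takes plain virtual addresses. The
names stub_dma_copy() and vhost_cb_copy() are hypothetical and only stand in
for a real dmadev call and the vhost callback; the 4096-byte limit is just a
way to force an error in the stub.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a VA-based dmadev enqueue; not a real DPDK call.
 * Returns 0 on success, negative on failure. The stub rejects copies larger
 * than 4096 bytes to model a device-side error.
 */
static int
stub_dma_copy(void *dst, const void *src, size_t len)
{
	if (len > 4096)
		return -1;
	memcpy(dst, src, len); /* the stub "hardware" does the copy itself */
	return 0;
}

/* Callback-layer copy: try the DMA device, fall back to the CPU on error.
 * Both paths take the same virtual addresses, so no rte_mem_virt2iova()
 * call is needed here.
 * Returns 1 if the CPU fallback was used, 0 otherwise.
 */
static int
vhost_cb_copy(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);

	if (stub_dma_copy(dst, src, len) == 0)
		return 0;

	memcpy(dst, src, len); /* SW fallback reuses the same pointers */
	return 1;
}
```

With an iova-based API, the fallback branch would still need the original
virtual addresses, so the layer would have to carry both forms around.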

Thanks,
Jiayu

> 
> My preference therefore is to require the use of an IOMMU when using a
> dmadev, so that it can be a much closer analog of memcpy. Once an iommu
> is present, DPDK will run in VA mode, allowing virtual addresses to our
> hugepage memory to be sent directly to hardware. Also, when using
> dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> management for the app, removing further the restrictions on what memory
> can be addressed by hardware.
> 
> > > * Use of id values rather than user-provided handles. Allowing the user/app
> > >   to manage the amount of data stored per operation is a better solution, I
> > >   feel than proscribing a certain about of in-driver tracking. Some apps may
> > >   not care about anything other than a job being completed, while other apps
> > >   may have significant metadata to be tracked. Taking the user-context
> > >   handles out of the API also makes the driver code simpler.
> >
> > The user-provided handle was mainly used to simply application
> > implementation, It provides the ability to quickly locate contexts.
> >
> > The "use of id values" seem like the dma_cookie of Linux DMA engine
> > framework, user will get a unique dma_cookie after calling
> > dmaengine_submit(), and then could use it to call
> > dma_async_is_tx_complete() to get completion status.
> >
> 
> Yes, the idea of the id is the same - to locate contexts. The main difference is
> that if we have the driver manage contexts or pointer to contexts, as well as
> giving more work to the driver, it complicates the APIs for measuring
> completions. If we use an ID-based approach, where the app maintains its
> own ring of contexts (if any), it avoids the need to have an "out" parameter
> array for returning those contexts, which needs to be appropriately sized.
> Instead we can just report that all ids up to N are completed. [This would be
> similar to your suggestion that N jobs be reported as done, in that no
> contexts are provided, it's just that knowing the ID of what is completed is
> generally more useful than the number (which can be obviously got by
> subtracting the old value)]
> 
> We are still working on prototyping all this, but would hope to have a
> functional example of all this soon.
> 
> > How about define the copy prototype as following:
> >   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx) while the
> > dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
> > enqueue successful else fail.
> > when complete the dmadev will return latest completed dma_cookie, and
> > the application could use the dma_cookie to quick locate contexts.
> >
> 
> If I understand this correctly, I believe this is largely what I was suggesting -
> just with the typedef for the type? In which case it obviously looks good to
> me.
> 
> > > * I've kept a single combined API for completions, which differs from the
> > >   separate error handling completion API you propose. I need to give the
> > >   two function approach a bit of thought, but likely both could work. If we
> > >   (likely) never expect failed ops, then the specifics of error handling
> > >   should not matter that much.
> >
> > The rte_ioat_completed_ops API is too complex, and consider some
> > applications may never copy fail, so split them as two API.
> > It's indeed not friendly to other scenarios that always require error
> > handling.
> >
> > I prefer use completed operations number as return value other than
> > the ID so that application could simple judge whether have new
> > completed operations, and the new prototype:
> >  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie,
> > uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> >
> > 1) for normal case which never expect failed ops:
> >    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0,
> > NULL);
> > 2) for other case:
> >    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status,
> >                               &fails);
> >    at this point the fails <= ret <= max_status
> >
> Completely agree that we need to plan for the happy-day case where all is
> passing. Looking at the prototypes you have above, I am ok with returning
> number of completed ops as the return value with the final completed cookie
> as an "out" parameter.
> For handling errors, I'm ok with what you propose above, just with one small
> adjustment - I would remove the restriction that ret <= max_status.
> 
> In case of zero-failures, we can report as many ops succeeding as we like,
> and even in case of failure, we can still report as many successful ops as we
> like before we start filling in the status field. For example, if 32 ops are
> completed, and the last one fails, we can just fill in one entry into status, and
> return 32. Alternatively if the 4th last one fails we fill in 4 entries and return
> 32. The only requirements would be:
> * fails <= max_status
> * fails <= ret
> * cookie holds the id of the last entry in status.
> 
> A further possible variation is to have separate "completed" and
> "completed_status" APIs, where "completed_status" is as above, but
> "completed" skips the final 3 parameters and returns -1 on error. In that case
> the user can fall back to the completed_status call.
> 
> > >
> > > For the rest, the control / setup APIs are likely to be rather
> > > uncontroversial, I suspect. However, I think that rather than xstats
> > > APIs, the library should first provide a set of standardized stats
> > > like ethdev does. If driver-specific stats are needed, we can add
> > > xstats later to the API.
> >
> > Agree, will fix in v2
> >
> Thanks. In parallel, we will be working on our prototype implementation too,
> taking in the feedback here, and hopefully send it as an RFC soon.
> Then we can look to compare and contrast and arrive at an agreed API. It
> might also be worthwhile to set up a community call for all interested parties
> in this API to discuss things with a more rapid turnaround. That was done in
> the past for other new device class APIs that were developed, e.g. eventdev.
> 
> Regards,
> /Bruce
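
For reference, the completion semantics quoted above (fails <= max_status,
fails <= ret, cookie holds the id of the last reported op) can be sketched
roughly as below. This is only an illustration against a stub device:
struct stub_dev, stub_completed() and the ring layout are assumptions made
for the example, and dma_cookie_t is the int32 typedef proposed earlier in
the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t dma_cookie_t;

#define STUB_RING_SZ 64

/* Hypothetical stub device state, not a real driver structure. */
struct stub_dev {
	uint32_t status[STUB_RING_SZ]; /* per-op result, 0 = success */
	dma_cookie_t next_report;      /* first cookie not yet reported */
	dma_cookie_t hw_done;          /* one past the last finished cookie */
};

/* Report finished ops. Successful ops before the first failure need no
 * status entry, so any number of them can be reported. From the first
 * failure onwards every reported op gets a status entry, and reporting
 * stops once max_status entries are used.
 * Returns the number of ops reported; *cookie is set to the last one.
 */
static uint16_t
stub_completed(struct stub_dev *dev, dma_cookie_t *cookie, uint32_t *status,
	       uint16_t max_status, uint16_t *num_fails)
{
	uint16_t ret = 0, filled = 0, fails = 0;
	int seen_fail = 0;

	while (dev->next_report < dev->hw_done) {
		uint32_t st = dev->status[dev->next_report % STUB_RING_SZ];

		if (st != 0)
			seen_fail = 1;
		if (seen_fail) {
			if (filled == max_status)
				break; /* no room left to describe this op */
			status[filled++] = st;
			if (st != 0)
				fails++;
		}
		*cookie = dev->next_report++;
		ret++;
	}
	if (num_fails != NULL)
		*num_fails = fails;
	return ret;
}
```

With 32 finished ops where the 4th last one failed, this returns 32 and
fills 4 status entries; called with status == NULL and max_status == 0 it
simply stops in front of the first failed op.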
Jerin Jacob June 23, 2021, 7 a.m. UTC | #39
On Wed, Jun 23, 2021 at 12:47 AM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Tue, Jun 22, 2021 at 11:01:47PM +0530, Jerin Jacob wrote:
> > On Fri, Jun 18, 2021 at 3:25 PM Bruce Richardson
> > <bruce.richardson@intel.com> wrote:
> > >
> > > >
> > > Taking the case of a simple copy op, the parameters we need are:
> > >
> > > * src
> > > * dst
> > > * length
> >
> > OK. Is it the case where no other attribute that supported in HW or
> > you are not planning to
> > expose that through DPDK generic DMA API.
> >
> Only other parameters that might be needed can all be specified as flags,
> so all we need for a copy op is a general flags field for future expansion.
>
> > >
> > > Depending on the specific hardware there will also be passed in the
> > > descriptor a completion address, but we plan for these cases to always have
> > > the completions written back to a set location so that we have essentially
> > > ring-writeback, as with the hardware which doesn't explicitly have a
> > > separate completion address. Beyond that, I believe the only descriptor
> > > fields we will use are just the flags field indicating the op type etc.
> >
> > OK. In HW, we need to have IOVA for completion address that's only the
> > constraint. rest looks good to me.
> >
> That's like what we have, but I was not planning on exposing the completion
> address through the API at all, but have it internal to the driver and let
> the "completion" APIs just inform the app what is done or not. If we expose
> completion addresses, then that leaves the app open to having to parse
> different completion formats, so it needs to be internal IMHO.

Ack

>
> /Bruce
Jerin Jacob June 23, 2021, 7:21 a.m. UTC | #40
On Wed, Jun 23, 2021 at 9:00 AM fengchengwen <fengchengwen@huawei.com> wrote:
>

> >>>
> >>>>
> >>>>> The above will give better performance and is the best trade-off c
> >>>>> between performance and per transfer variables.
> >>>>
> >>>> We may need to have different APIs for context-aware and context-unaware
> >>>> processing, with which to use determined by the capabilities discovery.
> >>>> Given that for these DMA devices the offload cost is critical, more so than
> >>>> any other dev class I've looked at before, I'd like to avoid having APIs
> >>>> with extra parameters than need to be passed about since that just adds
> >>>> extra CPU cycles to the offload.
> >>>
> >>> If driver does not support additional attributes and/or the
> >>> application does not need it, rte_dmadev_desc_t can be NULL.
> >>> So that it won't have any cost in the datapath. I think, we can go to
> >>> different API
> >>> cases if we can not abstract problems without performance impact.
> >>> Otherwise, it will be too much
> >>> pain for applications.
> >>
> >> Yes, currently we plan to use different API for different case, e.g.
> >>   rte_dmadev_memcpy()  -- deal with local to local memcopy
> >>   rte_dmadev_memset()  -- deal with fill with local memory with pattern
> >> maybe:
> >>   rte_dmadev_imm_data()  --deal with copy very little data
> >>   rte_dmadev_p2pcopy()   --deal with peer-to-peer copy of diffenet PCIE addr
> >>
> >> These API capabilities will be reflected in the device capability set so that
> >> application could know by standard API.
> >
> >
> > There will be a lot of combination of that it will be like M x N cross
> > base case, It won't scale.
>
> Currently, it is hard to define generic dma descriptor, I think the well-defined
> APIs is feasible.

I would like to understand why it is not feasible if we move the
preparation to the slow path.

i.e.

struct rte_dmadev_desc defines all the "attributes" of all DMA devices,
made available through capabilities. I believe with this scheme we can
scale and incorporate all features of all DMA HW without any performance
impact.

something like:

struct rte_dmadev_desc {
  /* Attributes of a DMA transfer, available for all HW under capability. */
  channel or port;
  ops; // copy, fill etc.
  /* Implementation-opaque memory as a zero-length array;
   * rte_dmadev_desc_prep() updates this memory with HW-specific information.
   */
  uint8_t impl_opq[];
};

// Allocate the memory for the DMA descriptor
struct rte_dmadev_desc *rte_dmadev_desc_alloc(devid);
// Convert DPDK-specific descriptors to HW-specific descriptors in the slow path
rte_dmadev_desc_prep(devid, struct rte_dmadev_desc *desc);
// Free the DMA descriptor memory
rte_dmadev_desc_free(devid, struct rte_dmadev_desc *desc);

The above calls are in the slow path.

Only the call below is in the fast path.
// Here desc can be NULL (in case you don't need any specific attribute
// attached to the transfer); if needed, it can be an object which has gone
// through rte_dmadev_desc_prep().
rte_dmadev_enq(devid, struct rte_dmadev_desc *desc, void *src, void *dest,
               unsigned int len, cookie);
>
> >
> >>
> >>>
> >>> Just to understand, I think, we need to HW capabilities and how to
> >>> have a common API.
> >>> I assume HW will have some HW JOB descriptors which will be filled in
> >>> SW and submitted to HW.
> >>> In our HW,  Job descriptor has the following main elements
> >>>
> >>> - Channel   // We don't expect the application to change per transfer
> >>> - Source address - It can be scatter-gather too - Will be changed per transfer
> >>> - Destination address - It can be scatter-gather too - Will be changed
> >>> per transfer
> >>> - Transfer Length - - It can be scatter-gather too - Will be changed
> >>> per transfer
> >>> - IOVA address where HW post Job completion status PER Job descriptor
> >>> - Will be changed per transfer
> >>> - Another sideband information related to channel  // We don't expect
> >>> the application to change per transfer
> >>> - As an option, Job completion can be posted as an event to
> >>> rte_event_queue  too // We don't expect the application to change per
> >>> transfer
> >>
> >> The 'option' field looks like a software interface field, but not HW descriptor.
> >
> > It is in HW descriptor.
>
> The HW is interesting, something like: DMA could send completion direct to EventHWQueue,
> the DMA and EventHWQueue are link in the hardware range, rather than by software.

Yes.

>
> Could you provide public driver of this HW ? So we could know more about it's working
> mechanism and software-hardware collaboration.

http://code.dpdk.org/dpdk/v21.05/source/drivers/raw/octeontx2_dma/otx2_dpi_rawdev.h#L149
is the DMA instruction header.
Bruce Richardson June 23, 2021, 9:37 a.m. UTC | #41
On Wed, Jun 23, 2021 at 12:51:07PM +0530, Jerin Jacob wrote:
> On Wed, Jun 23, 2021 at 9:00 AM fengchengwen <fengchengwen@huawei.com> wrote:
> >
> 
> > >>>
> > >>>>
> > >>>>> The above will give better performance and is the best trade-off c
> > >>>>> between performance and per transfer variables.
> > >>>>
> > >>>> We may need to have different APIs for context-aware and context-unaware
> > >>>> processing, with which to use determined by the capabilities discovery.
> > >>>> Given that for these DMA devices the offload cost is critical, more so than
> > >>>> any other dev class I've looked at before, I'd like to avoid having APIs
> > >>>> with extra parameters than need to be passed about since that just adds
> > >>>> extra CPU cycles to the offload.
> > >>>
> > >>> If driver does not support additional attributes and/or the
> > >>> application does not need it, rte_dmadev_desc_t can be NULL.
> > >>> So that it won't have any cost in the datapath. I think, we can go to
> > >>> different API
> > >>> cases if we can not abstract problems without performance impact.
> > >>> Otherwise, it will be too much
> > >>> pain for applications.
> > >>
> > >> Yes, currently we plan to use different API for different case, e.g.
> > >>   rte_dmadev_memcpy()  -- deal with local to local memcopy
> > >>   rte_dmadev_memset()  -- deal with fill with local memory with pattern
> > >> maybe:
> > >>   rte_dmadev_imm_data()  --deal with copy very little data
> > >>   rte_dmadev_p2pcopy()   --deal with peer-to-peer copy of diffenet PCIE addr
> > >>
> > >> These API capabilities will be reflected in the device capability set so that
> > >> application could know by standard API.
> > >
> > >
> > > There will be a lot of combination of that it will be like M x N cross
> > > base case, It won't scale.
> >
> > Currently, it is hard to define generic dma descriptor, I think the well-defined
> > APIs is feasible.
> 
> I would like to understand why not feasible? if we move the
> preparation to the slow path.
> 
> i.e
> 
> struct rte_dmadev_desc defines all the "attributes" of all DMA devices available
> using capability. I believe with the scheme, we can scale and
> incorporate all features of
> all DMA HW without any performance impact.
> 
> something like:
> 
> struct rte_dmadev_desc {
>   /* Attributes all DMA transfer available for all HW under capability. */
>   channel or port;
>   ops ; // copy, fill etc..
>  /* impemention opqueue memory as zero length array,
> rte_dmadev_desc_prep() update this memory with HW specific information
> */
>   uint8_t impl_opq[];
> }
> 
> // allocate the memory for dma decriptor
> struct rte_dmadev_desc *rte_dmadev_desc_alloc(devid);
> // Convert DPDK specific descriptors to HW specific descriptors in slowpath */
> rte_dmadev_desc_prep(devid, struct rte_dmadev_desc *desc);
> // Free dma descriptor memory
> rte_dmadev_desc_free(devid, struct rte_dmadev_desc *desc )
> 
> The above calls in slow path.
> 
> Only below call in fastpath.
> // Here desc can be NULL(in case you don't need any specific attribute
> attached to transfer, if needed, it can be an object which is gone
> through rte_dmadev_desc_prep())
> rte_dmadev_enq(devid, struct rte_dmadev_desc *desc, void *src, void
> *dest, unsigned int len, cookie)
> 

The trouble here is the performance penalty due to building up and tearing
down structures and passing those structures into functions via function
pointer. With the APIs for enqueue/dequeue that have been discussed here,
all parameters will be passed in registers, and then each driver can do a
write of the actual hardware descriptor straight to cache/memory from
registers. With the scheme you propose above, the register contains a
pointer to the data which must then be loaded into the CPU before being
written out again. This increases our offload cost.

However, assuming that the desc_prep call is just for slowpath or
initialization time, I'd be ok to have the functions take an extra
hw-specific parameter for each call prepared with tx_prep. It would still
allow all other parameters to be passed in registers. How much data are you
looking to store in this desc struct? It can't all be represented as flags,
for example?

As for the individual APIs, while we could do a generic "enqueue" API which
takes the op as a parameter, I prefer having each operation as a separate
function, in order to increase the readability of the code and to reduce
the number of parameters needed per function, thereby saving registers
and potentially making the function calls and offload cost cheaper.
Perhaps we can have the "common" ops such as copy, fill, have
their own functions, and have a generic "enqueue" function for the
less-commonly used or supported ops?
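
As a rough sketch of that split, assuming a simple descriptor ring: the
stub_ names, the op codes and the descriptor layout below are made up for
the example, and ring-full handling is left out to keep it short.

```c
#include <assert.h>
#include <stdint.h>

#define STUB_RING_SZ 8

/* Hypothetical op codes and descriptor layout, for illustration only. */
enum stub_op { STUB_OP_COPY = 1, STUB_OP_FILL = 2, STUB_OP_OTHER = 3 };

struct stub_hw_desc {
	uint32_t op;
	uint32_t len;
	uint64_t src; /* source address, or the pattern for a fill */
	uint64_t dst;
};

struct stub_ring {
	struct stub_hw_desc desc[STUB_RING_SZ];
	uint16_t tail;
};

/* Write one descriptor straight from the arguments and return its id.
 * No ring-full check is done in this sketch.
 */
static inline int
stub_write(struct stub_ring *r, uint32_t op, uint64_t src, uint64_t dst,
	   uint32_t len)
{
	uint16_t id = r->tail++;
	struct stub_hw_desc *d = &r->desc[id % STUB_RING_SZ];

	d->op = op;
	d->len = len;
	d->src = src;
	d->dst = dst;
	return id;
}

/* Common ops get their own functions with only the parameters they need. */
static inline int
stub_copy(struct stub_ring *r, const void *src, void *dst, uint32_t len)
{
	return stub_write(r, STUB_OP_COPY, (uintptr_t)src, (uintptr_t)dst, len);
}

static inline int
stub_fill(struct stub_ring *r, uint64_t pattern, void *dst, uint32_t len)
{
	return stub_write(r, STUB_OP_FILL, pattern, (uintptr_t)dst, len);
}

/* Generic entry point for the less common ops; op is an extra parameter. */
static inline int
stub_enqueue(struct stub_ring *r, uint32_t op, const void *src, void *dst,
	     uint32_t len)
{
	return stub_write(r, op, (uintptr_t)src, (uintptr_t)dst, len);
}
```

The copy and fill wrappers each take four arguments, while the generic
call needs a fifth for the op.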

/Bruce
Bruce Richardson June 23, 2021, 9:41 a.m. UTC | #42
On Tue, Jun 22, 2021 at 10:55:24PM +0530, Jerin Jacob wrote:
> On Fri, Jun 18, 2021 at 3:11 PM fengchengwen <fengchengwen@huawei.com> wrote:
> >
> > On 2021/6/18 13:52, Jerin Jacob wrote:
> > > On Thu, Jun 17, 2021 at 2:46 PM Bruce Richardson
> > > <bruce.richardson@intel.com> wrote:
> > >>
> > >> On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:
> > >>> On Wed, Jun 16, 2021 at 3:47 PM fengchengwen <fengchengwen@huawei.com> wrote:
> > >>>>
> > >>>> On 2021/6/16 15:09, Morten Brørup wrote:
> > >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > >>>>>> Sent: Tuesday, 15 June 2021 18.39
> > >>>>>>
> > >>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > >>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
> > >>>>>>> device.
> > >>>>>>>
> > >>>>>>> The APIs of dmadev library exposes some generic operations which can
> > >>>>>>> enable configuration and I/O with the DMA devices.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > >>>>>>> ---
> > >>>>>> Thanks for sending this.
> > >>>>>>
> > >>>>>> Of most interest to me right now are the key data-plane APIs. While we
> > >>>>>> are
> > >>>>>> still in the prototyping phase, below is a draft of what we are
> > >>>>>> thinking
> > >>>>>> for the key enqueue/perform_ops/completed_ops APIs.
> > >>>>>>
> > >>>>>> Some key differences I note in below vs your original RFC:
> > >>>>>> * Use of void pointers rather than iova addresses. While using iova's
> > >>>>>> makes
> > >>>>>>   sense in the general case when using hardware, in that it can work
> > >>>>>> with
> > >>>>>>   both physical addresses and virtual addresses, if we change the APIs
> > >>>>>> to use
> > >>>>>>   void pointers instead it will still work for DPDK in VA mode, while
> > >>>>>> at the
> > >>>>>>   same time allow use of software fallbacks in error cases, and also a
> > >>>>>> stub
> > >>>>>>   driver than uses memcpy in the background. Finally, using iova's
> > >>>>>> makes the
> > >>>>>>   APIs a lot more awkward to use with anything but mbufs or similar
> > >>>>>> buffers
> > >>>>>>   where we already have a pre-computed physical address.
> > >>>>>> * Use of id values rather than user-provided handles. Allowing the
> > >>>>>> user/app
> > >>>>>>   to manage the amount of data stored per operation is a better
> > >>>>>> solution, I
> > >>>>>>   feel than proscribing a certain about of in-driver tracking. Some
> > >>>>>> apps may
> > >>>>>>   not care about anything other than a job being completed, while other
> > >>>>>> apps
> > >>>>>>   may have significant metadata to be tracked. Taking the user-context
> > >>>>>>   handles out of the API also makes the driver code simpler.
> > >>>>>> * I've kept a single combined API for completions, which differs from
> > >>>>>> the
> > >>>>>>   separate error handling completion API you propose. I need to give
> > >>>>>> the
> > >>>>>>   two function approach a bit of thought, but likely both could work.
> > >>>>>> If we
> > >>>>>>   (likely) never expect failed ops, then the specifics of error
> > >>>>>> handling
> > >>>>>>   should not matter that much.
> > >>>>>>
> > >>>>>> For the rest, the control / setup APIs are likely to be rather
> > >>>>>> uncontroversial, I suspect. However, I think that rather than xstats
> > >>>>>> APIs,
> > >>>>>> the library should first provide a set of standardized stats like
> > >>>>>> ethdev
> > >>>>>> does. If driver-specific stats are needed, we can add xstats later to
> > >>>>>> the
> > >>>>>> API.
> > >>>>>>
> > >>>>>> Appreciate your further thoughts on this, thanks.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> /Bruce
> > >>>>>
> > >>>>> I generally agree with Bruce's points above.
> > >>>>>
> > >>>>> I would like to share a couple of ideas for further discussion:
> > >>>
> > >>>
> > >>> I believe some of the other requirements and comments for generic DMA will be
> > >>>
> > >>> 1) Support for the _channel_, Each channel may have different
> > >>> capabilities and functionalities.
> > >>> Typical cases are, each channel have separate source and destination
> > >>> devices like
> > >>> DMA between PCIe EP to Host memory, Host memory to Host memory, PCIe
> > >>> EP to PCIe EP.
> > >>> So we need some notion of the channel in the specification.
> > >>>
> > >>
> > >> Can you share a bit more detail on what constitutes a channel in this case?
> > >> Is it equivalent to a device queue (which we are flattening to individual
> > >> devices in this API), or to a specific configuration on a queue?
> > >
> > > It not a queue. It is one of the attributes for transfer.
> > > I.e in the same queue, for a given transfer it can specify the
> > > different "source" and "destination" device.
> > > Like CPU to Sound card, CPU to network card etc.
> > >
> > >
> > >>
> > >>> 2) I assume current data plane APIs are not thread-safe. Is it right?
> > >>>
> > >> Yes.
> > >>
> > >>>
> > >>> 3) Cookie scheme outlined earlier looks good to me. Instead of having
> > >>> generic dequeue() API
> > >>>
> > >>> 4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
> > >>> void * dst, unsigned int length);
> > >>> to two stage API like, Where one will be used in fastpath and other
> > >>> one will use used in slowpath.
> > >>>
> > >>> - slowpath API will for take channel and take other attributes for transfer
> > >>>
> > >>> Example syantx will be:
> > >>>
> > >>> struct rte_dmadev_desc {
> > >>>            channel id;
> > >>>            ops ; // copy, xor, fill etc
> > >>>           other arguments specific to dma transfer // it can be set
> > >>> based on capability.
> > >>>
> > >>> };
> > >>>
> > >>> rte_dmadev_desc_t rte_dmadev_preprare(uint16_t dev_id,  struct
> > >>> rte_dmadev_desc *dec);
> > >>>
> > >>> - Fastpath takes arguments that need to change per transfer along with
> > >>> slow-path handle.
> > >>>
> > >>> rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst, unsigned
> > >>> int length,  rte_dmadev_desc_t desc)
> > >>>
> > >>> This will help to driver to
> > >>> -Former API form the device-specific descriptors in slow path  for a
> > >>> given channel and fixed attributes per transfer
> > >>> -Later API blend "variable" arguments such as src, dest address with
> > >>> slow-path created descriptors
> > >>>
> > >>
> > >> This seems like an API for a context-aware device, where the channel is the
> > >> config data/context that is preserved across operations - is that correct?
> > >> At least from the Intel DMA accelerators side, we have no concept of this
> > >> context, and each operation is completely self-described. The location or
> > >> type of memory for copies is irrelevant, you just pass the src/dst
> > >> addresses to reference.
> > >
> > > it is not context-aware device. Each HW JOB is self-described.
> > > You can view it different attributes of transfer.
> > >
> > >
> > >>
> > >>> The above will give better performance and is the best trade-off c
> > >>> between performance and per transfer variables.
> > >>
> > >> We may need to have different APIs for context-aware and context-unaware
> > >> processing, with which to use determined by the capabilities discovery.
> > >> Given that for these DMA devices the offload cost is critical, more so than
> > >> any other dev class I've looked at before, I'd like to avoid having APIs
> > >> with extra parameters than need to be passed about since that just adds
> > >> extra CPU cycles to the offload.
> > >
> > > If driver does not support additional attributes and/or the
> > > application does not need it, rte_dmadev_desc_t can be NULL.
> > > So that it won't have any cost in the datapath. I think, we can go to
> > > different API
> > > cases if we can not abstract problems without performance impact.
> > > Otherwise, it will be too much
> > > pain for applications.
> >
> > Yes, currently we plan to use different API for different case, e.g.
> >   rte_dmadev_memcpy()  -- deal with local to local memcopy
> >   rte_dmadev_memset()  -- deal with fill with local memory with pattern
> > maybe:
> >   rte_dmadev_imm_data()  --deal with copy very little data
> >   rte_dmadev_p2pcopy()   --deal with peer-to-peer copy of diffenet PCIE addr
> >
> > These API capabilities will be reflected in the device capability set so that
> > application could know by standard API.
> 
> 
> There will be a lot of combination of that it will be like M x N cross
> base case, It won't scale.
> 

What are the various cases that are so significantly different? Using the
examples above, the "imm_data" and "p2p_copy" operations are still copy
ops, and the fact of it being a small copy or a p2p one can be expressed
just using flags? [Also, you are not likely to want to offload a small
copy, are you?]

> >
> > >
> > > Just to understand, I think, we need to HW capabilities and how to
> > > have a common API.
> > > I assume HW will have some HW JOB descriptors which will be filled in
> > > SW and submitted to HW.
> > > In our HW,  Job descriptor has the following main elements
> > >
> > > - Channel   // We don't expect the application to change per transfer
> > > - Source address - It can be scatter-gather too - Will be changed per transfer
> > > - Destination address - It can be scatter-gather too - Will be changed
> > > per transfer
> > > - Transfer Length - - It can be scatter-gather too - Will be changed
> > > per transfer
> > > - IOVA address where HW post Job completion status PER Job descriptor
> > > - Will be changed per transfer
> > > - Another sideband information related to channel  // We don't expect
> > > the application to change per transfer
> > > - As an option, Job completion can be posted as an event to
> > > rte_event_queue  too // We don't expect the application to change per
> > > transfer
> >
> > The 'option' field looks like a software interface field, but not HW descriptor.
> 
> It is in HW descriptor.
> 
> >
> > >
> > > @Richardson, Bruce @fengchengwen @Hemant Agrawal
> > >
> > > Could you share the options for your HW descriptors  which you are
> > > planning to expose through API like above so that we can easily
> > > converge on fastpath API
> > >
> >
> > Kunpeng HW descriptor is self-describing, and don't need refer context info.
> >
> > Maybe the fields which was fix with some transfer type could setup by driver, and
> > don't expose to application.
> 
> Yes. I agree.I think, that reason why I though to have
> rte_dmadev_prep() call to convert DPDK DMA transfer attributes to HW
> specific descriptors

What are all these attributes? Do you have a reference link for them?
Morten Brørup June 23, 2021, 10:10 a.m. UTC | #43
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, 23 June 2021 11.41
> 
> [Also, you are not likely to want to offload a small
> copy, are you?]

Maybe. I would consider it for two reasons:

1. To avoid CPU cache pollution. E.g. copy packet data (or part of it) to memory that is not going to be used any time soon, such as a packet capture buffer.
When a packet has been DPI analyzed at ingress and put in a QoS queue for holding, the packet data will be cold at the time of egress. If such a packet is to be captured at egress, the CPU is not going to touch the data in the original packet anymore (only the TX NIC will). And if the copy of the packet is queued to be processed by a "capture storage readout" pipeline stage, the CPU core that copied the packet will not touch the copy either (only the CPU core in the "capture storage readout" pipeline stage will; but that may be minutes or hours later).

2. To simplify code by using the same method, regardless of size.
Although it might be more efficient copying 64 B packets using CPU instructions and 1518 B packets using DMA, writing code that only uses one of the methods might simplify things.

-Morten
Jerin Jacob June 23, 2021, 11 a.m. UTC | #44
On Wed, Jun 23, 2021 at 9:20 AM fengchengwen <fengchengwen@huawei.com> wrote:
>
> On 2021/6/23 1:51, Jerin Jacob wrote:
> > On Fri, Jun 18, 2021 at 2:22 PM fengchengwen <fengchengwen@huawei.com> wrote:
> >>
> >> On 2021/6/17 22:18, Bruce Richardson wrote:
> >>> On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
> >>>> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
> >>>>> On 2021/6/17 1:31, Bruce Richardson wrote:
> >>>>>> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> >>>>>>> On 2021/6/16 0:38, Bruce Richardson wrote:
> >>>>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>>>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
> >>>>>>>>> device.
> >>>>>>>>>
> >>>>>>>>> The APIs of dmadev library exposes some generic operations which can
> >>>>>>>>> enable configuration and I/O with the DMA devices.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> >>>>>>>>> ---
> >>>>>>>> Thanks for sending this.
> >>>>>>>>
> >>>>>>>> Of most interest to me right now are the key data-plane APIs. While we are
> >>>>>>>> still in the prototyping phase, below is a draft of what we are thinking
> >>>>>>>> for the key enqueue/perform_ops/completed_ops APIs.
> >>>>>>>>
> >>>>>>>> Some key differences I note in below vs your original RFC:
> >>>>>>>> * Use of void pointers rather than iova addresses. While using iova's makes
> >>>>>>>>   sense in the general case when using hardware, in that it can work with
> >>>>>>>>   both physical addresses and virtual addresses, if we change the APIs to use
> >>>>>>>>   void pointers instead it will still work for DPDK in VA mode, while at the
> >>>>>>>>   same time allow use of software fallbacks in error cases, and also a stub
> >>>>>>>>   driver than uses memcpy in the background. Finally, using iova's makes the
> >>>>>>>>   APIs a lot more awkward to use with anything but mbufs or similar buffers
> >>>>>>>>   where we already have a pre-computed physical address.
> >>>>>>>
> >>>>>>> The iova is an hint to application, and widely used in DPDK.
> >>>>>>> If switch to void, how to pass the address (iova or just va ?)
> >>>>>>> this may introduce implementation dependencies here.
> >>>>>>>
> >>>>>>> Or always pass the va, and the driver performs address translation, and this
> >>>>>>> translation may cost too much cpu I think.
> >>>>>>>
> >>>>>>
> >>>>>> On the latter point, about driver doing address translation I would agree.
> >>>>>> However, we probably need more discussion about the use of iova vs just
> >>>>>> virtual addresses. My thinking on this is that if we specify the API using
> >>>>>> iovas it will severely hurt usability of the API, since it forces the user
> >>>>>> to take more inefficient codepaths in a large number of cases. Given a
> >>>>>> pointer to the middle of an mbuf, one cannot just pass that straight as an
> >>>>>> iova but must instead do a translation into offset from mbuf pointer and
> >>>>>> then readd the offset to the mbuf base address.
> >>>>>>
> >>>>>> My preference therefore is to require the use of an IOMMU when using a
> >>>>>> dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> >>>>>> present, DPDK will run in VA mode, allowing virtual addresses to our
> >>>>>> hugepage memory to be sent directly to hardware. Also, when using
> >>>>>> dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> >>>>>> management for the app, removing further the restrictions on what memory
> >>>>>> can be addressed by hardware.
> >>>>>
> >>>>> Some DMA devices many don't support IOMMU or IOMMU bypass default, so driver may
> >>>>> should call rte_mem_virt2phy() do the address translate, but the rte_mem_virt2phy()
> >>>>> cost too many CPU cycles.
> >>>>>
> >>>>> If the API defined as iova, it will work fine in:
> >>>>> 1) If DMA don't support IOMMU or IOMMU bypass, then start application with
> >>>>>    --iova-mode=pa
> >>>>> 2) If DMA support IOMMU, --iova-mode=pa/va work both fine
> >>>>>
> >>>>
> >>>> I suppose if we keep the iova as the datatype, we can just cast "void *"
> >>>> pointers to that in the case that virtual addresses can be used directly. I
> >>>> believe your RFC included a capability query API - "uses void * as iova"
> >>>> should probably be one of those capabilities, and that would resolve this.
> >>>> If DPDK is in iova=va mode because of the presence of an iommu, all drivers
> >>>> could report this capability too.
> >>>>
> >>>>>>
> >>>>>>>> * Use of id values rather than user-provided handles. Allowing the user/app
> >>>>>>>>   to manage the amount of data stored per operation is a better solution, I
> >>>>>>>>   feel than proscribing a certain about of in-driver tracking. Some apps may
> >>>>>>>>   not care about anything other than a job being completed, while other apps
> >>>>>>>>   may have significant metadata to be tracked. Taking the user-context
> >>>>>>>>   handles out of the API also makes the driver code simpler.
> >>>>>>>
> >>>>>>> The user-provided handle was mainly used to simply application implementation,
> >>>>>>> It provides the ability to quickly locate contexts.
> >>>>>>>
> >>>>>>> The "use of id values" seem like the dma_cookie of Linux DMA engine framework,
> >>>>>>> user will get a unique dma_cookie after calling dmaengine_submit(), and then
> >>>>>>> could use it to call dma_async_is_tx_complete() to get completion status.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, the idea of the id is the same - to locate contexts. The main
> >>>>>> difference is that if we have the driver manage contexts or pointer to
> >>>>>> contexts, as well as giving more work to the driver, it complicates the APIs
> >>>>>> for measuring completions. If we use an ID-based approach, where the app
> >>>>>> maintains its own ring of contexts (if any), it avoids the need to have an
> >>>>>> "out" parameter array for returning those contexts, which needs to be
> >>>>>> appropriately sized. Instead we can just report that all ids up to N are
> >>>>>> completed. [This would be similar to your suggestion that N jobs be
> >>>>>> reported as done, in that no contexts are provided, it's just that knowing
> >>>>>> the ID of what is completed is generally more useful than the number (which
> >>>>>> can be obviously got by subtracting the old value)]
> >>>>>>
> >>>>>> We are still working on prototyping all this, but would hope to have a
> >>>>>> functional example of all this soon.
> >>>>>>
> >>>>>>> How about define the copy prototype as following:
> >>>>>>>   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
> >>>>>>> while the dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
> >>>>>>> enqueue successful else fail.
> >>>>>>> when complete the dmadev will return latest completed dma_cookie, and the
> >>>>>>> application could use the dma_cookie to quick locate contexts.
> >>>>>>>
> >>>>>>
> >>>>>> If I understand this correctly, I believe this is largely what I was
> >>>>>> suggesting - just with the typedef for the type? In which case it obviously
> >>>>>> looks good to me.
> >>>>>>
> >>>>>>>> * I've kept a single combined API for completions, which differs from the
> >>>>>>>>   separate error handling completion API you propose. I need to give the
> >>>>>>>>   two function approach a bit of thought, but likely both could work. If we
> >>>>>>>>   (likely) never expect failed ops, then the specifics of error handling
> >>>>>>>>   should not matter that much.
> >>>>>>>
> >>>>>>> The rte_ioat_completed_ops API is too complex, and consider some applications
> >>>>>>> may never copy fail, so split them as two API.
> >>>>>>> It's indeed not friendly to other scenarios that always require error handling.
> >>>>>>>
> >>>>>>> I prefer use completed operations number as return value other than the ID so
> >>>>>>> that application could simple judge whether have new completed operations, and
> >>>>>>> the new prototype:
> >>>>>>>  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> >>>>>>>
> >>>>>>> 1) for normal case which never expect failed ops:
> >>>>>>>    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
> >>>>>>> 2) for other case:
> >>>>>>>    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
> >>>>>>>    at this point the fails <= ret <= max_status
> >>>>>>>
> >>>>>> Completely agree that we need to plan for the happy-day case where all is
> >>>>>> passing. Looking at the prototypes you have above, I am ok with returning
> >>>>>> number of completed ops as the return value with the final completed cookie
> >>>>>> as an "out" parameter.
> >>>>>> For handling errors, I'm ok with what you propose above, just with one
> >>>>>> small adjustment - I would remove the restriction that ret <= max_status.
> >>>>>>
> >>>>>> In case of zero-failures, we can report as many ops succeeding as we like,
> >>>>>> and even in case of failure, we can still report as many successful ops as
> >>>>>> we like before we start filling in the status field. For example, if 32 ops
> >>>>>> are completed, and the last one fails, we can just fill in one entry into
> >>>>>> status, and return 32. Alternatively if the 4th last one fails we fill in 4
> >>>>>> entries and return 32. The only requirements would be:
> >>>>>> * fails <= max_status
> >>>>>> * fails <= ret
> >>>>>> * cookie holds the id of the last entry in status.
> >>>>>
> >>>>> I think we understand the same:
> >>>>>
> >>>>> The fails <= ret <= max_status include following situation:
> >>>>> 1) If max_status is 32, and there are 32 completed ops, then the ret will be 32
> >>>>> no matter which ops is failed
> >>>>> 2) If max_status is 33, and there are 32 completed ops, then the ret will be 32
> >>>>> 3) If max_status is 16, and there are 32 completed ops, then the ret will be 16
> >>>>>
> >>>>> and the cookie always hold the id of the last returned completed ops, no matter
> >>>>> it's completed successful or failed
> >>>>>
> >>>>
> >>>> I actually disagree on the #3. If max_status is 16, there are 32 completed
> >>>> ops, and *no failures* the ret will be 32, not 16, because we are not
> >>>> returning any status entries so max_status need not apply. Keeping that
> >>>> same scenario #3, depending on the number of failures and the point of
> >>>> them, the return value may similarly vary, for example:
> >>>> * if job #28 fails, then ret could still be 32, cookie would be the cookie
> >>>>   for that job, "fails" parameter would return as 4, with status holding the
> >>>>   failure of 28 plus the succeeded status of jobs 29-31, i.e. 4 elements.
> >>>> * if job #5 fails, then we can't fit the status list from 5 though 31 in an
> >>>>   array of 16, so "fails" == 16(max_status) and status contains the 16
> >>>>   statuses starting from #5, which means that cookie contains the value for
> >>>>   job #20 and ret is 21.
> >>>>
> >>>> In other words, ignore max_status and status parameters *unless we have an
> >>>> error to return*, meaning the fast-path/happy-day case works as fast as
> >>>> possible. You don't need to worry about sizing your status array to be big,
> >>>> and you always get back a large number of completions when available. Your
> >>>> fastpath code only need check the "fails" parameter to see if status needs
> >>>> to ever be consulted, and in normal case it doesn't.
> >>>>
> >>>> If this is too complicated, maybe we can simplify a little by returning just
> >>>> one failure at a time, though at the cost of making error handling slower?
> >>>>
> >>>> rte_dmadev_completed(dev_id, &cookie, &failure_status)
> >>>>
> >>>> In this case, we always return the number of completed ops on success,
> >>>> while on failure, we return the first error code. For a single error, this
> >>>> works fine, but if we get a burst of errors together, things will work
> >>>> slower - which may be acceptable if errors are very rare. However, for idxd
> >>>> at least if a fence occurs after a failure all jobs in the batch after the
> >>>> fence would be skipped, which would lead to the "burst of errors" case.
> >>>> Therefore, I'd prefer to have the original suggestion allowing multiple
> >>>> errors to be reported at a time.
> >>>>
> >>>> /Bruce
> >>>
> >>> Apologies for self-reply, but thinking about it more, a combination of
> >>> normal-case and error-case APIs may be just simpler:
> >>>
> >>> int rte_dmadev_completed(dev_id, &cookie)
> >>>
> >>> returns number of items completed and cookie of last item. If there is an
> >>> error, returns all successfull values up to the error entry and returns -1
> >>> on subsequent call.
> >>>
> >>> int rte_dmadev_completed_status(dev_id, &cookie, max_status, status_array,
> >>>       &error_count)
> >>>
> >>> this is a slower completion API which behaves like you originally said
> >>> above, returning number of completions x, 0 <= x <= max_status, with x
> >>> status values filled into array, and the number of unsuccessful values in
> >>> the error_count value.
> >>>
> >>> This would allow code to be written in the application to use
> >>> rte_dmadev_completed() in the normal case, and on getting a "-1" value, use
> >>> rte_dmadev_completed_status() to get the error details. If strings of
> >>> errors might be expected, the app can continually use the
> >>> completed_status() function until error_count returns 0, and then switch
> >>> back to the faster/simpler version.
> >>
> >> This two-function simplify the status_array's maintenance because we don't need init it to zero.
> >> I think it's a good trade-off between performance and rich error info (status code).
> >>
> >> Here I'd like to discuss the 'burst size', which is widely used in DPDK application (e.g.
> >> nic polling or ring en/dequeue).
> >> Currently we don't define a max completed ops in rte_dmadev_completed() API, the return
> >> value may greater than 'burst size' of application, this may result in the application need to
> >> maintain (or remember) the return value of the function and special handling at the next poll.
> >>
> >> Also consider there may multiple calls rte_dmadev_completed to check fail, it may make it
> >> difficult for the application to use.
> >>
> >> So I prefer following prototype:
> >>   uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error)
> >>     -- nb_cpls: indicate max process operations number
> >>     -- has_error: indicate if there is an error
> >>     -- return value: the number of successful completed operations.
> >>     -- example:
> >>        1) If there are already 32 completed ops, and 4th is error, and nb_cpls is 32, then
> >>           the ret will be 3(because 1/2/3th is OK), and has_error will be true.
> >>        2) If there are already 32 completed ops, and all successful completed, then the ret
> >>           will be min(32, nb_cpls), and has_error will be false.
> >>        3) If there are already 32 completed ops, and all failed completed, then the ret will
> >>           be 0, and has_error will be true.
> >>   uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_status, uint32_t *status)
> >>     -- return value: the number of failed completed operations.
> >
> >
> >
> > In typical storage use cases etc, Sometimes application need to
> > provide scatter-gather list,
> > At least in our hardware sg list gives a "single completion result"
> > and it stops on the first failure to restart
> > the transfer by application. Have you thought of scatter-gather use
> > case and how it is in other  HW?
>
> cookie and request are in a one-to-one correspondence, whether the request is a single or sg-list.

OK. Makes sense.

> Kunpeng9x0 don't support sg-list, I'm still investigating other hardware.
>
> The above 'restart the transfer by application' mean re-schedule request (and have one new cookie) or
> just re-enable current failed request (this may introduce new API) ?

I think re-scheduling the request is more portable.

>
> >
> > prototype like the following works for us:
> > rte_dmadev_enq_sg(void **src, void **dest, unsigned int **length, int
> > nb_segments, cookie, ,,,)
>
> OK, we could define one scatter-list struct to wrap src/dest/length.

OK.
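
To make the idea concrete, here is a rough sketch of what such a
wrapper might look like. All names (dma_sg_seg, dma_sg_list,
dma_sg_copy_sw) are hypothetical and not taken from the RFC; the
memcpy loop is only a software stand-in that mimics the "stop on the
first failure" behaviour described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: one segment of a scatter-gather copy request. */
struct dma_sg_seg {
	void *src;
	void *dst;
	uint32_t length;
};

/* Hypothetical: the whole request; a single cookie would cover all segments. */
struct dma_sg_list {
	const struct dma_sg_seg *segs;
	uint16_t nb_segs;
};

/*
 * Software stand-in for a hardware sg copy. It stops at the first bad
 * segment and returns the number of segments copied before that point.
 */
static inline uint16_t
dma_sg_copy_sw(const struct dma_sg_list *sg)
{
	uint16_t i;

	assert(sg != NULL);
	for (i = 0; i < sg->nb_segs; i++) {
		const struct dma_sg_seg *s = &sg->segs[i];

		if (s->src == NULL || s->dst == NULL)
			break; /* first failure stops the whole transfer */
		memcpy(s->dst, s->src, s->length);
	}
	return i;
}
```

If the returned count is less than nb_segs, the application would
re-schedule the request and receive a new cookie for it.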

>
> >
> >
> >>
> >> The application use the following invocation order when polling:
> >>   has_error = false; // could be init to false by dmadev API, we need discuss
> >>   ret = rte_dmadev_completed(dev_id, &cookie, burst_size, &has_error);
> >>   // process successful completed case:
> >>   for (int i = 0; i < ret; i++) {
> >>   }
> >>   if (unlikely(has_error)) {
> >>     // process failed completed case
> >>     ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret, status_array);
> >>     for (int i = 0; i < ret; i++) {
> >>       // ...
> >>     }
> >>   }
> >>
> >>
> >>>
> >>> This two-function approach also allows future support for other DMA
> >>> functions such as comparison, where a status value is always required. Any
> >>> apps using that functionality would just always use the "_status" function
> >>> for completions.
> >>>
> >>> /Bruce
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>
Jerin Jacob June 23, 2021, 11:07 a.m. UTC | #45
On Wed, Jun 23, 2021 at 11:04 AM Hu, Jiayu <jiayu.hu@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Bruce Richardson
> > Sent: Thursday, June 17, 2021 1:31 AM
> > To: fengchengwen <fengchengwen@huawei.com>
> > Cc: thomas@monjalon.net; Yigit, Ferruh <ferruh.yigit@intel.com>;
> > dev@dpdk.org; nipun.gupta@nxp.com; hemant.agrawal@nxp.com;
> > maxime.coquelin@redhat.com; honnappa.nagarahalli@arm.com;
> > jerinj@marvell.com; david.marchand@redhat.com; jerinjacobk@gmail.com
> > Subject: Re: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library
> >
> > On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > > On 2021/6/16 0:38, Bruce Richardson wrote:
> > > > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > >> This patch introduces 'dmadevice' which is a generic type of DMA
> > > >> device.
> > > >>
> > > >> The APIs of dmadev library exposes some generic operations which
> > > >> can enable configuration and I/O with the DMA devices.
> > > >>
> > > >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > >> ---
> > > > Thanks for sending this.
> > > >
> > > > Of most interest to me right now are the key data-plane APIs. While
> > > > we are still in the prototyping phase, below is a draft of what we
> > > > are thinking for the key enqueue/perform_ops/completed_ops APIs.
> > > >
> > > > Some key differences I note in below vs your original RFC:
> > > > * Use of void pointers rather than iova addresses. While using iova's
> > makes
> > > >   sense in the general case when using hardware, in that it can work with
> > > >   both physical addresses and virtual addresses, if we change the APIs to
> > use
> > > >   void pointers instead it will still work for DPDK in VA mode, while at the
> > > >   same time allow use of software fallbacks in error cases, and also a stub
> > > >   driver than uses memcpy in the background. Finally, using iova's makes
> > the
> > > >   APIs a lot more awkward to use with anything but mbufs or similar
> > buffers
> > > >   where we already have a pre-computed physical address.
> > >
> > > The iova is an hint to application, and widely used in DPDK.
> > > If switch to void, how to pass the address (iova or just va ?) this
> > > may introduce implementation dependencies here.
> > >
> > > Or always pass the va, and the driver performs address translation,
> > > and this translation may cost too much cpu I think.
> > >
> >
> > On the latter point, about driver doing address translation I would agree.
> > However, we probably need more discussion about the use of iova vs just
> > virtual addresses. My thinking on this is that if we specify the API using iovas
> > it will severely hurt usability of the API, since it forces the user to take more
> > inefficient codepaths in a large number of cases. Given a pointer to the
> > middle of an mbuf, one cannot just pass that straight as an iova but must
> > instead do a translation into offset from mbuf pointer and then readd the
> > offset to the mbuf base address.
>
> Agree. Vhost is one consumer of DMA devices. To support SW fallback
> in case of DMA copy errors, vhost needs to pass VA for both DPDK mbuf
> and guest buffer to the callback layer (a middle layer between vhost and
> dma device). If DMA devices use iova, it will require the callback layer to
> call rte_mem_virt2iova() to translate va to iova in data path, even if iova
> is va in some cases. But if DMA devices claim to use va, device differences
> can be hided inside driver, which makes the DMA callback layer simpler
> and more efficient.

+1 to Bruce's suggestion. I think we can use void * by:

- Adding RTE_PCI_DRV_NEED_IOVA_AS_VA in our driver
and
- Adding a capability to say that DMA addresses must come from hugepage
memory (or memory mapped by the IOMMU), not from arbitrary heap or
stack areas; i.e. a capability to say that the
https://www.kernel.org/doc/html/latest/x86/sva.html feature is not
supported for those devices.
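
A minimal sketch of how an application might act on such a capability
is shown here. The flag names (DMA_CAPA_VA_AS_IOVA, DMA_CAPA_SVA), the
mem_kind enum and dma_can_use_va() are made up for illustration and
are not part of the RFC or of any existing DPDK API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits; the names are not from the RFC. */
#define DMA_CAPA_VA_AS_IOVA (1u << 0) /* a void * may be passed as the address */
#define DMA_CAPA_SVA        (1u << 1) /* any process address is usable */

/* Hypothetical classification of where a buffer lives. */
enum mem_kind { MEM_HUGEPAGE, MEM_IOMMU_MAPPED, MEM_HEAP_OR_STACK };

/* Decide whether a virtual address of the given kind can go to the device. */
static inline bool
dma_can_use_va(uint32_t capa, enum mem_kind kind)
{
	assert(kind <= MEM_HEAP_OR_STACK);
	if (!(capa & DMA_CAPA_VA_AS_IOVA))
		return false; /* device needs a translated address */
	if (capa & DMA_CAPA_SVA)
		return true;  /* device can reach any process memory */
	/* Without SVA, only hugepage or IOMMU-mapped memory is usable. */
	return kind != MEM_HEAP_OR_STACK;
}
```

With this check, a software fallback can be chosen up front for
buffers the device cannot address.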


>
> Thanks,
> Jiayu
>
> >
> > My preference therefore is to require the use of an IOMMU when using a
> > dmadev, so that it can be a much closer analog of memcpy. Once an iommu
> > is present, DPDK will run in VA mode, allowing virtual addresses to our
> > hugepage memory to be sent directly to hardware. Also, when using
> > dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> > management for the app, removing further the restrictions on what memory
> > can be addressed by hardware.
> >
> > > > * Use of id values rather than user-provided handles. Allowing the
> > user/app
> > > >   to manage the amount of data stored per operation is a better solution,
> > I
> > > >   feel than proscribing a certain about of in-driver tracking. Some apps
> > may
> > > >   not care about anything other than a job being completed, while other
> > apps
> > > >   may have significant metadata to be tracked. Taking the user-context
> > > >   handles out of the API also makes the driver code simpler.
> > >
> > > The user-provided handle was mainly used to simply application
> > > implementation, It provides the ability to quickly locate contexts.
> > >
> > > The "use of id values" seem like the dma_cookie of Linux DMA engine
> > > framework, user will get a unique dma_cookie after calling
> > > dmaengine_submit(), and then could use it to call
> > dma_async_is_tx_complete() to get completion status.
> > >
> >
> > Yes, the idea of the id is the same - to locate contexts. The main difference is
> > that if we have the driver manage contexts or pointer to contexts, as well as
> > giving more work to the driver, it complicates the APIs for measuring
> > completions. If we use an ID-based approach, where the app maintains its
> > own ring of contexts (if any), it avoids the need to have an "out" parameter
> > array for returning those contexts, which needs to be appropriately sized.
> > Instead we can just report that all ids up to N are completed. [This would be
> > similar to your suggestion that N jobs be reported as done, in that no
> > contexts are provided, it's just that knowing the ID of what is completed is
> > generally more useful than the number (which can be obviously got by
> > subtracting the old value)]
> >
> > We are still working on prototyping all this, but would hope to have a
> > functional example of all this soon.
> >
> > > How about define the copy prototype as following:
> > >   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx) while the
> > > dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
> > > enqueue successful else fail.
> > > when complete the dmadev will return latest completed dma_cookie, and
> > > the application could use the dma_cookie to quick locate contexts.
> > >
> >
> > If I understand this correctly, I believe this is largely what I was suggesting -
> > just with the typedef for the type? In which case it obviously looks good to
> > me.
> >
> > > > * I've kept a single combined API for completions, which differs from the
> > > >   separate error handling completion API you propose. I need to give the
> > > >   two function approach a bit of thought, but likely both could work. If we
> > > >   (likely) never expect failed ops, then the specifics of error handling
> > > >   should not matter that much.
> > >
> > > The rte_ioat_completed_ops API is too complex, and consider some
> > > applications may never copy fail, so split them as two API.
> > > It's indeed not friendly to other scenarios that always require error
> > handling.
> > >
> > > I prefer use completed operations number as return value other than
> > > the ID so that application could simple judge whether have new
> > > completed operations, and the new prototype:
> > >  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie,
> > > uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> > >
> > > 1) for normal case which never expect failed ops:
> > >    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0,
> > > NULL);
> > > 2) for other case:
> > >    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status,
> > &fails);
> > >    at this point the fails <= ret <= max_status
> > >
> > Completely agree that we need to plan for the happy-day case where all is
> > passing. Looking at the prototypes you have above, I am ok with returning
> > number of completed ops as the return value with the final completed cookie
> > as an "out" parameter.
> > For handling errors, I'm ok with what you propose above, just with one small
> > adjustment - I would remove the restriction that ret <= max_status.
> >
> > In case of zero-failures, we can report as many ops succeeding as we like,
> > and even in case of failure, we can still report as many successful ops as we
> > like before we start filling in the status field. For example, if 32 ops are
> > completed, and the last one fails, we can just fill in one entry into status, and
> > return 32. Alternatively if the 4th last one fails we fill in 4 entries and return
> > 32. The only requirements would be:
> > * fails <= max_status
> > * fails <= ret
> > * cookie holds the id of the last entry in status.
> >
> > A further possible variation is to have separate "completed" and
> > "completed_status" APIs, where "completed_status" is as above, but
> > "completed" skips the final 3 parameters and returns -1 on error. In that case
> > the user can fall back to the completed_status call.
> >
> > > >
> > > > For the rest, the control / setup APIs are likely to be rather
> > > > uncontroversial, I suspect. However, I think that rather than xstats
> > > > APIs, the library should first provide a set of standardized stats
> > > > like ethdev does. If driver-specific stats are needed, we can add
> > > > xstats later to the API.
> > >
> > > Agree, will fix in v2
> > >
> > Thanks. In parallel, we will be working on our prototype implementation too,
> > taking in the feedback here, and hopefully send it as an RFC soon.
> > Then we can look to compare and contrast and arrive at an agreed API. It
> > might also be worthwhile to set up a community call for all interested parties
> > in this API to discuss things with a more rapid turnaround. That was done in
> > the past for other new device class APIs that were developed, e.g. eventdev.
> >
> > Regards,
> > /Bruce
Jerin Jacob June 23, 2021, 11:40 a.m. UTC | #46
On Wed, Jun 23, 2021 at 3:07 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Wed, Jun 23, 2021 at 12:51:07PM +0530, Jerin Jacob wrote:
> > On Wed, Jun 23, 2021 at 9:00 AM fengchengwen <fengchengwen@huawei.com> wrote:
> > >

> > >
> > > Currently, it is hard to define generic dma descriptor, I think the well-defined
> > > APIs is feasible.
> >
> > I would like to understand why not feasible? if we move the
> > preparation to the slow path.
> >
> > i.e
> >
> > struct rte_dmadev_desc defines all the "attributes" of all DMA devices available
> > using capability. I believe with the scheme, we can scale and
> > incorporate all features of
> > all DMA HW without any performance impact.
> >
> > something like:
> >
> > struct rte_dmadev_desc {
> >   /* Attributes all DMA transfer available for all HW under capability. */
> >   channel or port;
> >   ops ; // copy, fill etc..
> >  /* Implementation-opaque memory as a zero-length array;
> >   * rte_dmadev_desc_prep() updates this memory with HW-specific information.
> >   */
> >   uint8_t impl_opq[];
> > }
> >
> > // Allocate the memory for a DMA descriptor
> > struct rte_dmadev_desc *rte_dmadev_desc_alloc(devid);
> > // Convert DPDK-specific descriptors to HW-specific descriptors in the slow path
> > rte_dmadev_desc_prep(devid, struct rte_dmadev_desc *desc);
> > // Free DMA descriptor memory
> > rte_dmadev_desc_free(devid, struct rte_dmadev_desc *desc);
> >
> > The above calls in slow path.
> >
> > Only below call in fastpath.
> > // Here desc can be NULL(in case you don't need any specific attribute
> > attached to transfer, if needed, it can be an object which is gone
> > through rte_dmadev_desc_prep())
> > rte_dmadev_enq(devid, struct rte_dmadev_desc *desc, void *src, void
> > *dest, unsigned int len, cookie)
> >
>
> The trouble here is the performance penalty due to building up and tearing
> down structures and passing those structures into functions via function
> pointer. With the APIs for enqueue/dequeue that have been discussed here,
> all parameters will be passed in registers, and then each driver can do a
> write of the actual hardware descriptor straight to cache/memory from
> registers. With the scheme you propose above, the register contains a
> pointer to the data which must then be loaded into the CPU before being
> written out again. This increases our offload cost.

See below.

>
> However, assuming that the desc_prep call is just for slowpath or
> initialization time, I'd be ok to have the functions take an extra
> hw-specific parameter for each call prepared with tx_prep. It would still
> allow all other parameters to be passed in registers. How much data are you
> looking to store in this desc struct? It can't all be represented as flags,
> for example?

There is around 128 bits of metadata for octeontx2. New HW may have
completely different metadata:
http://code.dpdk.org/dpdk/v21.05/source/drivers/raw/octeontx2_dma/otx2_dpi_rawdev.h#L149

I see the following issues with the flags scheme:

- We need to start populating the flags in the fastpath. Since they are
based on capability, the application needs to have different versions
of the fastpath code.
- It is not future proof: it is not easy to add other attributes as
needed when new HW comes with new transfer attributes.


>
> As for the individual APIs, we could do a generic "enqueue" API, which
> takes the op as a parameter, I prefer having each operation as a separate
> function, in order to increase the readability of the code and to reduce

The only issue I see is that every application needs two paths for doing
this, one with _prep() and one with the separate functions, and drivers
need to support both.

> the number of parameters needed per function i.e. thereby saving registers
> needing to be used and potentially making the function calls and offload

My worry is that struct rte_dmadev can hold function pointers for only
<= 8 fastpath functions in a 64B cache line.
When you say a new op, say fill, needs a new function, what will the
change be from the HW driver's point of view? Is it updating the HW
descriptor with the op as _fill_ vs _copy_, or something beyond that?
If it is only about the HW descriptor update, then _prep() can do all
the work, and the driver just needs to copy the desc to HW.

I believe up to 6 arguments are passed in registers on x86 (it is 8 on
arm64). If so, the desc pointer (already populated in HW descriptor
format by _prep()) is in a register, and it would be a simple
64bit/128bit copy from the desc pointer to HW memory in the driver's
enq(). I don't see any overhead in that. On the other hand, if we keep
adding arguments, they will spill out to the stack.
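
To illustrate the split, here is a rough sketch under stated
assumptions: the op codes, the 128-bit header layout, struct hw_slot
and the dma_* function names are all hypothetical, and the opaque area
is a fixed-size struct rather than the zero-length array proposed
earlier, just to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical op codes; not taken from the RFC. */
enum dma_op { DMA_OP_COPY = 1, DMA_OP_FILL = 2 };

/* Hypothetical 128-bit hardware descriptor header (device specific). */
struct hw_desc_hdr {
	uint64_t w0;
	uint64_t w1;
};

/* Generic descriptor: attributes plus pre-built hardware words. */
struct dma_desc {
	uint16_t channel;
	enum dma_op op;
	struct hw_desc_hdr impl_opq; /* filled in by dma_desc_prep() */
};

/* Slow path: translate the generic attributes into hardware format once. */
static void
dma_desc_prep(struct dma_desc *desc)
{
	assert(desc != NULL);
	desc->impl_opq.w0 = ((uint64_t)desc->op << 56) | desc->channel;
	desc->impl_opq.w1 = 0;
}

/* Hypothetical hardware ring slot: header followed by per-transfer fields. */
struct hw_slot {
	struct hw_desc_hdr hdr;
	uint64_t src;
	uint64_t dst;
	uint32_t len;
};

/* Fast path: copy the prepared header, then store the per-call arguments. */
static inline void
dma_enq(struct hw_slot *slot, const struct dma_desc *desc,
	void *src, void *dst, uint32_t len)
{
	slot->hdr = desc->impl_opq; /* the 128-bit copy mentioned above */
	slot->src = (uint64_t)(uintptr_t)src;
	slot->dst = (uint64_t)(uintptr_t)dst;
	slot->len = len;
}
```

Here dma_enq() takes five arguments, so on both x86-64 and arm64 they
all fit in registers, and the op-specific work has already been done by
dma_desc_prep().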



> cost cheaper. Perhaps we can have the "common" ops such as copy, fill, have
> their own functions, and have a generic "enqueue" function for the
> less-commonly used or supported ops?
>
> /Bruce
Jerin Jacob June 23, 2021, 11:46 a.m. UTC | #47
On Wed, Jun 23, 2021 at 3:11 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Tue, Jun 22, 2021 at 10:55:24PM +0530, Jerin Jacob wrote:
> > On Fri, Jun 18, 2021 at 3:11 PM fengchengwen <fengchengwen@huawei.com> wrote:
> > >
> > > On 2021/6/18 13:52, Jerin Jacob wrote:
> > > > On Thu, Jun 17, 2021 at 2:46 PM Bruce Richardson
> > > > <bruce.richardson@intel.com> wrote:
> > > >>
> > > >> On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:
> > > >>> On Wed, Jun 16, 2021 at 3:47 PM fengchengwen <fengchengwen@huawei.com> wrote:
> > > >>>>
> > > >>>> On 2021/6/16 15:09, Morten Brørup wrote:
> > > >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > >>>>>> Sent: Tuesday, 15 June 2021 18.39
> > > >>>>>>
> > > >>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > > >>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
> > > >>>>>>> device.
> > > >>>>>>>
> > > >>>>>>> The APIs of dmadev library exposes some generic operations which can
> > > >>>>>>> enable configuration and I/O with the DMA devices.
> > > >>>>>>>
> > > >>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > > >>>>>>> ---
> > > >>>>>> Thanks for sending this.
> > > >>>>>>
> > > >>>>>> Of most interest to me right now are the key data-plane APIs. While we
> > > >>>>>> are
> > > >>>>>> still in the prototyping phase, below is a draft of what we are
> > > >>>>>> thinking
> > > >>>>>> for the key enqueue/perform_ops/completed_ops APIs.
> > > >>>>>>
> > > >>>>>> Some key differences I note in below vs your original RFC:
> > > >>>>>> * Use of void pointers rather than iova addresses. While using iova's
> > > >>>>>> makes
> > > >>>>>>   sense in the general case when using hardware, in that it can work
> > > >>>>>> with
> > > >>>>>>   both physical addresses and virtual addresses, if we change the APIs
> > > >>>>>> to use
> > > >>>>>>   void pointers instead it will still work for DPDK in VA mode, while
> > > >>>>>> at the
> > > >>>>>>   same time allow use of software fallbacks in error cases, and also a
> > > >>>>>> stub
> > > >>>>>>   driver than uses memcpy in the background. Finally, using iova's
> > > >>>>>> makes the
> > > >>>>>>   APIs a lot more awkward to use with anything but mbufs or similar
> > > >>>>>> buffers
> > > >>>>>>   where we already have a pre-computed physical address.
> > > >>>>>> * Use of id values rather than user-provided handles. Allowing the
> > > >>>>>> user/app
> > > >>>>>>   to manage the amount of data stored per operation is a better
> > > >>>>>> solution, I
> > > >>>>>>   feel than proscribing a certain about of in-driver tracking. Some
> > > >>>>>> apps may
> > > >>>>>>   not care about anything other than a job being completed, while other
> > > >>>>>> apps
> > > >>>>>>   may have significant metadata to be tracked. Taking the user-context
> > > >>>>>>   handles out of the API also makes the driver code simpler.
> > > >>>>>> * I've kept a single combined API for completions, which differs from
> > > >>>>>> the
> > > >>>>>>   separate error handling completion API you propose. I need to give
> > > >>>>>> the
> > > >>>>>>   two function approach a bit of thought, but likely both could work.
> > > >>>>>> If we
> > > >>>>>>   (likely) never expect failed ops, then the specifics of error
> > > >>>>>> handling
> > > >>>>>>   should not matter that much.
> > > >>>>>>
> > > >>>>>> For the rest, the control / setup APIs are likely to be rather
> > > >>>>>> uncontroversial, I suspect. However, I think that rather than xstats
> > > >>>>>> APIs,
> > > >>>>>> the library should first provide a set of standardized stats like
> > > >>>>>> ethdev
> > > >>>>>> does. If driver-specific stats are needed, we can add xstats later to
> > > >>>>>> the
> > > >>>>>> API.
> > > >>>>>>
> > > >>>>>> Appreciate your further thoughts on this, thanks.
> > > >>>>>>
> > > >>>>>> Regards,
> > > >>>>>> /Bruce
> > > >>>>>
> > > >>>>> I generally agree with Bruce's points above.
> > > >>>>>
> > > >>>>> I would like to share a couple of ideas for further discussion:
> > > >>>
> > > >>>
> > > >>> I believe some of the other requirements and comments for generic DMA will be
> > > >>>
> > > >>> 1) Support for the _channel_, Each channel may have different
> > > >>> capabilities and functionalities.
> > > >>> Typical cases are, each channel have separate source and destination
> > > >>> devices like
> > > >>> DMA between PCIe EP to Host memory, Host memory to Host memory, PCIe
> > > >>> EP to PCIe EP.
> > > >>> So we need some notion of the channel in the specification.
> > > >>>
> > > >>
> > > >> Can you share a bit more detail on what constitutes a channel in this case?
> > > >> Is it equivalent to a device queue (which we are flattening to individual
> > > >> devices in this API), or to a specific configuration on a queue?
> > > >
> > > > It not a queue. It is one of the attributes for transfer.
> > > > I.e in the same queue, for a given transfer it can specify the
> > > > different "source" and "destination" device.
> > > > Like CPU to Sound card, CPU to network card etc.
> > > >
> > > >
> > > >>
> > > >>> 2) I assume current data plane APIs are not thread-safe. Is it right?
> > > >>>
> > > >> Yes.
> > > >>
> > > >>>
> > > >>> 3) Cookie scheme outlined earlier looks good to me. Instead of having
> > > >>> generic dequeue() API
> > > >>>
> > > >>> 4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
> > > >>> void * dst, unsigned int length);
> > > >>> to two stage API like, Where one will be used in fastpath and other
> > > >>> one will use used in slowpath.
> > > >>>
> > > >>> - slowpath API will for take channel and take other attributes for transfer
> > > >>>
> > > >>> Example syantx will be:
> > > >>>
> > > >>> struct rte_dmadev_desc {
> > > >>>            channel id;
> > > >>>            ops ; // copy, xor, fill etc
> > > >>>           other arguments specific to dma transfer // it can be set
> > > >>> based on capability.
> > > >>>
> > > >>> };
> > > >>>
> > > >>> rte_dmadev_desc_t rte_dmadev_preprare(uint16_t dev_id,  struct
> > > >>> rte_dmadev_desc *dec);
> > > >>>
> > > >>> - Fastpath takes arguments that need to change per transfer along with
> > > >>> slow-path handle.
> > > >>>
> > > >>> rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst, unsigned
> > > >>> int length,  rte_dmadev_desc_t desc)
> > > >>>
> > > >>> This will help to driver to
> > > >>> -Former API form the device-specific descriptors in slow path  for a
> > > >>> given channel and fixed attributes per transfer
> > > >>> -Later API blend "variable" arguments such as src, dest address with
> > > >>> slow-path created descriptors
> > > >>>
> > > >>
> > > >> This seems like an API for a context-aware device, where the channel is the
> > > >> config data/context that is preserved across operations - is that correct?
> > > >> At least from the Intel DMA accelerators side, we have no concept of this
> > > >> context, and each operation is completely self-described. The location or
> > > >> type of memory for copies is irrelevant, you just pass the src/dst
> > > >> addresses to reference.
> > > >
> > > > it is not context-aware device. Each HW JOB is self-described.
> > > > You can view it different attributes of transfer.
> > > >
> > > >
> > > >>
> > > >>> The above will give better performance and is the best trade-off c
> > > >>> between performance and per transfer variables.
> > > >>
> > > >> We may need to have different APIs for context-aware and context-unaware
> > > >> processing, with which to use determined by the capabilities discovery.
> > > >> Given that for these DMA devices the offload cost is critical, more so than
> > > >> any other dev class I've looked at before, I'd like to avoid having APIs
> > > >> with extra parameters than need to be passed about since that just adds
> > > >> extra CPU cycles to the offload.
> > > >
> > > > If driver does not support additional attributes and/or the
> > > > application does not need it, rte_dmadev_desc_t can be NULL.
> > > > So that it won't have any cost in the datapath. I think, we can go to
> > > > different API
> > > > cases if we can not abstract problems without performance impact.
> > > > Otherwise, it will be too much
> > > > pain for applications.
> > >
> > > Yes, currently we plan to use different API for different case, e.g.
> > >   rte_dmadev_memcpy()  -- deal with local to local memcopy
> > >   rte_dmadev_memset()  -- deal with fill with local memory with pattern
> > > maybe:
> > >   rte_dmadev_imm_data()  --deal with copy very little data
> > >   rte_dmadev_p2pcopy()   --deal with peer-to-peer copy of diffenet PCIE addr
> > >
> > > These API capabilities will be reflected in the device capability set so that
> > > application could know by standard API.
> >
> >
> > There will be a lot of combination of that it will be like M x N cross
> > base case, It won't scale.
> >
>
> What are the various cases that are so significantly different? Using the
> examples above, the "imm_data" and "p2p_copy" operations are still copy
> ops, and the fact of it being a small copy or a p2p one can be expressed
> just using flags? [Also, you are not likely to want to offload a small
> copy, are you?]

I meant, the p2p version can have memcpy, memset and _imm_data too. So it
has gone from 4 to 6 functions now, and if we add one more op, it becomes 8.

IMO, a separate function is good if the driver needs to do a radically
different thing. In our hardware, it is about updating the descriptor
field differently. Is it so with other HW? If so, _prep() makes life easy.


>
> > >
> > > >
> > > > Just to understand, I think, we need to HW capabilities and how to
> > > > have a common API.
> > > > I assume HW will have some HW JOB descriptors which will be filled in
> > > > SW and submitted to HW.
> > > > In our HW,  Job descriptor has the following main elements
> > > >
> > > > - Channel   // We don't expect the application to change per transfer
> > > > - Source address - It can be scatter-gather too - Will be changed per transfer
> > > > - Destination address - It can be scatter-gather too - Will be changed
> > > > per transfer
> > > > - Transfer Length - - It can be scatter-gather too - Will be changed
> > > > per transfer
> > > > - IOVA address where HW post Job completion status PER Job descriptor
> > > > - Will be changed per transfer
> > > > - Another sideband information related to channel  // We don't expect
> > > > the application to change per transfer
> > > > - As an option, Job completion can be posted as an event to
> > > > rte_event_queue  too // We don't expect the application to change per
> > > > transfer
> > >
> > > The 'option' field looks like a software interface field, but not HW descriptor.
> >
> > It is in HW descriptor.
> >
> > >
> > > >
> > > > @Richardson, Bruce @fengchengwen @Hemant Agrawal
> > > >
> > > > Could you share the options for your HW descriptors  which you are
> > > > planning to expose through API like above so that we can easily
> > > > converge on fastpath API
> > > >
> > >
> > > Kunpeng HW descriptor is self-describing, and don't need refer context info.
> > >
> > > Maybe the fields which was fix with some transfer type could setup by driver, and
> > > don't expose to application.
> >
> > Yes. I agree.I think, that reason why I though to have
> > rte_dmadev_prep() call to convert DPDK DMA transfer attributes to HW
> > specific descriptors
>
> What are all these attributes? Do you have a reference link for them?
>
Bruce Richardson June 23, 2021, 2:19 p.m. UTC | #48
On Wed, Jun 23, 2021 at 05:10:22PM +0530, Jerin Jacob wrote:
> On Wed, Jun 23, 2021 at 3:07 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Wed, Jun 23, 2021 at 12:51:07PM +0530, Jerin Jacob wrote:
> > > On Wed, Jun 23, 2021 at 9:00 AM fengchengwen <fengchengwen@huawei.com> wrote:
> > > >
> 
> > > >
> > > > Currently, it is hard to define generic dma descriptor, I think the well-defined
> > > > APIs is feasible.
> > >
> > > I would like to understand why not feasible? if we move the
> > > preparation to the slow path.
> > >
> > > i.e
> > >
> > > struct rte_dmadev_desc defines all the "attributes" of all DMA devices available
> > > using capability. I believe with the scheme, we can scale and
> > > incorporate all features of
> > > all DMA HW without any performance impact.
> > >
> > > something like:
> > >
> > > struct rte_dmadev_desc {
> > >   /* Attributes all DMA transfer available for all HW under capability. */
> > >   channel or port;
> > >   ops ; // copy, fill etc..
> > >  /* impemention opqueue memory as zero length array,
> > > rte_dmadev_desc_prep() update this memory with HW specific information
> > > */
> > >   uint8_t impl_opq[];
> > > }
> > >
> > > // allocate the memory for dma decriptor
> > > struct rte_dmadev_desc *rte_dmadev_desc_alloc(devid);
> > > // Convert DPDK specific descriptors to HW specific descriptors in slowpath */
> > > rte_dmadev_desc_prep(devid, struct rte_dmadev_desc *desc);
> > > // Free dma descriptor memory
> > > rte_dmadev_desc_free(devid, struct rte_dmadev_desc *desc )
> > >
> > > The above calls in slow path.
> > >
> > > Only below call in fastpath.
> > > // Here desc can be NULL(in case you don't need any specific attribute
> > > attached to transfer, if needed, it can be an object which is gone
> > > through rte_dmadev_desc_prep())
> > > rte_dmadev_enq(devid, struct rte_dmadev_desc *desc, void *src, void
> > > *dest, unsigned int len, cookie)
> > >
> >
> > The trouble here is the performance penalty due to building up and tearing
> > down structures and passing those structures into functions via function
> > pointer. With the APIs for enqueue/dequeue that have been discussed here,
> > all parameters will be passed in registers, and then each driver can do a
> > write of the actual hardware descriptor straight to cache/memory from
> > registers. With the scheme you propose above, the register contains a
> > pointer to the data which must then be loaded into the CPU before being
> > written out again. This increases our offload cost.
> 
> See below.
> 
> >
> > However, assuming that the desc_prep call is just for slowpath or
> > initialization time, I'd be ok to have the functions take an extra
> > hw-specific parameter for each call prepared with tx_prep. It would still
> > allow all other parameters to be passed in registers. How much data are you
> > looking to store in this desc struct? It can't all be represented as flags,
> > for example?
> 
> There is around 128bit of metadata for octeontx2. New HW may
> completely different metata
> http://code.dpdk.org/dpdk/v21.05/source/drivers/raw/octeontx2_dma/otx2_dpi_rawdev.h#L149
> 
> I see following issue with flags scheme:
> 
> - We need to start populate in fastpath, Since it based on capabality,
> application needs to have
> different versions of fastpath code
> - Not future proof, Not easy add other stuff as needed when new HW
> comes with new
> transfer attributes.
> 
> 

Understood. Would the "tx_prep" (or perhaps op_prep) function you proposed
solve that problem, if it were passed (along with flags) to the
enqueue_copy function? i.e. would the below work for you, and if so, what
parameters would you see passed to the prep function?

metad = rte_dma_op_prep(dev_id, ....)

rte_dma_enqueue_copy(dev_id, src, dst, len, flags, metad)
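
As a rough illustration of that two-call pattern, the sketch below uses a
software stub that does the copy with memcpy. Everything named sketch_* is
hypothetical: the metadata contents, the "channel" parameter to the prep
call and the use of a device pointer in place of dev_id are assumptions
made only so the example compiles, and the real prep parameters are still
the open question above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opaque metadata produced by a prep call; contents invented. */
struct sketch_metad {
	uint64_t hw_words[2];
};

/* Hypothetical per-device state for a software stub driver. */
struct sketch_dev {
	uint16_t next_id;  /* id returned for the next enqueued job */
	uint64_t last_hw0; /* records which metadata the last job carried */
};

/* Slow path. The channel parameter is only a placeholder attribute. */
static struct sketch_metad
sketch_op_prep(uint8_t channel)
{
	struct sketch_metad m = { { channel, 0 } };
	return m;
}

/* Fast path with six arguments; metad may be NULL if no attributes apply. */
static int
sketch_enqueue_copy(struct sketch_dev *dev, const void *src, void *dst,
		    unsigned int len, uint32_t flags,
		    const struct sketch_metad *metad)
{
	assert(dev != NULL);
	(void)flags; /* unused by this stub */
	dev->last_hw0 = (metad != NULL) ? metad->hw_words[0] : 0;
	memcpy(dst, src, len); /* stub: perform the copy in software */
	return dev->next_id++;
}
```

A driver with no use for the metadata would simply ignore the last
argument, which is why passing NULL costs it nothing beyond one register.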


> >
> > As for the individual APIs, we could do a generic "enqueue" API, which
> > takes the op as a parameter, I prefer having each operation as a separate
> > function, in order to increase the readability of the code and to reduce
> 
> Only issue I see, all application needs have two path for doing the stuff,
> one with _prep() and separate function() and drivers need to support both.
> 
If prep is not called per-op, we could always mandate it be called before
the actual enqueue functions, even if some drivers ignore the argument.

> > the number of parameters needed per function i.e. thereby saving registers
> > needing to be used and potentially making the function calls and offload
> 
> My worry is, struct rte_dmadev can hold only function pointers for <=
> 8 fastpath functions for 64B cache line.
> When you say new op, say fill, need a new function, What will be the
> change wrt HW
> driver point of view? Is it updating HW descriptor with op as _fill_
> vs _copy_? something beyond that?

Well, from a user viewpoint each operation takes different parameters, so
for a fill operation, you have a destination address, but instead of the
source address used for a copy you have a pattern for the fill.

Internally, for those two ops, the only difference in input and descriptor
writing is indeed in the op flag, and no additional context or metadata is
needed for a copy other than source address, destination address and
length (plus maybe some flags, e.g. for caching behaviour or the like), so
having the extra prep function adds no value for us, and loading data from
a prebuilt structure just adds more IO overhead. Therefore, I'd be ok to
add it for enabling other hardware, but only in such a way that it doesn't
impact the offload cost.

If in future we want to look at adding more advanced or complex
capabilities, I'm ok with adding a general "enqueue_op" function which
takes multiple op types, but for very simple ops like copy, where we need
to keep the offload cost down to a minimum, having fastpath-specific copy
functions makes a lot of sense to me.

> If it is about, HW descriptor update, then _prep() can do all work,
> just driver need to copy desc to
> to HW.
> 
> I believe upto to 6 arguments passed over registers in x86(it is 8 in
> arm64). if so,
> the desc pointer(already populated in HW descriptor format by _prep())
> is in register, and
> would  be simple 64bit/128bit copy from desc pointer to HW memory on
> driver enq(). I dont see
> any overhead on that, On other side, we if keep adding arguments, it
> will spill out
> to stack.
> 
For a copy operation, we should never need more than 6 arguments - see
proposal above which has 6 including a set of flags and an arbitrary void *
pointer for extensibility. If anything more complex than that is needed,
the generic "enqueue_op" function can be used instead. Let's fast-path the
common, simple case, since that is what is most likely to be used!
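
One way to picture the split between a dedicated copy function and a
generic "enqueue_op" is the sketch below. It runs entirely in software,
and the sketch_* names, the op struct layout and the single-byte fill
pattern are all assumptions for illustration, not a proposed API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum sketch_op_type { SKETCH_OP_COPY, SKETCH_OP_FILL };

/* Hypothetical op description for the generic, less common path. */
struct sketch_op {
	enum sketch_op_type type;
	const void *src;  /* used by COPY */
	uint8_t pattern;  /* used by FILL (one byte, to keep the sketch small) */
	void *dst;
	unsigned int len;
};

/* Dedicated fast path: every parameter arrives in an argument register. */
static int
sketch_copy(void *dst, const void *src, unsigned int len)
{
	memcpy(dst, src, len);
	return 0;
}

/* Generic path: one entry point; fields are loaded from memory via *op. */
static int
sketch_enqueue_op(const struct sketch_op *op)
{
	assert(op != NULL);
	switch (op->type) {
	case SKETCH_OP_COPY:
		return sketch_copy(op->dst, op->src, op->len);
	case SKETCH_OP_FILL:
		memset(op->dst, op->pattern, op->len);
		return 0;
	}
	return -1; /* unknown op type */
}
```

The generic entry point pays for the struct loads and the switch, which is
acceptable for rarely used ops but is exactly the cost the dedicated copy
function avoids.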

/Bruce
Bruce Richardson June 23, 2021, 2:22 p.m. UTC | #49
On Wed, Jun 23, 2021 at 05:16:28PM +0530, Jerin Jacob wrote:
> On Wed, Jun 23, 2021 at 3:11 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Tue, Jun 22, 2021 at 10:55:24PM +0530, Jerin Jacob wrote:
> > > On Fri, Jun 18, 2021 at 3:11 PM fengchengwen
> > > <fengchengwen@huawei.com> wrote:
> > > >
> > > > On 2021/6/18 13:52, Jerin Jacob wrote:
> > > > > On Thu, Jun 17, 2021 at 2:46 PM Bruce Richardson
> > > > > <bruce.richardson@intel.com> wrote:
> > > > >>
> > > > >> On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:
> > > > >>> On Wed, Jun 16, 2021 at 3:47 PM fengchengwen
> > > > >>> <fengchengwen@huawei.com> wrote:
> > > > >>>>
> > > > >>>> On 2021/6/16 15:09, Morten Brørup wrote:
> > > > >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > > >>>>>> Richardson Sent: Tuesday, 15 June 2021 18.39
> > > > >>>>>>
> > > > >>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng
> > > > >>>>>> wrote:
> > > > >>>>>>> This patch introduces 'dmadevice' which is a generic type
> > > > >>>>>>> of DMA device.
> > > > >>>>>>>
> > > > >>>>>>> The APIs of dmadev library exposes some generic operations
> > > > >>>>>>> which can enable configuration and I/O with the DMA
> > > > >>>>>>> devices.
> > > > >>>>>>>
> > > > >>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com> ---
> > > > >>>>>> Thanks for sending this.
> > > > >>>>>>
> > > > >>>>>> Of most interest to me right now are the key data-plane
> > > > >>>>>> APIs. While we are still in the prototyping phase, below is
> > > > >>>>>> a draft of what we are thinking for the key
> > > > >>>>>> enqueue/perform_ops/completed_ops APIs.
> > > > >>>>>>
> > > > >>>>>> Some key differences I note in below vs your original RFC: *
> > > > >>>>>> Use of void pointers rather than iova addresses. While using
> > > > >>>>>> iova's makes sense in the general case when using hardware,
> > > > >>>>>> in that it can work with both physical addresses and virtual
> > > > >>>>>> addresses, if we change the APIs to use void pointers
> > > > >>>>>> instead it will still work for DPDK in VA mode, while at the
> > > > >>>>>> same time allow use of software fallbacks in error cases,
> > > > >>>>>> and also a stub driver than uses memcpy in the background.
> > > > >>>>>> Finally, using iova's makes the APIs a lot more awkward to
> > > > >>>>>> use with anything but mbufs or similar buffers where we
> > > > >>>>>> already have a pre-computed physical address.  * Use of id
> > > > >>>>>> values rather than user-provided handles. Allowing the
> > > > >>>>>> user/app to manage the amount of data stored per operation
> > > > >>>>>> is a better solution, I feel than proscribing a certain
> > > > >>>>>> about of in-driver tracking. Some apps may not care about
> > > > >>>>>> anything other than a job being completed, while other apps
> > > > >>>>>> may have significant metadata to be tracked. Taking the
> > > > >>>>>> user-context handles out of the API also makes the driver
> > > > >>>>>> code simpler.  * I've kept a single combined API for
> > > > >>>>>> completions, which differs from the separate error handling
> > > > >>>>>> completion API you propose. I need to give the two function
> > > > >>>>>> approach a bit of thought, but likely both could work.  If
> > > > >>>>>> we (likely) never expect failed ops, then the specifics of
> > > > >>>>>> error handling should not matter that much.
> > > > >>>>>>
> > > > >>>>>> For the rest, the control / setup APIs are likely to be
> > > > >>>>>> rather uncontroversial, I suspect. However, I think that
> > > > >>>>>> rather than xstats APIs, the library should first provide a
> > > > >>>>>> set of standardized stats like ethdev does. If
> > > > >>>>>> driver-specific stats are needed, we can add xstats later to
> > > > >>>>>> the API.
> > > > >>>>>>
> > > > >>>>>> Appreciate your further thoughts on this, thanks.
> > > > >>>>>>
> > > > >>>>>> Regards, /Bruce
> > > > >>>>>
> > > > >>>>> I generally agree with Bruce's points above.
> > > > >>>>>
> > > > >>>>> I would like to share a couple of ideas for further
> > > > >>>>> discussion:
> > > > >>>
> > > > >>>
> > > > >>> I believe some of the other requirements and comments for
> > > > >>> generic DMA will be
> > > > >>>
> > > > >>> 1) Support for the _channel_, Each channel may have different
> > > > >>> capabilities and functionalities.  Typical cases are, each
> > > > >>> channel have separate source and destination devices like DMA
> > > > >>> between PCIe EP to Host memory, Host memory to Host memory,
> > > > >>> PCIe EP to PCIe EP.  So we need some notion of the channel in
> > > > >>> the specification.
> > > > >>>
> > > > >>
> > > > >> Can you share a bit more detail on what constitutes a channel in
> > > > >> this case?  Is it equivalent to a device queue (which we are
> > > > >> flattening to individual devices in this API), or to a specific
> > > > >> configuration on a queue?
> > > > >
> > > > > It not a queue. It is one of the attributes for transfer.  I.e in
> > > > > the same queue, for a given transfer it can specify the different
> > > > > "source" and "destination" device.  Like CPU to Sound card, CPU
> > > > > to network card etc.
> > > > >
> > > > >
> > > > >>
> > > > >>> 2) I assume current data plane APIs are not thread-safe. Is it
> > > > >>> right?
> > > > >>>
> > > > >> Yes.
> > > > >>
> > > > >>>
> > > > >>> 3) Cookie scheme outlined earlier looks good to me. Instead of
> > > > >>> having generic dequeue() API
> > > > >>>
> > > > >>> 4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void
> > > > >>> * src, void * dst, unsigned int length); to two stage API like,
> > > > >>> Where one will be used in fastpath and other one will use used
> > > > >>> in slowpath.
> > > > >>>
> > > > >>> - slowpath API will for take channel and take other attributes
> > > > >>> for transfer
> > > > >>>
> > > > >>> Example syantx will be:
> > > > >>>
> > > > >>> struct rte_dmadev_desc { channel id; ops ; // copy, xor, fill
> > > > >>> etc other arguments specific to dma transfer // it can be set
> > > > >>> based on capability.
> > > > >>>
> > > > >>> };
> > > > >>>
> > > > >>> rte_dmadev_desc_t rte_dmadev_preprare(uint16_t dev_id,  struct
> > > > >>> rte_dmadev_desc *dec);
> > > > >>>
> > > > >>> - Fastpath takes arguments that need to change per transfer
> > > > >>> along with slow-path handle.
> > > > >>>
> > > > >>> rte_dmadev_enqueue(uint16_t dev_id, void * src, void * dst,
> > > > >>> unsigned int length,  rte_dmadev_desc_t desc)
> > > > >>>
> > > > >>> This will help to driver to -Former API form the
> > > > >>> device-specific descriptors in slow path  for a given channel
> > > > >>> and fixed attributes per transfer -Later API blend "variable"
> > > > >>> arguments such as src, dest address with slow-path created
> > > > >>> descriptors
> > > > >>>
> > > > >>
> > > > >> This seems like an API for a context-aware device, where the
> > > > >> channel is the config data/context that is preserved across
> > > > >> operations - is that correct?  At least from the Intel DMA
> > > > >> accelerators side, we have no concept of this context, and each
> > > > >> operation is completely self-described. The location or type of
> > > > >> memory for copies is irrelevant, you just pass the src/dst
> > > > >> addresses to reference.
> > > > >
> > > > > it is not context-aware device. Each HW JOB is self-described.
> > > > > You can view it different attributes of transfer.
> > > > >
> > > > >
> > > > >>
> > > > >>> The above will give better performance and is the best
> > > > >>> trade-off c between performance and per transfer variables.
> > > > >>
> > > > >> We may need to have different APIs for context-aware and
> > > > >> context-unaware processing, with which to use determined by the
> > > > >> capabilities discovery.  Given that for these DMA devices the
> > > > >> offload cost is critical, more so than any other dev class I've
> > > > >> looked at before, I'd like to avoid having APIs with extra
> > > > >> parameters than need to be passed about since that just adds
> > > > >> extra CPU cycles to the offload.
> > > > >
> > > > > If driver does not support additional attributes and/or the
> > > > > application does not need it, rte_dmadev_desc_t can be NULL.  So
> > > > > that it won't have any cost in the datapath. I think, we can go
> > > > > to different API cases if we can not abstract problems without
> > > > > performance impact.  Otherwise, it will be too much pain for
> > > > > applications.
> > > >
> > > > Yes, currently we plan to use different API for different case,
> > > > e.g.  rte_dmadev_memcpy()  -- deal with local to local memcopy
> > > > rte_dmadev_memset()  -- deal with fill with local memory with
> > > > pattern maybe: rte_dmadev_imm_data()  --deal with copy very little
> > > > data rte_dmadev_p2pcopy()   --deal with peer-to-peer copy of
> > > > diffenet PCIE addr
> > > >
> > > > These API capabilities will be reflected in the device capability
> > > > set so that application could know by standard API.
> > >
> > >
> > > There will be a lot of combination of that it will be like M x N
> > > cross base case, It won't scale.
> > >
> >
> > What are the various cases that are so significantly different? Using
> > the examples above, the "imm_data" and "p2p_copy" operations are still
> > copy ops, and the fact of it being a small copy or a p2p one can be
> > expressed just using flags? [Also, you are not likely to want to
> > offload a small copy, are you?]
> 
> I meant, p2p version can have memcpy, memset, _imm_data. So it is gone to
> 4 to 6 now, If we add one more op, it becomes 8 function.
> 
> IMO, a separate function is good if driver need to do radically different
> thing. In our hardware, it is about updating the descriptor field
> differently, Is it so with other HW? If so, _prep() makes life easy.
>
I disagree. Sure, there is a matrix of possibilities, but using the set
above, memcpy == copy, and both memset and imm_data seem like a "fill op" to
me, so to have those work with both p2p and DRAM you should only need two
functions with a flag to indicate p2p or mem-mem (or two flags if you want
to indicate src and dest in pci or memory independently). I'm just not
seeing where the massive structs need to be passed around and slow things
down.
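
A small sketch of the "two functions plus flags" idea, assuming invented
flag names and an invented descriptor layout (nothing here is taken from
real hardware or from the RFC). The two builders only fill in a
descriptor; the point is that the address-space choice travels as flag
bits rather than as extra functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits; names and values are invented for this sketch. */
#define SKETCH_F_SRC_PCI (1u << 0) /* source is a PCI address, else memory */
#define SKETCH_F_DST_PCI (1u << 1) /* destination is a PCI address */
#define SKETCH_F_ALL     (SKETCH_F_SRC_PCI | SKETCH_F_DST_PCI)

enum sketch_opcode { SKETCH_COPY = 1, SKETCH_FILL = 2 };

/* Hypothetical descriptor: one opcode field plus address-space bits. */
struct sketch_desc {
	uint8_t opcode;
	uint8_t addr_space; /* copy of the relevant flag bits */
	uint64_t src_or_pattern;
	uint64_t dst;
	uint32_t len;
};

/* Copy covers mem-mem, mem-pci, pci-mem and pci-pci through two flags. */
static struct sketch_desc
sketch_build_copy(uint64_t src, uint64_t dst, uint32_t len, uint32_t flags)
{
	struct sketch_desc d = { SKETCH_COPY, 0, src, dst, len };

	assert((flags & ~SKETCH_F_ALL) == 0); /* reject unknown flag bits */
	d.addr_space = (uint8_t)(flags & SKETCH_F_ALL);
	return d;
}

/* Fill has no source address, so only the destination bit is kept. */
static struct sketch_desc
sketch_build_fill(uint64_t pattern, uint64_t dst, uint32_t len, uint32_t flags)
{
	struct sketch_desc d = { SKETCH_FILL, 0, pattern, dst, len };

	assert((flags & ~SKETCH_F_ALL) == 0);
	d.addr_space = (uint8_t)(flags & SKETCH_F_DST_PCI);
	return d;
}
```

Adding another address space would add a flag value under this scheme,
not another row of functions.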

/Bruce
Bruce Richardson June 23, 2021, 2:56 p.m. UTC | #50
This is developing into quite a long discussion with multiple threads
ongoing at the same time. Since it's getting relatively hard to follow (at
least for me), can I suggest that we actually hold a call to discuss
"dmadev" and to move things along? Since most of the discussion
participants are, I believe, in the eastern timezones, can I suggest 8AM UTC
as a suitable timeslot (which would be 9AM Irish time, 1:30PM India and 4PM
PRC time)? Would 8AM UTC on Friday suit people? The usual tool for such
community discussion is Jitsi (meet.jit.si/DPDK), so I would suggest
re-using that for this discussion.

Can anyone interested in participating in this discussion let me know
[offlist, so we don't spam everyone], and I can try and co-ordinate if
everyone is ok with the above suggested timeslot and send out a calendar
invite.

Regards,
/Bruce

On Wed, Jun 23, 2021 at 11:50:48AM +0800, fengchengwen wrote:
> On 2021/6/23 1:51, Jerin Jacob wrote:
> > On Fri, Jun 18, 2021 at 2:22 PM fengchengwen <fengchengwen@huawei.com> wrote:
> >>
> >> On 2021/6/17 22:18, Bruce Richardson wrote:
> >>> On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
> >>>> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
> >>>>> On 2021/6/17 1:31, Bruce Richardson wrote:
> >>>>>> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> >>>>>>> On 2021/6/16 0:38, Bruce Richardson wrote:
> >>>>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>>>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
> >>>>>>>>> device.
> >>>>>>>>>
> >>>>>>>>> The APIs of dmadev library exposes some generic operations which can
> >>>>>>>>> enable configuration and I/O with the DMA devices.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> >>>>>>>>> ---
> >>>>>>>> Thanks for sending this.
> >>>>>>>>
> >>>>>>>> Of most interest to me right now are the key data-plane APIs. While we are
> >>>>>>>> still in the prototyping phase, below is a draft of what we are thinking
> >>>>>>>> for the key enqueue/perform_ops/completed_ops APIs.
> >>>>>>>>
> >>>>>>>> Some key differences I note in below vs your original RFC:
> >>>>>>>> * Use of void pointers rather than iova addresses. While using iova's makes
> >>>>>>>>   sense in the general case when using hardware, in that it can work with
> >>>>>>>>   both physical addresses and virtual addresses, if we change the APIs to use
> >>>>>>>>   void pointers instead it will still work for DPDK in VA mode, while at the
> >>>>>>>>   same time allow use of software fallbacks in error cases, and also a stub
> >>>>>>>>   driver than uses memcpy in the background. Finally, using iova's makes the
> >>>>>>>>   APIs a lot more awkward to use with anything but mbufs or similar buffers
> >>>>>>>>   where we already have a pre-computed physical address.
> >>>>>>>
> >>>>>>> The iova is an hint to application, and widely used in DPDK.
> >>>>>>> If switch to void, how to pass the address (iova or just va ?)
> >>>>>>> this may introduce implementation dependencies here.
> >>>>>>>
> >>>>>>> Or always pass the va, and the driver performs address translation, and this
> >>>>>>> translation may cost too much cpu I think.
> >>>>>>>
> >>>>>>
> >>>>>> On the latter point, about driver doing address translation I would agree.
> >>>>>> However, we probably need more discussion about the use of iova vs just
> >>>>>> virtual addresses. My thinking on this is that if we specify the API using
> >>>>>> iovas it will severely hurt usability of the API, since it forces the user
> >>>>>> to take more inefficient codepaths in a large number of cases. Given a
> >>>>>> pointer to the middle of an mbuf, one cannot just pass that straight as an
> >>>>>> iova but must instead do a translation into offset from mbuf pointer and
> >>>>>> then readd the offset to the mbuf base address.
> >>>>>>
> >>>>>> My preference therefore is to require the use of an IOMMU when using a
> >>>>>> dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
> >>>>>> present, DPDK will run in VA mode, allowing virtual addresses to our
> >>>>>> hugepage memory to be sent directly to hardware. Also, when using
> >>>>>> dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
> >>>>>> management for the app, removing further the restrictions on what memory
> >>>>>> can be addressed by hardware.
> >>>>>
> >>>>> Some DMA devices many don't support IOMMU or IOMMU bypass default, so driver may
> >>>>> should call rte_mem_virt2phy() do the address translate, but the rte_mem_virt2phy()
> >>>>> cost too many CPU cycles.
> >>>>>
> >>>>> If the API defined as iova, it will work fine in:
> >>>>> 1) If DMA don't support IOMMU or IOMMU bypass, then start application with
> >>>>>    --iova-mode=pa
> >>>>> 2) If DMA support IOMMU, --iova-mode=pa/va work both fine
> >>>>>
> >>>>
> >>>> I suppose if we keep the iova as the datatype, we can just cast "void *"
> >>>> pointers to that in the case that virtual addresses can be used directly. I
> >>>> believe your RFC included a capability query API - "uses void * as iova"
> >>>> should probably be one of those capabilities, and that would resolve this.
> >>>> If DPDK is in iova=va mode because of the presence of an iommu, all drivers
> >>>> could report this capability too.
> >>>>
> >>>>>>
> >>>>>>>> * Use of id values rather than user-provided handles. Allowing the user/app
> >>>>>>>>   to manage the amount of data stored per operation is a better solution, I
> >>>>>>>>   feel than proscribing a certain about of in-driver tracking. Some apps may
> >>>>>>>>   not care about anything other than a job being completed, while other apps
> >>>>>>>>   may have significant metadata to be tracked. Taking the user-context
> >>>>>>>>   handles out of the API also makes the driver code simpler.
> >>>>>>>
> >>>>>>> The user-provided handle was mainly used to simply application implementation,
> >>>>>>> It provides the ability to quickly locate contexts.
> >>>>>>>
> >>>>>>> The "use of id values" seem like the dma_cookie of Linux DMA engine framework,
> >>>>>>> user will get a unique dma_cookie after calling dmaengine_submit(), and then
> >>>>>>> could use it to call dma_async_is_tx_complete() to get completion status.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, the idea of the id is the same - to locate contexts. The main
> >>>>>> difference is that if we have the driver manage contexts or pointer to
> >>>>>> contexts, as well as giving more work to the driver, it complicates the APIs
> >>>>>> for measuring completions. If we use an ID-based approach, where the app
> >>>>>> maintains its own ring of contexts (if any), it avoids the need to have an
> >>>>>> "out" parameter array for returning those contexts, which needs to be
> >>>>>> appropriately sized. Instead we can just report that all ids up to N are
> >>>>>> completed. [This would be similar to your suggestion that N jobs be
> >>>>>> reported as done, in that no contexts are provided, it's just that knowing
> >>>>>> the ID of what is completed is generally more useful than the number (which
> >>>>>> can be obviously got by subtracting the old value)]
> >>>>>>
> >>>>>> We are still working on prototyping all this, but would hope to have a
> >>>>>> functional example of all this soon.
> >>>>>>
> >>>>>>> How about define the copy prototype as following:
> >>>>>>>   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
> >>>>>>> while the dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
> >>>>>>> enqueue successful else fail.
> >>>>>>> when complete the dmadev will return latest completed dma_cookie, and the
> >>>>>>> application could use the dma_cookie to quick locate contexts.
> >>>>>>>
> >>>>>>
> >>>>>> If I understand this correctly, I believe this is largely what I was
> >>>>>> suggesting - just with the typedef for the type? In which case it obviously
> >>>>>> looks good to me.
> >>>>>>
> >>>>>>>> * I've kept a single combined API for completions, which differs from the
> >>>>>>>>   separate error handling completion API you propose. I need to give the
> >>>>>>>>   two function approach a bit of thought, but likely both could work. If we
> >>>>>>>>   (likely) never expect failed ops, then the specifics of error handling
> >>>>>>>>   should not matter that much.
> >>>>>>>
> >>>>>>> The rte_ioat_completed_ops API is too complex, and consider some applications
> >>>>>>> may never copy fail, so split them as two API.
> >>>>>>> It's indeed not friendly to other scenarios that always require error handling.
> >>>>>>>
> >>>>>>> I prefer use completed operations number as return value other than the ID so
> >>>>>>> that application could simple judge whether have new completed operations, and
> >>>>>>> the new prototype:
> >>>>>>>  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> >>>>>>>
> >>>>>>> 1) for normal case which never expect failed ops:
> >>>>>>>    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
> >>>>>>> 2) for other case:
> >>>>>>>    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
> >>>>>>>    at this point the fails <= ret <= max_status
> >>>>>>>
> >>>>>> Completely agree that we need to plan for the happy-day case where all is
> >>>>>> passing. Looking at the prototypes you have above, I am ok with returning
> >>>>>> number of completed ops as the return value with the final completed cookie
> >>>>>> as an "out" parameter.
> >>>>>> For handling errors, I'm ok with what you propose above, just with one
> >>>>>> small adjustment - I would remove the restriction that ret <= max_status.
> >>>>>>
> >>>>>> In case of zero-failures, we can report as many ops succeeding as we like,
> >>>>>> and even in case of failure, we can still report as many successful ops as
> >>>>>> we like before we start filling in the status field. For example, if 32 ops
> >>>>>> are completed, and the last one fails, we can just fill in one entry into
> >>>>>> status, and return 32. Alternatively if the 4th last one fails we fill in 4
> >>>>>> entries and return 32. The only requirements would be:
> >>>>>> * fails <= max_status
> >>>>>> * fails <= ret
> >>>>>> * cookie holds the id of the last entry in status.
> >>>>>
> >>>>> I think we understand the same:
> >>>>>
> >>>>> The fails <= ret <= max_status include following situation:
> >>>>> 1) If max_status is 32, and there are 32 completed ops, then the ret will be 32
> >>>>> no matter which ops is failed
> >>>>> 2) If max_status is 33, and there are 32 completed ops, then the ret will be 32
> >>>>> 3) If max_status is 16, and there are 32 completed ops, then the ret will be 16
> >>>>>
> >>>>> and the cookie always hold the id of the last returned completed ops, no matter
> >>>>> it's completed successful or failed
> >>>>>
> >>>>
> >>>> I actually disagree on the #3. If max_status is 16, there are 32 completed
> >>>> ops, and *no failures* the ret will be 32, not 16, because we are not
> >>>> returning any status entries so max_status need not apply. Keeping that
> >>>> same scenario #3, depending on the number of failures and the point of
> >>>> them, the return value may similarly vary, for example:
> >>>> * if job #28 fails, then ret could still be 32, cookie would be the cookie
> >>>>   for that job, "fails" parameter would return as 4, with status holding the
> >>>>   failure of 28 plus the succeeded status of jobs 29-31, i.e. 4 elements.
> >>>> * if job #5 fails, then we can't fit the status list from 5 though 31 in an
> >>>>   array of 16, so "fails" == 16(max_status) and status contains the 16
> >>>>   statuses starting from #5, which means that cookie contains the value for
> >>>>   job #20 and ret is 21.
> >>>>
> >>>> In other words, ignore max_status and status parameters *unless we have an
> >>>> error to return*, meaning the fast-path/happy-day case works as fast as
> >>>> possible. You don't need to worry about sizing your status array to be big,
> >>>> and you always get back a large number of completions when available. Your
> >>>> fastpath code only need check the "fails" parameter to see if status needs
> >>>> to ever be consulted, and in normal case it doesn't.
> >>>>
> >>>> If this is too complicated, maybe we can simplify a little by returning just
> >>>> one failure at a time, though at the cost of making error handling slower?
> >>>>
> >>>> rte_dmadev_completed(dev_id, &cookie, &failure_status)
> >>>>
> >>>> In this case, we always return the number of completed ops on success,
> >>>> while on failure, we return the first error code. For a single error, this
> >>>> works fine, but if we get a burst of errors together, things will work
> >>>> slower - which may be acceptable if errors are very rare. However, for idxd
> >>>> at least if a fence occurs after a failure all jobs in the batch after the
> >>>> fence would be skipped, which would lead to the "burst of errors" case.
> >>>> Therefore, I'd prefer to have the original suggestion allowing multiple
> >>>> errors to be reported at a time.
> >>>>
> >>>> /Bruce
> >>>
> >>> Apologies for self-reply, but thinking about it more, a combination of
> >>> normal-case and error-case APIs may be just simpler:
> >>>
> >>> int rte_dmadev_completed(dev_id, &cookie)
> >>>
> >>> returns number of items completed and cookie of last item. If there is an
> >>> error, returns all successfull values up to the error entry and returns -1
> >>> on subsequent call.
> >>>
> >>> int rte_dmadev_completed_status(dev_id, &cookie, max_status, status_array,
> >>>       &error_count)
> >>>
> >>> this is a slower completion API which behaves like you originally said
> >>> above, returning number of completions x, 0 <= x <= max_status, with x
> >>> status values filled into array, and the number of unsuccessful values in
> >>> the error_count value.
> >>>
> >>> This would allow code to be written in the application to use
> >>> rte_dmadev_completed() in the normal case, and on getting a "-1" value, use
> >>> rte_dmadev_completed_status() to get the error details. If strings of
> >>> errors might be expected, the app can continually use the
> >>> completed_status() function until error_count returns 0, and then switch
> >>> back to the faster/simpler version.
> >>
> >> This two-function simplify the status_array's maintenance because we don't need init it to zero.
> >> I think it's a good trade-off between performance and rich error info (status code).
> >>
> >> Here I'd like to discuss the 'burst size', which is widely used in DPDK application (e.g.
> >> nic polling or ring en/dequeue).
> >> Currently we don't define a max completed ops in rte_dmadev_completed() API, the return
> >> value may greater than 'burst size' of application, this may result in the application need to
> >> maintain (or remember) the return value of the function and special handling at the next poll.
> >>
> >> Also consider there may multiple calls rte_dmadev_completed to check fail, it may make it
> >> difficult for the application to use.
> >>
> >> So I prefer following prototype:
> >>   uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error)
> >>     -- nb_cpls: indicate max process operations number
> >>     -- has_error: indicate if there is an error
> >>     -- return value: the number of successful completed operations.
> >>     -- example:
> >>        1) If there are already 32 completed ops, and 4th is error, and nb_cpls is 32, then
> >>           the ret will be 3(because 1/2/3th is OK), and has_error will be true.
> >>        2) If there are already 32 completed ops, and all successful completed, then the ret
> >>           will be min(32, nb_cpls), and has_error will be false.
> >>        3) If there are already 32 completed ops, and all failed completed, then the ret will
> >>           be 0, and has_error will be true.
> >>   uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_status, uint32_t *status)
> >>     -- return value: the number of failed completed operations.
> > 
> > 
> > 
> > In typical storage use cases etc, Sometimes application need to
> > provide scatter-gather list,
> > At least in our hardware sg list gives a "single completion result"
> > and it stops on the first failure to restart
> > the transfer by application. Have you thought of scatter-gather use
> > case and how it is in other  HW?
> 
> cookie and request are in a one-to-one correspondence, whether the request is a single or sg-list.
> Kunpeng9x0 don't support sg-list, I'm still investigating other hardware.
> 
> The above 'restart the transfer by application' mean re-schedule request (and have one new cookie) or
> just re-enable current failed request (this may introduce new API) ?
> 
> > 
> > prototype like the following works for us:
> > rte_dmadev_enq_sg(void **src, void **dest, unsigned int **length, int
> > nb_segments, cookie, ,,,)
> 
> OK, we could define one scatter-list struct to wrap src/dest/length.
> 
> > 
> > 
> >>
> >> The application use the following invocation order when polling:
> >>   has_error = false; // could be init to false by dmadev API, we need discuss
> >>   ret = rte_dmadev_completed(dev_id, &cookie, burst_size, &has_error);
> >>   // process successful completed case:
> >>   for (int i = 0; i < ret; i++) {
> >>   }
> >>   if (unlikely(has_error)) {
> >>     // process failed completed case
> >>     ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret, status_array);
> >>     for (int i = 0; i < ret; i++) {
> >>       // ...
> >>     }
> >>   }
> >>
> >>
> >>>
> >>> This two-function approach also allows future support for other DMA
> >>> functions such as comparison, where a status value is always required. Any
> >>> apps using that functionality would just always use the "_status" function
> >>> for completions.
> >>>
> >>> /Bruce
> >>>
> >>> .
> >>>
> >>
> > 
> > .
> > 
>
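For reference, the two-function completion semantics quoted above can be modelled in software. This is only a sketch under stated assumptions: `struct mock_dma`, `mock_completed()` and `mock_completed_status()` are hypothetical names, not part of any dmadev API, and the status call is assumed to consume consecutive failed ops only, which is one possible reading of "the number of failed completed operations".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t dma_cookie_t;

/* Hypothetical software model of a completion queue; not a real driver. */
struct mock_dma {
	const bool *failed;  /* failed[i] is true if completed op i failed */
	uint16_t nb_done;    /* number of ops the "hardware" has finished */
	uint16_t next;       /* first op not yet reported to the application */
};

/* Fast path: report up to nb_cpls successful ops, stop at the first failure. */
static uint16_t
mock_completed(struct mock_dma *d, dma_cookie_t *cookie, uint16_t nb_cpls,
	       bool *has_error)
{
	uint16_t n = 0;

	assert(d != NULL && cookie != NULL && has_error != NULL);
	*has_error = false;
	while (n < nb_cpls && d->next < d->nb_done) {
		if (d->failed[d->next]) {
			*has_error = true;
			break;
		}
		*cookie = (dma_cookie_t)d->next++;
		n++;
	}
	return n;
}

/* Slow path (assumed semantics): consume consecutive failed ops only. */
static uint16_t
mock_completed_status(struct mock_dma *d, dma_cookie_t *cookie,
		      uint16_t nb_status, uint32_t *status)
{
	uint16_t n = 0;

	assert(d != NULL && cookie != NULL && status != NULL);
	while (n < nb_status && d->next < d->nb_done && d->failed[d->next]) {
		status[n++] = 1; /* placeholder non-zero error code */
		*cookie = (dma_cookie_t)d->next++;
	}
	return n;
}
```

With 32 completed ops where the 4th failed, the first call returns 3 with has_error set, the status call then returns 1, and polling resumes on the fast path.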
Jerin Jacob June 24, 2021, 6:49 a.m. UTC | #51
On Wed, Jun 23, 2021 at 7:50 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Wed, Jun 23, 2021 at 05:10:22PM +0530, Jerin Jacob wrote:
> > On Wed, Jun 23, 2021 at 3:07 PM Bruce Richardson
> > <bruce.richardson@intel.com> wrote:
> > >
> > > On Wed, Jun 23, 2021 at 12:51:07PM +0530, Jerin Jacob wrote:
> > > > On Wed, Jun 23, 2021 at 9:00 AM fengchengwen <fengchengwen@huawei.com> wrote:
> > > > >
> >
> > > > >
> > > > > Currently, it is hard to define generic dma descriptor, I think the well-defined
> > > > > APIs is feasible.
> > > >
> > > > I would like to understand why not feasible? if we move the
> > > > preparation to the slow path.
> > > >
> > > > i.e
> > > >
> > > > struct rte_dmadev_desc defines all the "attributes" of all DMA devices available
> > > > using capability. I believe with the scheme, we can scale and
> > > > incorporate all features of
> > > > all DMA HW without any performance impact.
> > > >
> > > > something like:
> > > >
> > > > struct rte_dmadev_desc {
> > > >   /* Attributes all DMA transfer available for all HW under capability. */
> > > >   channel or port;
> > > >   ops ; // copy, fill etc..
> > > >  /* impemention opqueue memory as zero length array,
> > > > rte_dmadev_desc_prep() update this memory with HW specific information
> > > > */
> > > >   uint8_t impl_opq[];
> > > > }
> > > >
> > > > // allocate the memory for dma decriptor
> > > > struct rte_dmadev_desc *rte_dmadev_desc_alloc(devid);
> > > > // Convert DPDK specific descriptors to HW specific descriptors in slowpath */
> > > > rte_dmadev_desc_prep(devid, struct rte_dmadev_desc *desc);
> > > > // Free dma descriptor memory
> > > > rte_dmadev_desc_free(devid, struct rte_dmadev_desc *desc )
> > > >
> > > > The above calls in slow path.
> > > >
> > > > Only below call in fastpath.
> > > > // Here desc can be NULL(in case you don't need any specific attribute
> > > > attached to transfer, if needed, it can be an object which is gone
> > > > through rte_dmadev_desc_prep())
> > > > rte_dmadev_enq(devid, struct rte_dmadev_desc *desc, void *src, void
> > > > *dest, unsigned int len, cookie)
> > > >
> > >
> > > The trouble here is the performance penalty due to building up and tearing
> > > down structures and passing those structures into functions via function
> > > pointer. With the APIs for enqueue/dequeue that have been discussed here,
> > > all parameters will be passed in registers, and then each driver can do a
> > > write of the actual hardware descriptor straight to cache/memory from
> > > registers. With the scheme you propose above, the register contains a
> > > pointer to the data which must then be loaded into the CPU before being
> > > written out again. This increases our offload cost.
> >
> > See below.
> >
> > >
> > > However, assuming that the desc_prep call is just for slowpath or
> > > initialization time, I'd be ok to have the functions take an extra
> > > hw-specific parameter for each call prepared with tx_prep. It would still
> > > allow all other parameters to be passed in registers. How much data are you
> > > looking to store in this desc struct? It can't all be represented as flags,
> > > for example?
> >
> > There is around 128bit of metadata for octeontx2. New HW may
> > completely different metata
> > http://code.dpdk.org/dpdk/v21.05/source/drivers/raw/octeontx2_dma/otx2_dpi_rawdev.h#L149
> >
> > I see following issue with flags scheme:
> >
> > - We need to start populate in fastpath, Since it based on capabality,
> > application needs to have
> > different versions of fastpath code
> > - Not future proof, Not easy add other stuff as needed when new HW
> > comes with new
> > transfer attributes.
> >
> >
>
> Understood. Would the "tx_prep" (or perhaps op_prep)  function you proposed
> solve that problem, if it were passed (along with flags) to the
> enqueue_copy function? i.e. would the below work for you, and if so, what
> parameters would you see passed to the prep function?

The prototype below looks good to me. But we need to make sure that the
items encoded in "flags" are supported by all PMDs, i.e. they should be
generic. The application should not need a capability check in the
fastpath for flags.

>
> metad = rte_dma_op_prep(dev_id, ....)
>
> rte_dma_enqueue_copy(dev_id, src, dst, len, flags, metad)
>
>
> > >
> > > As for the individual APIs, we could do a generic "enqueue" API, which
> > > takes the op as a parameter, I prefer having each operation as a separate
> > > function, in order to increase the readability of the code and to reduce
> >
> > Only issue I see, all application needs have two path for doing the stuff,
> > one with _prep() and separate function() and drivers need to support both.
> >
> If prep is not called per-op, we could always mandate it be called before
> the actual enqueue functions, even if some drivers ignore the argument.

That works.
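A rough sketch of how I read this split, with the prep done once in the slow path and the result passed as the sixth argument of the copy enqueue. Everything named `sketch_*` is hypothetical and only for illustration, the 128-bit metadata size is taken from the octeontx2 figure mentioned above, and the memcpy stands in for the real hardware descriptor write.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int32_t dma_cookie_t;

/* Hypothetical opaque metadata, built once in the slow path. */
struct sketch_op_meta {
	uint64_t hw_words[2];  /* about 128 bits of pre-encoded HW fields */
};

/* Hypothetical device state for a software stub. */
struct sketch_dev {
	dma_cookie_t next_cookie;
	uint64_t last_hw_words[2];
};

/* Slow path: translate generic attributes into the HW-specific layout. */
static void
sketch_op_prep(struct sketch_op_meta *m, uint64_t w0, uint64_t w1)
{
	assert(m != NULL);
	m->hw_words[0] = w0;
	m->hw_words[1] = w1;
}

/* Fast path: six arguments, so they can all be passed in registers. */
static dma_cookie_t
sketch_enqueue_copy(struct sketch_dev *dev, const void *src, void *dst,
		    size_t len, uint64_t flags,
		    const struct sketch_op_meta *metad)
{
	assert(dev != NULL);
	(void)flags;  /* generic flags, ignored by this stub */
	if (metad != NULL) {
		/* A driver that needs metadata copies the prebuilt words. */
		dev->last_hw_words[0] = metad->hw_words[0];
		dev->last_hw_words[1] = metad->hw_words[1];
	}
	memcpy(dst, src, len);  /* stand-in for the hardware transfer */
	return dev->next_cookie++;
}
```

A driver that has no use for the metadata simply ignores the last argument, so the application keeps a single fastpath.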

>
> > > the number of parameters needed per function i.e. thereby saving registers
> > > needing to be used and potentially making the function calls and offload
> >
> > My worry is, struct rte_dmadev can hold only function pointers for <=
> > 8 fastpath functions for 64B cache line.
> > When you say new op, say fill, need a new function, What will be the
> > change wrt HW
> > driver point of view? Is it updating HW descriptor with op as _fill_
> > vs _copy_? something beyond that?
>
> Well, from a user view-point each operation takes different parameters, so
> for a fill operation, you have a destination address, but instead of a
> source address for copy you have pattern for fill.

OK.

>
> Internally, for those two ops, the only different in input and descriptor
> writing is indeed in the op flag, and no additional context or metadata is
> needed for a copy other than source, address and length (+ plus maybe some
> flags e.g. for caching behaviour or the like), so having the extra prep
> function adds no value for us, and loading data from a prebuilt structure
> just adds more IO overhead. Therefore, I'd ok to add it for enabling other
> hardware, but only in such a way as it doesn't impact the offload cost.
>
> If we want to in future look at adding more advanced or complex
> capabilities, I'm ok for adding a general "enqueue_op" function which takes
> multiple op types, but for very simple ops like copy, where we need to keep
> the offload cost down to a minimum, having fastpath specific copy functions
> makes a lot of sense to me.

OK.

>
> > If it is about, HW descriptor update, then _prep() can do all work,
> > just driver need to copy desc to
> > to HW.
> >
> > I believe upto to 6 arguments passed over registers in x86(it is 8 in
> > arm64). if so,
> > the desc pointer(already populated in HW descriptor format by _prep())
> > is in register, and
> > would  be simple 64bit/128bit copy from desc pointer to HW memory on
> > driver enq(). I dont see
> > any overhead on that, On other side, we if keep adding arguments, it
> > will spill out
> > to stack.
> >
> For a copy operation, we should never need more than 6 arguments - see
> proposal above which has 6 including a set of flags and arbitrary void *
> pointer for extensibility. If anything more complex than that is needed,
> the generic "enqueue_op" function can be used instead. Let's fast-path the
> common, simple case, since that what is most likely to be used most!

OK.

>
> /Bruce
Jerin Jacob June 24, 2021, 7:03 a.m. UTC | #52
On Wed, Jun 23, 2021 at 9:20 AM fengchengwen <fengchengwen@huawei.com> wrote:
>

> >>
> >> So I prefer following prototype:
> >>   uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error)
> >>     -- nb_cpls: indicate max process operations number
> >>     -- has_error: indicate if there is an error
> >>     -- return value: the number of successful completed operations.
> >>     -- example:
> >>        1) If there are already 32 completed ops, and 4th is error, and nb_cpls is 32, then
> >>           the ret will be 3(because 1/2/3th is OK), and has_error will be true.
> >>        2) If there are already 32 completed ops, and all successful completed, then the ret
> >>           will be min(32, nb_cpls), and has_error will be false.
> >>        3) If there are already 32 completed ops, and all failed completed, then the ret will
> >>           be 0, and has_error will be true.
> >>   uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_status, uint32_t *status)
> >>     -- return value: the number of failed completed operations.
> >
> >
> >
> > In typical storage use cases etc, Sometimes application need to
> > provide scatter-gather list,
> > At least in our hardware sg list gives a "single completion result"
> > and it stops on the first failure to restart
> > the transfer by application. Have you thought of scatter-gather use
> > case and how it is in other  HW?
>
> cookie and request are in a one-to-one correspondence, whether the request is a single or sg-list.
> Kunpeng9x0 don't support sg-list, I'm still investigating other hardware.
>
> The above 'restart the transfer by application' mean re-schedule request (and have one new cookie) or
> just re-enable current failed request (this may introduce new API) ?
>
> >
> > prototype like the following works for us:
> > rte_dmadev_enq_sg(void **src, void **dest, unsigned int **length, int
> > nb_segments, cookie, ,,,)
>
> OK, we could define one scatter-list struct to wrap src/dest/length.


Inspired by the following system call [1]:

[1]
https://man7.org/linux/man-pages/man2/process_vm_readv.2.html

I propose the following style of syntax for the sg list:

struct rte_dma_iovec {
    void  *iov_base;    /* Starting address */
    size_t iov_len;     /* Number of bytes to transfer */
};

rte_dmadev_enq_sg(const struct rte_dma_iovec  *src_iov, unsigned int srcvcnt,
const struct rte_dma_iovec  *dst_iov, unsigned int dstvcnt, ....)

The reason for keeping a separate iov_len for src and dest is to cover the
many-to-one and one-to-many cases. Example: copying 15 source segments of
2MB each to one 30MB dest. This is quite useful in storage use cases.
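To make the intended semantics concrete, here is a minimal software model of such an sg copy. It is a sketch only: `sketch_sg_copy()` is a hypothetical helper that uses memcpy in place of the hardware, and it assumes the data is treated as one byte stream that crosses segment boundaries on both sides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct rte_dma_iovec {
	void  *iov_base;    /* Starting address */
	size_t iov_len;     /* Number of bytes to transfer */
};

/* Copy the src segments into the dst segments; returns bytes copied. */
static size_t
sketch_sg_copy(const struct rte_dma_iovec *src_iov, unsigned int srcvcnt,
	       const struct rte_dma_iovec *dst_iov, unsigned int dstvcnt)
{
	unsigned int si = 0, di = 0;
	size_t soff = 0, doff = 0, total = 0;

	assert(src_iov != NULL && dst_iov != NULL);
	while (si < srcvcnt && di < dstvcnt) {
		size_t sleft = src_iov[si].iov_len - soff;
		size_t dleft = dst_iov[di].iov_len - doff;
		size_t n = sleft < dleft ? sleft : dleft;

		memcpy((char *)dst_iov[di].iov_base + doff,
		       (const char *)src_iov[si].iov_base + soff, n);
		total += n;
		soff += n;
		doff += n;
		/* Advance to the next segment once the current one is used up. */
		if (soff == src_iov[si].iov_len) {
			si++;
			soff = 0;
		}
		if (doff == dst_iov[di].iov_len) {
			di++;
			doff = 0;
		}
	}
	return total;
}
```

The same loop handles many-to-one and one-to-many, since neither side needs to know how the other side is segmented.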


>
> >
> >
> >>
> >> The application use the following invocation order when polling:
> >>   has_error = false; // could be init to false by dmadev API, we need discuss
> >>   ret = rte_dmadev_completed(dev_id, &cookie, burst_size, &has_error);
> >>   // process successful completed case:
> >>   for (int i = 0; i < ret; i++) {
> >>   }
> >>   if (unlikely(has_error)) {
> >>     // process failed completed case
> >>     ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret, status_array);
> >>     for (int i = 0; i < ret; i++) {
> >>       // ...
> >>     }
> >>   }
> >>
> >>
> >>>
> >>> This two-function approach also allows future support for other DMA
> >>> functions such as comparison, where a status value is always required. Any
> >>> apps using that functionality would just always use the "_status" function
> >>> for completions.
> >>>
> >>> /Bruce
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>
Morten Brørup June 24, 2021, 7:59 a.m. UTC | #53
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> Sent: Thursday, 24 June 2021 09.03
> 
> On Wed, Jun 23, 2021 at 9:20 AM fengchengwen <fengchengwen@huawei.com>
> wrote:
> >
> 
> > >
> > > prototype like the following works for us:
> > > rte_dmadev_enq_sg(void **src, void **dest, unsigned int **length,
> int
> > > nb_segments, cookie, ,,,)
> >
> > OK, we could define one scatter-list struct to wrap src/dest/length.
> 
> 
> Inspired from following system call [1]
> 
> [1]
> https://man7.org/linux/man-pages/man2/process_vm_readv.2.html
> 
> I propose the following style syntax for the sg list
> 
> struct rte_dma_iovec {
>     void  *iov_base;    /* Starting address */
>     size_t iov_len;     /* Number of bytes to transfer */
> };
> 
> rte_dmadev_enq_sg(const struct rte_dma_iovec  *src_iov, unsigned int
> srcvcnt,
> const struct rte_dma_iovec  *dst_iov, unsigned int dstvcnt, ....)
> 
> The reason for separating iov_len for src and dest is the many to one
> case and
> one to many cases.  Example:
> Copy of Multiple 2MB of 15 source segments to one 30MB dest. Quite use
> full in storage use cases.

The process_vm_readv system call can do more than many-to-one and one-to-many. It allows copying three 4 MB source segments into four 3 MB destination segments or vice versa. I don't know if that is really necessary to support.

We should consider limiting the DMA device API to only provide functions where we can present realistic DPDK application use cases. Otherwise it will be like the NIC device API, where every NIC vendor adds all sorts of exotic functions, only to expose their new and shiny NIC features in DPDK. (I'm exaggerating here - you get the point!)

The main purpose of the DMA device API is to provide a common interface, so an application can call the same function, regardless of the different underlying hardware.

If the API becomes too hardware specific, the API will just become a set of wrappers to specific hardware, and the application will need to consider which DMA device hardware is present and adapt its behavior accordingly. Just like RSS hash functions on the NICs; the application needs to adapt to which RSS hash functions the underlying NICs support.


-Morten
Jerin Jacob June 24, 2021, 8:05 a.m. UTC | #54
On Thu, Jun 24, 2021 at 1:29 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> > Sent: Thursday, 24 June 2021 09.03
> >
> > On Wed, Jun 23, 2021 at 9:20 AM fengchengwen <fengchengwen@huawei.com>
> > wrote:
> > >
> >
> > > >
> > > > prototype like the following works for us:
> > > > rte_dmadev_enq_sg(void **src, void **dest, unsigned int **length,
> > int
> > > > nb_segments, cookie, ,,,)
> > >
> > > OK, we could define one scatter-list struct to wrap src/dest/length.
> >
> >
> > Inspired from following system call [1]
> >
> > [1]
> > https://man7.org/linux/man-pages/man2/process_vm_readv.2.html
> >
> > I propose the following style syntax for the sg list
> >
> > struct rte_dma_iovec {
> >     void  *iov_base;    /* Starting address */
> >     size_t iov_len;     /* Number of bytes to transfer */
> > };
> >
> > rte_dmadev_enq_sg(const struct rte_dma_iovec  *src_iov, unsigned int
> > srcvcnt,
> > const struct rte_dma_iovec  *dst_iov, unsigned int dstvcnt, ....)
> >
> > The reason for separating iov_len for src and dest is the many to one
> > case and
> > one to many cases.  Example:
> > Copy of Multiple 2MB of 15 source segments to one 30MB dest. Quite use
> > full in storage use cases.
>
> The process_vm_readv system call can do more than many-to-one and one-to-many. It allows copying three 4 MB source segments into four 3 MB destination segments or vice versa. I don't know if that is really necessary to support.

We only need to support the case where the sum of the src segment lengths
== the sum of the dest segment lengths. That is the only DMA use case. It
is used in multiple use cases such as storage.
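As a sketch of what that constraint means for an enqueue-time check (the helper names `sketch_iov_total()` and `sketch_sg_lengths_match()` are made up for illustration, and whether the check lives in the library or the driver is left open):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rte_dma_iovec {
	void  *iov_base;    /* Starting address */
	size_t iov_len;     /* Number of bytes to transfer */
};

/* Sum of all segment lengths in one iovec list. */
static size_t
sketch_iov_total(const struct rte_dma_iovec *iov, unsigned int cnt)
{
	size_t total = 0;

	for (unsigned int i = 0; i < cnt; i++)
		total += iov[i].iov_len;
	return total;
}

/* True when the src and dst lists describe the same number of bytes. */
static bool
sketch_sg_lengths_match(const struct rte_dma_iovec *src_iov,
			unsigned int srcvcnt,
			const struct rte_dma_iovec *dst_iov,
			unsigned int dstvcnt)
{
	assert(src_iov != NULL && dst_iov != NULL);
	return sketch_iov_total(src_iov, srcvcnt) ==
	       sketch_iov_total(dst_iov, dstvcnt);
}
```

Note that the three 4 MB to four 3 MB example also passes this check, since both sides sum to 12 MB.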

>
> We should consider limiting the DMA device API to only provide functions where we can present realistic DPDK application use cases. Otherwise it will be like the NIC device API, were every NIC vendor adds all sorts of exotic functions, only to expose their new and shiny NIC features in DPDK. (I'm exaggerating here - you get the point!)
>
> The main purpose of the DMA device API is to provide a common interface, so an application can call the same function, regardless of the different underlying hardware.
>
> If the API becomes too hardware specific, the API will just become a set of wrappers to specific hardware, and the application will need to consider which DMA device hardware is present and adapt its behavior accordingly. Just like RSS hash functions on the NICs; the application needs to adapt to which RSS hash functions the underlying NICs support.

There will be base APIs that work on all HW. We should not rule out
capability-based APIs for advanced usage.


>
>
> -Morten
>
fengchengwen June 24, 2021, 12:19 p.m. UTC | #55
OK, thanks Bruce.

How about next week?
PS: I am still working on V2 and hope it can serve as a basis for discussion.

On 2021/6/23 22:56, Bruce Richardson wrote:
> This is developing into quite a long discussion with multiple threads
> ongoing at the same time. Since it's getting relatively hard to follow (at
> least for me), can I suggest that we actually hold a call to discuss
> "dmadev" and to move things along. Since most of the dicussion participants
> I believe are in the eastern timezones can I suggest 8AM UTC as a suitable
> timeslot (which would be 9AM Irish time, 1:30PM India and 4PM PRC time).
> Would 8AM UTC on Friday suit people? The usual tool for such community
> discussion is Jitsi (meet.jit.si/DPDK), so I would suggest re-using that
> for this discussion.
> 
> Can anyone interested in participating in this discussion let me know
> [offlist, so we don't spam everyone], and I can try and co-ordinate if
> everyone is ok with above suggested timeslot and send out calendar invite.
> 
> Regards,
> /Bruce
> 
> On Wed, Jun 23, 2021 at 11:50:48AM +0800, fengchengwen wrote:
>> On 2021/6/23 1:51, Jerin Jacob wrote:
>>> On Fri, Jun 18, 2021 at 2:22 PM fengchengwen <fengchengwen@huawei.com> wrote:
>>>>
>>>> On 2021/6/17 22:18, Bruce Richardson wrote:
>>>>> On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
>>>>>> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
>>>>>>> On 2021/6/17 1:31, Bruce Richardson wrote:
>>>>>>>> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
>>>>>>>>> On 2021/6/16 0:38, Bruce Richardson wrote:
>>>>>>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
>>>>>>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
>>>>>>>>>>> device.
>>>>>>>>>>>
>>>>>>>>>>> The APIs of dmadev library exposes some generic operations which can
>>>>>>>>>>> enable configuration and I/O with the DMA devices.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>>>>>>>>>>> ---
>>>>>>>>>> Thanks for sending this.
>>>>>>>>>>
>>>>>>>>>> Of most interest to me right now are the key data-plane APIs. While we are
>>>>>>>>>> still in the prototyping phase, below is a draft of what we are thinking
>>>>>>>>>> for the key enqueue/perform_ops/completed_ops APIs.
>>>>>>>>>>
>>>>>>>>>> Some key differences I note in below vs your original RFC:
>>>>>>>>>> * Use of void pointers rather than iova addresses. While using iova's makes
>>>>>>>>>>   sense in the general case when using hardware, in that it can work with
>>>>>>>>>>   both physical addresses and virtual addresses, if we change the APIs to use
>>>>>>>>>>   void pointers instead it will still work for DPDK in VA mode, while at the
>>>>>>>>>>   same time allow use of software fallbacks in error cases, and also a stub
>>>>>>>>>>   driver that uses memcpy in the background. Finally, using iova's makes the
>>>>>>>>>>   APIs a lot more awkward to use with anything but mbufs or similar buffers
>>>>>>>>>>   where we already have a pre-computed physical address.
>>>>>>>>>
>>>>>>>>> The iova is an hint to application, and widely used in DPDK.
>>>>>>>>> If switch to void, how to pass the address (iova or just va ?)
>>>>>>>>> this may introduce implementation dependencies here.
>>>>>>>>>
>>>>>>>>> Or always pass the va, and the driver performs address translation, and this
>>>>>>>>> translation may cost too much cpu I think.
>>>>>>>>>
>>>>>>>>
>>>>>>>> On the latter point, about driver doing address translation I would agree.
>>>>>>>> However, we probably need more discussion about the use of iova vs just
>>>>>>>> virtual addresses. My thinking on this is that if we specify the API using
>>>>>>>> iovas it will severely hurt usability of the API, since it forces the user
>>>>>>>> to take more inefficient codepaths in a large number of cases. Given a
>>>>>>>> pointer to the middle of an mbuf, one cannot just pass that straight as an
>>>>>>>> iova but must instead do a translation into offset from mbuf pointer and
>>>>>>>> then readd the offset to the mbuf base address.
>>>>>>>>
>>>>>>>> My preference therefore is to require the use of an IOMMU when using a
>>>>>>>> dmadev, so that it can be a much closer analog of memcpy. Once an iommu is
>>>>>>>> present, DPDK will run in VA mode, allowing virtual addresses to our
>>>>>>>> hugepage memory to be sent directly to hardware. Also, when using
>>>>>>>> dmadevs on top of an in-kernel driver, that kernel driver may do all iommu
>>>>>>>> management for the app, removing further the restrictions on what memory
>>>>>>>> can be addressed by hardware.
>>>>>>>
>>>>>>> Some DMA devices may not support IOMMU, or may bypass IOMMU by default, so the driver may
>>>>>>> have to call rte_mem_virt2phy() to do the address translation, but rte_mem_virt2phy()
>>>>>>> costs too many CPU cycles.
>>>>>>>
>>>>>>> If the API defined as iova, it will work fine in:
>>>>>>> 1) If DMA don't support IOMMU or IOMMU bypass, then start application with
>>>>>>>    --iova-mode=pa
>>>>>>> 2) If DMA support IOMMU, --iova-mode=pa/va work both fine
>>>>>>>
>>>>>>
>>>>>> I suppose if we keep the iova as the datatype, we can just cast "void *"
>>>>>> pointers to that in the case that virtual addresses can be used directly. I
>>>>>> believe your RFC included a capability query API - "uses void * as iova"
>>>>>> should probably be one of those capabilities, and that would resolve this.
>>>>>> If DPDK is in iova=va mode because of the presence of an iommu, all drivers
>>>>>> could report this capability too.
>>>>>>
>>>>>>>>
>>>>>>>>>> * Use of id values rather than user-provided handles. Allowing the user/app
>>>>>>>>>>   to manage the amount of data stored per operation is a better solution, I
>>>>>>>>>>   feel, than prescribing a certain amount of in-driver tracking. Some apps may
>>>>>>>>>>   not care about anything other than a job being completed, while other apps
>>>>>>>>>>   may have significant metadata to be tracked. Taking the user-context
>>>>>>>>>>   handles out of the API also makes the driver code simpler.
>>>>>>>>>
>>>>>>>>> The user-provided handle was mainly used to simplify application implementation,
>>>>>>>>> It provides the ability to quickly locate contexts.
>>>>>>>>>
>>>>>>>>> The "use of id values" seem like the dma_cookie of Linux DMA engine framework,
>>>>>>>>> user will get a unique dma_cookie after calling dmaengine_submit(), and then
>>>>>>>>> could use it to call dma_async_is_tx_complete() to get completion status.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, the idea of the id is the same - to locate contexts. The main
>>>>>>>> difference is that if we have the driver manage contexts or pointer to
>>>>>>>> contexts, as well as giving more work to the driver, it complicates the APIs
>>>>>>>> for measuring completions. If we use an ID-based approach, where the app
>>>>>>>> maintains its own ring of contexts (if any), it avoids the need to have an
>>>>>>>> "out" parameter array for returning those contexts, which needs to be
>>>>>>>> appropriately sized. Instead we can just report that all ids up to N are
>>>>>>>> completed. [This would be similar to your suggestion that N jobs be
>>>>>>>> reported as done, in that no contexts are provided, it's just that knowing
>>>>>>>> the ID of what is completed is generally more useful than the number (which
>>>>>>>> can be obviously got by subtracting the old value)]
>>>>>>>>
>>>>>>>> We are still working on prototyping all this, but would hope to have a
>>>>>>>> functional example of all this soon.
>>>>>>>>
>>>>>>>>> How about define the copy prototype as following:
>>>>>>>>>   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
>>>>>>>>> while the dma_cookie_t is int32 and is monotonically increasing, when >=0 mean
>>>>>>>>> enqueue successful else fail.
>>>>>>>>> when complete the dmadev will return latest completed dma_cookie, and the
>>>>>>>>> application could use the dma_cookie to quick locate contexts.
>>>>>>>>>
>>>>>>>>
>>>>>>>> If I understand this correctly, I believe this is largely what I was
>>>>>>>> suggesting - just with the typedef for the type? In which case it obviously
>>>>>>>> looks good to me.
>>>>>>>>
>>>>>>>>>> * I've kept a single combined API for completions, which differs from the
>>>>>>>>>>   separate error handling completion API you propose. I need to give the
>>>>>>>>>>   two function approach a bit of thought, but likely both could work. If we
>>>>>>>>>>   (likely) never expect failed ops, then the specifics of error handling
>>>>>>>>>>   should not matter that much.
>>>>>>>>>
>>>>>>>>> The rte_ioat_completed_ops API is too complex, and consider some applications
>>>>>>>>> may never copy fail, so split them as two API.
>>>>>>>>> It's indeed not friendly to other scenarios that always require error handling.
>>>>>>>>>
>>>>>>>>> I prefer use completed operations number as return value other than the ID so
>>>>>>>>> that application could simple judge whether have new completed operations, and
>>>>>>>>> the new prototype:
>>>>>>>>>  uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
>>>>>>>>>
>>>>>>>>> 1) for normal case which never expect failed ops:
>>>>>>>>>    just call: ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
>>>>>>>>> 2) for other case:
>>>>>>>>>    ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
>>>>>>>>>    at this point the fails <= ret <= max_status
>>>>>>>>>
>>>>>>>> Completely agree that we need to plan for the happy-day case where all is
>>>>>>>> passing. Looking at the prototypes you have above, I am ok with returning
>>>>>>>> number of completed ops as the return value with the final completed cookie
>>>>>>>> as an "out" parameter.
>>>>>>>> For handling errors, I'm ok with what you propose above, just with one
>>>>>>>> small adjustment - I would remove the restriction that ret <= max_status.
>>>>>>>>
>>>>>>>> In case of zero-failures, we can report as many ops succeeding as we like,
>>>>>>>> and even in case of failure, we can still report as many successful ops as
>>>>>>>> we like before we start filling in the status field. For example, if 32 ops
>>>>>>>> are completed, and the last one fails, we can just fill in one entry into
>>>>>>>> status, and return 32. Alternatively if the 4th last one fails we fill in 4
>>>>>>>> entries and return 32. The only requirements would be:
>>>>>>>> * fails <= max_status
>>>>>>>> * fails <= ret
>>>>>>>> * cookie holds the id of the last entry in status.
>>>>>>>
>>>>>>> I think we understand the same:
>>>>>>>
>>>>>>> The fails <= ret <= max_status include following situation:
>>>>>>> 1) If max_status is 32, and there are 32 completed ops, then the ret will be 32
>>>>>>> no matter which ops is failed
>>>>>>> 2) If max_status is 33, and there are 32 completed ops, then the ret will be 32
>>>>>>> 3) If max_status is 16, and there are 32 completed ops, then the ret will be 16
>>>>>>>
>>>>>>> and the cookie always hold the id of the last returned completed ops, no matter
>>>>>>> it's completed successful or failed
>>>>>>>
>>>>>>
>>>>>> I actually disagree on the #3. If max_status is 16, there are 32 completed
>>>>>> ops, and *no failures* the ret will be 32, not 16, because we are not
>>>>>> returning any status entries so max_status need not apply. Keeping that
>>>>>> same scenario #3, depending on the number of failures and the point of
>>>>>> them, the return value may similarly vary, for example:
>>>>>> * if job #28 fails, then ret could still be 32, cookie would be the cookie
>>>>>>   for that job, "fails" parameter would return as 4, with status holding the
>>>>>>   failure of 28 plus the succeeded status of jobs 29-31, i.e. 4 elements.
>>>>>> * if job #5 fails, then we can't fit the status list from 5 though 31 in an
>>>>>>   array of 16, so "fails" == 16(max_status) and status contains the 16
>>>>>>   statuses starting from #5, which means that cookie contains the value for
>>>>>>   job #20 and ret is 21.
>>>>>>
>>>>>> In other words, ignore max_status and status parameters *unless we have an
>>>>>> error to return*, meaning the fast-path/happy-day case works as fast as
>>>>>> possible. You don't need to worry about sizing your status array to be big,
>>>>>> and you always get back a large number of completions when available. Your
>>>>>> fastpath code only need check the "fails" parameter to see if status needs
>>>>>> to ever be consulted, and in normal case it doesn't.
>>>>>>
>>>>>> If this is too complicated, maybe we can simplify a little by returning just
>>>>>> one failure at a time, though at the cost of making error handling slower?
>>>>>>
>>>>>> rte_dmadev_completed(dev_id, &cookie, &failure_status)
>>>>>>
>>>>>> In this case, we always return the number of completed ops on success,
>>>>>> while on failure, we return the first error code. For a single error, this
>>>>>> works fine, but if we get a burst of errors together, things will work
>>>>>> slower - which may be acceptable if errors are very rare. However, for idxd
>>>>>> at least if a fence occurs after a failure all jobs in the batch after the
>>>>>> fence would be skipped, which would lead to the "burst of errors" case.
>>>>>> Therefore, I'd prefer to have the original suggestion allowing multiple
>>>>>> errors to be reported at a time.
>>>>>>
>>>>>> /Bruce
>>>>>
>>>>> Apologies for self-reply, but thinking about it more, a combination of
>>>>> normal-case and error-case APIs may be just simpler:
>>>>>
>>>>> int rte_dmadev_completed(dev_id, &cookie)
>>>>>
>>>>> returns number of items completed and cookie of last item. If there is an
>>>>> error, returns all successful values up to the error entry and returns -1
>>>>> on subsequent call.
>>>>>
>>>>> int rte_dmadev_completed_status(dev_id, &cookie, max_status, status_array,
>>>>>       &error_count)
>>>>>
>>>>> this is a slower completion API which behaves like you originally said
>>>>> above, returning number of completions x, 0 <= x <= max_status, with x
>>>>> status values filled into array, and the number of unsuccessful values in
>>>>> the error_count value.
>>>>>
>>>>> This would allow code to be written in the application to use
>>>>> rte_dmadev_completed() in the normal case, and on getting a "-1" value, use
>>>>> rte_dmadev_completed_status() to get the error details. If strings of
>>>>> errors might be expected, the app can continually use the
>>>>> completed_status() function until error_count returns 0, and then switch
>>>>> back to the faster/simpler version.
>>>>
>>>> This two-function simplify the status_array's maintenance because we don't need init it to zero.
>>>> I think it's a good trade-off between performance and rich error info (status code).
>>>>
>>>> Here I'd like to discuss the 'burst size', which is widely used in DPDK application (e.g.
>>>> nic polling or ring en/dequeue).
>>>> Currently we don't define a max completed ops in rte_dmadev_completed() API, the return
>>>> value may greater than 'burst size' of application, this may result in the application need to
>>>> maintain (or remember) the return value of the function and special handling at the next poll.
>>>>
>>>> Also consider there may multiple calls rte_dmadev_completed to check fail, it may make it
>>>> difficult for the application to use.
>>>>
>>>> So I prefer following prototype:
>>>>   uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error)
>>>>     -- nb_cpls: indicate max process operations number
>>>>     -- has_error: indicate if there is an error
>>>>     -- return value: the number of successful completed operations.
>>>>     -- example:
>>>>        1) If there are already 32 completed ops, and 4th is error, and nb_cpls is 32, then
>>>>           the ret will be 3(because 1/2/3th is OK), and has_error will be true.
>>>>        2) If there are already 32 completed ops, and all successful completed, then the ret
>>>>           will be min(32, nb_cpls), and has_error will be false.
>>>>        3) If there are already 32 completed ops, and all failed completed, then the ret will
>>>>           be 0, and has_error will be true.
>>>>   uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_status, uint32_t *status)
>>>>     -- return value: the number of failed completed operations.
>>>
>>>
>>>
>>> In typical storage use cases etc, Sometimes application need to
>>> provide scatter-gather list,
>>> At least in our hardware sg list gives a "single completion result"
>>> and it stops on the first failure to restart
>>> the transfer by application. Have you thought of scatter-gather use
>>> case and how it is in other  HW?
>>
>> cookie and request are in a one-to-one correspondence, whether the request is a single or sg-list.
>> Kunpeng9x0 don't support sg-list, I'm still investigating other hardware.
>>
>> The above 'restart the transfer by application' mean re-schedule request (and have one new cookie) or
>> just re-enable current failed request (this may introduce new API) ?
>>
>>>
>>> prototype like the following works for us:
>>> rte_dmadev_enq_sg(void **src, void **dest, unsigned int **length, int
>>> nb_segments, cookie, ,,,)
>>
>> OK, we could define one scatter-list struct to wrap src/dest/length.
>>
>>>
>>>
>>>>
>>>> The application use the following invocation order when polling:
>>>>   has_error = false; // could be init to false by dmadev API, we need discuss
>>>>   ret = rte_dmadev_completed(dev_id, &cookie, burst_size, &has_error);
>>>>   // process successful completed case:
>>>>   for (int i = 0; i < ret; i++) {
>>>>   }
>>>>   if (unlikely(has_error)) {
>>>>     // process failed completed case
>>>>     ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret, status_array);
>>>>     for (int i = 0; i < ret; i++) {
>>>>       // ...
>>>>     }
>>>>   }
>>>>
>>>>
>>>>>
>>>>> This two-function approach also allows future support for other DMA
>>>>> functions such as comparison, where a status value is always required. Any
>>>>> apps using that functionality would just always use the "_status" function
>>>>> for completions.
>>>>>
>>>>> /Bruce
>>>>>
>>>>> .
>>>>>
>>>>
>>>
>>> .
>>>
>>
> 
> .
>
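The ID-based completion idea quoted above (the application keeps its own ring of contexts and the device only reports the newest completed cookie) can be sketched roughly as follows. This is only an illustration: `ctx_ring`, `ctx_ring_store`, `ctx_ring_advance` and `ctx_ring_lookup` are hypothetical names, the ring size is arbitrary, and cookie wrap-around of the signed type is ignored.

```c
#include <stdint.h>

/* Hypothetical application-side helper; none of these names are dmadev API. */
#define CTX_RING_SIZE 64u               /* assumed power of two */

typedef int32_t dma_cookie_t;           /* as proposed in the thread: >= 0 on success */

struct ctx_ring {
	void *slots[CTX_RING_SIZE];     /* per-job context owned by the application */
	dma_cookie_t last_done;         /* newest cookie already handled, -1 if none */
};

static void ctx_ring_init(struct ctx_ring *r)
{
	r->last_done = -1;
}

/* Remember a context for the job that was enqueued with this cookie. */
static void ctx_ring_store(struct ctx_ring *r, dma_cookie_t cookie, void *ctx)
{
	r->slots[(uint32_t)cookie & (CTX_RING_SIZE - 1)] = ctx;
}

/* Find the context that belongs to a cookie. */
static void *ctx_ring_lookup(const struct ctx_ring *r, dma_cookie_t cookie)
{
	return r->slots[(uint32_t)cookie & (CTX_RING_SIZE - 1)];
}

/*
 * The device reported "all ids up to newest are completed": return how many
 * jobs are new since the previous call (the subtraction mentioned in the
 * discussion) and record the new position.
 */
static uint32_t ctx_ring_advance(struct ctx_ring *r, dma_cookie_t newest)
{
	uint32_t n = (uint32_t)(newest - r->last_done);

	r->last_done = newest;
	return n;
}
```

With this layout no "out" array of contexts has to be sized by the caller; the application walks its own ring from the previous cookie to the newest one.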
fengchengwen June 26, 2021, 3:59 a.m. UTC | #56
Hi, all
  I analyzed the current DPDK DMA drivers and drew up this summary in conjunction
with the previous discussion; it will serve as the basis for the V2 implementation.
  Feedback is welcome, thanks


dpaa2_qdma:
  [probe]: mainly obtains the number of hardware queues.
  [dev_configure]: has the following parameters:
      max_hw_queues_per_core:
      max_vqs: max number of virt-queues
      fle_queue_pool_cnt: the size of the FLE pool
  [queue_setup]: sets up one virt-queue, has the following parameters:
      lcore_id:
      flags: some control params, e.g. sg-list, long-format desc, exclusive HW
             queue...
      rbp: some misc fields which impact the descriptor
      Note: this API returns the index of the virt-queue which was successfully
            set up.
  [enqueue_bufs]: data-plane API, the key fields:
      vq_id: the index of the virt-queue
      job: the pointer to the job array
      nb_jobs:
      Note: one job has src/dest/len/flag/cnxt/status/vq_id/use_elem fields,
            the flag field indicates whether src/dst is a PHY addr.
  [dequeue_bufs]: gets the pointers of the completed jobs

  [key point]:
      ------------    ------------
      |virt-queue|    |virt-queue|
      ------------    ------------
             \           /
              \         /
               \       /
             ------------     ------------
             | HW-queue |     | HW-queue |
             ------------     ------------
                    \            /
                     \          /
                      \        /
                      core/rawdev
      1) In the probe stage, the driver reports how many HW-queues can be used.
      2) The user can specify the maximum number of HW-queues managed by a
         single core in the dev_configure stage.
      3) The user can create one virt-queue via the queue_setup API. A
         virt-queue has two types: a) exclusive HW-queue, b) shared HW-queue
         (as described above); this is selected by the corresponding bit of
         the flags field.
      4) In this mode, queue management is simplified. The user does not need
         to specify which HW-queue to apply for and then create a virt-queue
         on that HW-queue. All the user needs to do is say on which core the
         virt-queue should be created.
      5) Virt-queues can have different capabilities, e.g. virt-queue-0
         supports the scatter-gather format and virt-queue-1 doesn't support
         sg; this is controlled by the flags and rbp fields in the queue_setup
         stage.
      6) The data-plane API uses a definition similar to rte_mbuf and
         rte_eth_rx/tx_burst().
      PS: I still don't understand how sg-list enqueue/dequeue works, or how
          the user is meant to use RTE_QDMA_VQ_NO_RESPONSE.

      Overall, I think it's a flexible design with good scalability. In
      particular, the queue resource pool architecture simplifies user
      invocations, although the 'core' concept is introduced a bit abruptly.


octeontx2_dma:
  [dev_configure]: has one parameter:
      chunk_pool: it's strange that it's not managed internally by the driver,
                  but passed in through the API.
  [enqueue_bufs]: has three important parameters:
      context: this is what Jerin referred to as 'channel', it could hold the
               completed ring of the job.
      buffers: holds the pointer array of dpi_dma_buf_ptr_s
      count: how many dpi_dma_buf_ptr_s
      Note: one dpi_dma_buf_ptr_s may have many src and dst pairs (it's a
            scatter-gather list), and has one completed_ptr (when the HW
            completes it will write one value to this ptr). Currently the
            completed_ptr points to this struct:
                struct dpi_dma_req_compl_s {
                    uint64_t cdata;  /* driver inits this, HW writes the result to it */
                    void (*compl_cb)(void *dev, void *arg);
                    void *cb_data;
                };
  [dequeue_bufs]: has two important parameters:
      context: the driver will scan its completed ring to get completion info.
      buffers: holds the pointer array of completed_ptr.

  [key point]:
      -----------    -----------
      | channel |    | channel |
      -----------    -----------
             \           /
              \         /
               \       /
             ------------
             | HW-queue |
             ------------
                   |
                --------
                |rawdev|
                --------
      1) The user can create one channel by initializing a context
         (dpi_dma_queue_ctx_s); this interface is not standardized and needs
         to be implemented by users.
      2) Different channels can support different transmissions, e.g. one for
         inner m2m, and another for inbound copy.

      Overall, I think the 'channel' is similar to the 'virt-queue' of
      dpaa2_qdma. The difference is that dpaa2_qdma supports multiple hardware
      queues. The 'channel' has the following characteristics:
      1) A channel is an operable unit at the user level. The user can create a
         channel for each transfer type, for example, a local-to-local channel
         and a local-to-host channel. The user can also get the completion
         status of one channel.
      2) Multiple channels can run on the same HW-queue. In terms of API design,
         this reduces the number of data-plane API parameters. The channel can
         hold context info which will be referred to when the data-plane APIs
         execute.


ioat:
  [probe]: creates multiple rawdevs if it's a DSA device and has multiple HW-queues.
  [dev_configure]: has three parameters:
      ring_size: the size of the HW descriptor ring.
      hdls_disable: whether to ignore user-supplied handle params
      no_prefetch_completions:
  [rte_ioat_enqueue_copy]: has dev_id/src/dst/length/src_hdl/dst_hdl parameters.
  [rte_ioat_completed_ops]: has dev_id/max_copies/status/num_unsuccessful/
                            src_hdls/dst_hdls parameters.

  Overall, it's one rawdev per HW-queue, and there are no multiple 'channel's
  like those of octeontx2_dma.


Kunpeng_dma:
  1) The hardware supports multiple modes (e.g. local-to-local/local-to-pciehost/
     pciehost-to-local/immediate-to-local copy).
     Note: Currently, we only implement local-to-local copy.
  2) The hardware supports multiple HW-queues.


Summary:
  1) The dpaa2/octeontx2/Kunpeng are all ARM SoCs, which may act as endpoints
     of an x86 host (e.g. smart NIC), so multiple memory transfer requirements
     may exist, e.g. local-to-local/local-to-host... From the point of view of
     API design, I think we should adopt a similar 'channel' or 'virt-queue'
     concept.
  2) Whether to create a separate dmadev for each HW-queue? We previously
     discussed this, and because HW-queues can be managed independently (like
     Kunpeng_dma and Intel DSA), we preferred creating a separate dmadev for
     each HW-queue. But I'm not sure if that's the case with dpaa. I think
     that can be left to the specific driver; no restriction is imposed at the
     framework API layer.
  3) I think we could set up the following abstraction at the dmadev device:
      ------------    ------------
      |virt-queue|    |virt-queue|
      ------------    ------------
             \           /
              \         /
               \       /
             ------------     ------------
             | HW-queue |     | HW-queue |
             ------------     ------------
                    \            /
                     \          /
                      \        /
                        dmadev
  4) The driver's ops design (here we only list key points):
     [dev_info_get]: mainly returns the number of HW-queues
     [dev_configure]: nothing important
     [queue_setup]: creates one virt-queue, has the following main parameters:
         HW-queue-index: the HW-queue index used
         nb_desc: the number of HW descriptors
         opaque: driver-specific info
         Note1: this API returns the virt-queue index which will be used in
                later APIs. If the user wants to create multiple virt-queues on
                the same HW-queue, this can be achieved by calling queue_setup
                with the same HW-queue-index.
         Note2: I think it's hard to define the queue_setup config parameters,
                and since this is a control API, I think it's OK to use an
                opaque pointer to implement it.
     [dma_copy/memset/sg]: all have a vq_id input parameter.
         Note: I notice dpaa can't support single and sg in one virt-queue, and
               I think this may be a software implementation policy rather than
               a HW restriction, because virt-queues could share the same
               HW-queue.
     Here we use vq_id to tackle different scenarios, like local-to-local/
     local-to-host etc.
  5) And the dmadev public data-plane API (just prototype):
     dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
       -- flags: used as an extended parameter, it could be uint32_t
     dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)
     dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
       -- sg: struct dma_scatterlist array
     uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
                                   uint16_t nb_cpls, bool *has_error)
       -- nb_cpls: indicates the max number of operations to process
       -- has_error: indicates if there is an error
       -- return value: the number of successfully completed operations.
       -- example:
          1) If there are already 32 completed ops, and the 4th is an error,
             and nb_cpls is 32, then the ret will be 3 (because the 1st/2nd/3rd
             are OK), and has_error will be true.
          2) If there are already 32 completed ops, and all completed
             successfully, then the ret will be min(32, nb_cpls), and has_error
             will be false.
          3) If there are already 32 completed ops, and all completed with
             failure, then the ret will be 0, and has_error will be true.
     uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
                                          uint16_t nb_status, uint32_t *status)
       -- return value: the number of failed completed operations.
     And here I agree with Morten: we should design an API which adapts to DPDK
     service scenarios. So we don't support some sound-card DMA, or 2D memory
     copy which is mainly used in video scenarios.
  6) The dma_cookie_t is a signed int type; when <0 it means error. It's
     monotonically increasing based on the HW-queue (rather than the
     virt-queue). The driver needs to ensure this because the dmadev framework
     doesn't manage the dma_cookie's creation.
  7) Because data-plane APIs are not thread-safe, and the user determines the
     virt-queue to HW-queue mapping (at the queue-setup stage), it is the
     user's duty to ensure thread safety.
  8) One example:
     vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
     if (vq_id < 0) {
        // create virt-queue failed
        return;
     }
     // submit memcpy task
     cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
     if (cookie < 0) {
        // submit failed
        return;
     }
     // get complete task
     ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
     if (!has_error && ret == 1) {
        // the memcpy completed successfully
     }
  9) As octeontx2_dma supports sg-lists which have many valid buffers in
     dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
  10) As for ioat, it could declare support for one HW-queue at the
      dev_configure stage, and only support creating one virt-queue.
  11) As for dpaa2_qdma, I think it could migrate to the new framework, but we
      are still waiting for feedback from the dpaa2_qdma guys.
  12) About the src/dst parameters in the rte_dmadev_memcpy API prototype, we
      have two candidates, iova and void *; how about introducing a dma_addr_t
      type which could be va or iova?
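The three rte_dmadev_completed() examples in point 5 can be checked against a small software model. The sketch below is not driver code: `model_completed` is a hypothetical name, the per-op status array stands in for hardware state (0 = success, non-zero = failure), and the cookie output is left out. It also assumes that an error lying beyond the nb_cpls window is not reported until a later call.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the proposed semantics: count successfully completed
 * ops from the head of the completion list, stop at the first failed one, and
 * never look at more than nb_cpls entries.
 */
static uint16_t model_completed(const uint32_t *status, uint16_t nb_done,
				uint16_t nb_cpls, bool *has_error)
{
	uint16_t limit = nb_done < nb_cpls ? nb_done : nb_cpls;
	uint16_t ok = 0;

	*has_error = false;
	while (ok < limit) {
		if (status[ok] != 0) {
			/* first failure inside the window: report it and stop */
			*has_error = true;
			break;
		}
		ok++;
	}
	return ok;
}
```

With 32 completed ops and the 4th one failed this returns 3 with has_error set; with no failures it returns min(32, nb_cpls); with every op failed it returns 0 with has_error set.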
Bruce Richardson June 28, 2021, 10 a.m. UTC | #57
On Sat, Jun 26, 2021 at 11:59:49AM +0800, fengchengwen wrote:
> Hi, all
>   I analyzed the current DPDK DMA drivers and drew up this summary in conjunction
> with the previous discussion; it will serve as the basis for the V2 implementation.
>   Feedback is welcome, thanks
>
Fantastic review and summary, many thanks for the work. Some comments
inline in API part below, but nothing too major, I hope.

/Bruce
 
<snip> 
> 
> Summary:
>   1) The dpaa2/octeontx2/Kunpeng are all ARM SoCs, which may act as endpoints
>      of an x86 host (e.g. smart NIC), so multiple memory transfer requirements
>      may exist, e.g. local-to-local/local-to-host... From the point of view of
>      API design, I think we should adopt a similar 'channel' or 'virt-queue'
>      concept.
>   2) Whether to create a separate dmadev for each HW-queue? We previously
>      discussed this, and because HW-queues can be managed independently (like
>      Kunpeng_dma and Intel DSA), we preferred creating a separate dmadev for
>      each HW-queue. But I'm not sure if that's the case with dpaa. I think
>      that can be left to the specific driver; no restriction is imposed at the
>      framework API layer.
>   3) I think we could set up the following abstraction at the dmadev device:
>       ------------    ------------
>       |virt-queue|    |virt-queue|
>       ------------    ------------
>              \           /
>               \         /
>                \       /
>              ------------     ------------
>              | HW-queue |     | HW-queue |
>              ------------     ------------
>                     \            /
>                      \          /
>                       \        /
>                         dmadev
>   4) The driver's ops design (here we only list key points):
>      [dev_info_get]: mainly returns the number of HW-queues
>      [dev_configure]: nothing important
>      [queue_setup]: creates one virt-queue, has the following main parameters:
>          HW-queue-index: the HW-queue index used
>          nb_desc: the number of HW descriptors
>          opaque: driver-specific info
>          Note1: this API returns the virt-queue index which will be used in
>                 later APIs. If the user wants to create multiple virt-queues on
>                 the same HW-queue, this can be achieved by calling queue_setup
>                 with the same HW-queue-index.
>          Note2: I think it's hard to define the queue_setup config parameters,
>                 and since this is a control API, I think it's OK to use an
>                 opaque pointer to implement it.
I'm not sure an opaque pointer will work in practice, so I think we should try
to standardize the parameters as much as possible. Since it's a control
plane API, using a struct with a superset of parameters may be workable.
Let's start with a minimum set and build up from there.
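To make the "struct with a superset of parameters" idea concrete, here is one possible minimal starting point. Only the HW-queue index and nb_desc come from the summary above; the struct names, the flags field, the info struct and the check function are all hypothetical and only meant as a sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical standardized queue config, starting from a minimum set. */
struct dmadev_queue_conf {
	uint16_t hw_queue_index; /* HW-queue backing the new virt-queue */
	uint16_t nb_desc;        /* number of HW descriptors */
	uint64_t flags;          /* optional features, 0 = none (assumed field) */
};

/* Hypothetical subset of what dev_info_get might report. */
struct dmadev_info_model {
	uint16_t nb_hw_queues;
	uint16_t max_desc;
	uint64_t supported_flags;
};

/*
 * Control-path validation: reject anything the device did not advertise, so
 * that fields added to the superset later fail cleanly on older drivers.
 */
static int queue_conf_check(const struct dmadev_info_model *info,
			    const struct dmadev_queue_conf *conf)
{
	if (conf->hw_queue_index >= info->nb_hw_queues)
		return -EINVAL;
	if (conf->nb_desc == 0 || conf->nb_desc > info->max_desc)
		return -EINVAL;
	if (conf->flags & ~info->supported_flags)
		return -ENOTSUP;
	return 0;
}
```

Because this is control-plane only, the cost of the extra checks does not matter, and new fields can be added to the struct as more drivers are converted.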

>       [dma_copy/memset/sg]: all have a vq_id input parameter.
>          Note: I notice dpaa can't support single and sg in one virt-queue, and
>                I think this may be a software implementation policy rather than
>                a HW restriction, because virt-queues could share the same HW-queue.
Presumably for queues which support sg, the single-enqueue APIs can use a
single sg list internally?
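A rough sketch of that idea, using a software stand-in rather than real hardware: the single-copy call builds a one-element list and goes through the sg path, so both paths hand back exactly one cookie per request. `sg_entry`, `sw_queue`, `sw_memcpy_sg` and `sw_memcpy` are hypothetical names; the thread itself only mentions a "struct dma_scatterlist".

```c
#include <stdint.h>
#include <string.h>

typedef int32_t dma_cookie_t;

/* Hypothetical scatter-gather element: one src/dst/len triple. */
struct sg_entry {
	const void *src;
	void *dst;
	uint32_t len;
};

/* Software stand-in for a virt-queue that only understands sg requests. */
struct sw_queue {
	dma_cookie_t next_cookie;
};

/* One sg request, however many segments, yields exactly one cookie. */
static dma_cookie_t sw_memcpy_sg(struct sw_queue *q,
				 const struct sg_entry *sg, uint16_t sg_len)
{
	for (uint16_t i = 0; i < sg_len; i++)
		memcpy(sg[i].dst, sg[i].src, sg[i].len);
	return q->next_cookie++;
}

/* Single-copy enqueue expressed as a one-element sg list. */
static dma_cookie_t sw_memcpy(struct sw_queue *q, const void *src,
			      void *dst, uint32_t len)
{
	struct sg_entry one = { src, dst, len };

	return sw_memcpy_sg(q, &one, 1);
}
```

The reverse direction (an sg request issued as several single copies) would consume one cookie per segment, which is why it cannot be emulated that way.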

>       Here we use vq_id to tackle different scenarios, like local-to-local/
>       local-to-host etc.
>   5) And the dmadev public data-plane API (just prototype):
>      dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
>        -- flags: used as an extended parameter, it could be uint32_t

Suggest uint64_t rather than uint32_t to ensure we have expansion room?
Otherwise +1

>      dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)
+1

>      dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
>        -- sg: struct dma_scatterlist array
I don't think our drivers will be directly implementing this API, but so
long as SG support is listed as a capability flag I'm fine with this as an
API. [We can't fudge it as a bunch of single copies, because that would
cause us to have multiple cookies rather than one]

>      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
>                                    uint16_t nb_cpls, bool *has_error)
>        -- nb_cpls: indicate max process operations number
>        -- has_error: indicate if there is an error
>        -- return value: the number of successful completed operations.
>        -- example:
>           1) If there are already 32 completed ops, and 4th is error, and
>              nb_cpls is 32, then the ret will be 3(because 1/2/3th is OK), and
>              has_error will be true.
>           2) If there are already 32 completed ops, and all successful
>              completed, then the ret will be min(32, nb_cpls), and has_error
>              will be false.
>           3) If there are already 32 completed ops, and all failed completed,
>              then the ret will be 0, and has_error will be true.
+1 for this

>      uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
>                                           uint16_t nb_status, uint32_t *status)
>        -- return value: the number of failed completed operations.
>      And here I agree with Morten: we should design API which adapts to DPDK
>      service scenarios. So we don't support some sound-cards DMA, and 2D memory
>      copy which mainly used in video scenarios.

Can I suggest a few adjustments here to the semantics of this API. In
future we may have operations which return a status value, e.g. our
hardware can support ops like compare equal/not-equal, which means that
this API would be meaningful even in case of success. Therefore, I suggest
that the return value be changed to allow success also to be returned in
the array, and the return value is not the number of failed ops, but the
number of ops for which status is being returned.

Also for consideration: when trying to implement this in a prototype in our
driver, it would be easier if we relax the restriction on the "completed"
API so that we can flag has_error when an error is detected rather than
guaranteeing to return all elements right up to the error. For example, if
we have a burst of packets and one is problematic, it may be easier to flag
the error at the start of the burst and then have a few successful entries
at the start of the completed_status array. [Separate from this] We should
also have a "has_error" or "more_errors" flag on this API too, to indicate
when the user can switch back to using the regular "completed" API. This
means that apps switch from one API to the other when "has_error" is true,
and only switch back when it becomes false again.
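
The switching pattern described above could be sketched as follows, using a small software model in place of a real driver. All mock_* names are hypothetical, the cookie output parameter is left out for brevity, and mock_completed() follows the original "return everything up to the first error" semantics rather than the relaxed variant.

```c
#include <stdbool.h>
#include <stdint.h>

#define MOCK_NB_OPS 8u

/* Hypothetical model of a completion queue: one status word per finished
 * op, where 0 means success. */
struct mock_vq {
    uint32_t status[MOCK_NB_OPS];
    uint16_t head;   /* next completion to hand back to the application */
    uint16_t done;   /* number of ops the "hardware" has finished */
};

/* Fast path: return the number of successful ops found before the first
 * error (at most max), and set *has_error if an error is next in line. */
static uint16_t
mock_completed(struct mock_vq *q, uint16_t max, bool *has_error)
{
    uint16_t n = 0;

    *has_error = false;
    while (n < max && q->head < q->done) {
        if (q->status[q->head] != 0) {
            *has_error = true;
            break;
        }
        q->head++;
        n++;
    }
    return n;
}

/* Slow path: write one status per op, success or failure, and return the
 * number of entries written to status[]. */
static uint16_t
mock_completed_status(struct mock_vq *q, uint16_t max, uint32_t *status)
{
    uint16_t n = 0;

    while (n < max && q->head < q->done)
        status[n++] = q->status[q->head++];
    return n;
}

/* Drain the queue, using the status API only while an error is pending.
 * Returns the number of ops retired; *nb_failed counts the failed ones. */
static uint16_t
mock_drain(struct mock_vq *q, uint16_t *nb_failed)
{
    uint16_t total = 0;

    *nb_failed = 0;
    while (q->head < q->done) {
        bool has_error;

        total += mock_completed(q, MOCK_NB_OPS, &has_error);
        if (has_error) {
            uint32_t st;

            total += mock_completed_status(q, 1, &st);
            *nb_failed += (st != 0);
        }
    }
    return total;
}
```

With eight finished ops where only the 4th failed, the first mock_completed() call returns 3 with has_error set, which matches example 1) in the quoted proposal.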

>   6) The dma_cookie_t is signed int type, when <0 it mean error, it's
>      monotonically increasing base on HW-queue (other than virt-queue). The
>      driver needs to make sure this because the dmadev framework doesn't manage
>      the dma_cookie's creation.
+1 to this.
I think we should also specify that the cookie is guaranteed to wrap at a
power of 2 value (UINT16_MAX + 1??). This allows it to be used as an
index into a circular buffer just by masking.
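
A rough sketch of the masking idea, assuming the cookie wraps modulo 65536 (i.e. after UINT16_MAX) and that the application keeps a power-of-two sized array of per-job context. RING_SIZE, struct job_ctx and cookie_to_ctx() are illustrative names only, and the cookie is shown as uint16_t here even though the proposal uses a signed dma_cookie_t.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 1024u              /* hypothetical; must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

static_assert((RING_SIZE & RING_MASK) == 0, "RING_SIZE must be a power of two");

/* Hypothetical per-job context kept by the application. */
struct job_ctx {
    void *user_data;
};

static struct job_ctx ring[RING_SIZE];

/* Map a cookie to its ring slot. This only works because the cookie's wrap
 * point (65536 under the assumption above) is a multiple of RING_SIZE, so
 * consecutive cookies always land in consecutive slots, even across a wrap. */
static inline struct job_ctx *
cookie_to_ctx(uint16_t cookie)
{
    return &ring[cookie & RING_MASK];
}
```

If the cookie wrapped at a value that was not a multiple of the ring size, the slot sequence would jump at the wrap point and a modulo or compare would be needed instead of a mask.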

>   7) Because data-plane APIs are not thread-safe, and user could determine
>      virt-queue to HW-queue's map (at the queue-setup stage), so it is user's
>      duty to ensure thread-safe.
>   8) One example:
>      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
>      if (vq_id < 0) {
>         // create virt-queue failed
>         return;
>      }
>      // submit memcpy task
>      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
>      if (cookie < 0) {
>         // submit failed
>         return;
>      }
>      // get complete task
>      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
>      if (!has_error && ret == 1) {
>         // the memcpy successful complete
>      }
+1

>   9) As octeontx2_dma support sg-list which has many valid buffers in
>      dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
>   10) As ioat, it could declare support one HW-queue at dev_configure stage, and
>       only support create one virt-queue.
+1

>   11) As dpaa2_qdma, I think it could migrate to new framework, but still wait
>       for dpaa2_qdma guys feedback.
>   12) About the prototype src/dst parameters of rte_dmadev_memcpy API, we have
>       two candidates which are iova and void *, how about introduce dma_addr_t
>       type which could be va or iova ?
> 

Many thanks again.
Ananyev, Konstantin June 28, 2021, 11:14 a.m. UTC | #58
Hi everyone,

> On Sat, Jun 26, 2021 at 11:59:49AM +0800, fengchengwen wrote:
> > Hi, all
> >   I analyzed the current DPAM DMA driver and drew this summary in conjunction
> > with the previous discussion, and this will as a basis for the V2 implementation.
> >   Feedback is welcome, thanks
> >
> Fantastic review and summary, many thanks for the work. Some comments
> inline in API part below, but nothing too major, I hope.
> 
> /Bruce
> 
> <snip>
> >
> > Summary:
> >   1) The dpaa2/octeontx2/Kunpeng are all ARM soc, there may acts as endpoint of
> >      x86 host (e.g. smart NIC), multiple memory transfer requirements may exist,
> >      e.g. local-to-host/local-to-host..., from the point of view of API design,
> >      I think we should adopt a similar 'channel' or 'virt-queue' concept.
> >   2) Whether to create a separate dmadev for each HW-queue? We previously
> >      discussed this, and due HW-queue could independent management (like
> >      Kunpeng_dma and Intel DSA), we prefer create a separate dmadev for each
> >      HW-queue before. But I'm not sure if that's the case with dpaa. I think
> >      that can be left to the specific driver, no restriction is imposed on the
> >      framework API layer.
> >   3) I think we could setup following abstraction at dmadev device:
> >       ------------    ------------
> >       |virt-queue|    |virt-queue|
> >       ------------    ------------
> >              \           /
> >               \         /
> >                \       /
> >              ------------     ------------
> >              | HW-queue |     | HW-queue |
> >              ------------     ------------
> >                     \            /
> >                      \          /
> >                       \        /
> >                         dmadev
> >   4) The driver's ops design (here we only list key points):
> >      [dev_info_get]: mainly return the number of HW-queues
> >      [dev_configure]: nothing important
> >      [queue_setup]: create one virt-queue, has following main parameters:
> >          HW-queue-index: the HW-queue index used
> >          nb_desc: the number of HW descriptors
> >          opaque: driver's specific info
> >          Note1: this API return virt-queue index which will used in later API.
> >                 If user want create multiple virt-queue one the same HW-queue,
> >                 they could achieved by call queue_setup with the same
> >                 HW-queue-index.
> >          Note2: I think it's hard to define queue_setup config parameter, and
> >                 also this is control API, so I think it's OK to use opaque
> >                 pointer to implement it.
> I'm not sure opaque pointer will work in practice, so I think we should try
> and standardize the parameters as much as possible. Since it's a control
> plane API, using a struct with a superset of parameters may be workable.
> Let's start with a minimum set and build up from there.
> 
> >       [dma_copy/memset/sg]: all has vq_id input parameter.
> >          Note: I notice dpaa can't support single and sg in one virt-queue, and
> >                I think it's maybe software implement policy other than HW
> >                restriction because virt-queue could share the same HW-queue.
> Presumably for queues which support sg, the single-enqueue APIs can use a
> single sg list internally?
> 
> >       Here we use vq_id to tackle different scenario, like local-to-local/
> >       local-to-host and etc.
> >   5) And the dmadev public data-plane API (just prototype):
> >      dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
> >        -- flags: used as an extended parameter, it could be uint32_t
> 
> Suggest uint64_t rather than uint32_t to ensure we have expansion room?
> Otherwise +1
> 
> >      dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)
> +1
> 
> >      dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
> >        -- sg: struct dma_scatterlist array
> I don't think our drivers will be directly implementing this API, but so
> long as SG support is listed as a capability flag I'm fine with this as an
> API. [We can't fudge it as a bunch of single copies, because that would
> cause us to have multiple cookies rather than one]
> 
> >      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
> >                                    uint16_t nb_cpls, bool *has_error)
> >        -- nb_cpls: indicate max process operations number
> >        -- has_error: indicate if there is an error
> >        -- return value: the number of successful completed operations.
> >        -- example:
> >           1) If there are already 32 completed ops, and 4th is error, and
> >              nb_cpls is 32, then the ret will be 3(because 1/2/3th is OK), and
> >              has_error will be true.
> >           2) If there are already 32 completed ops, and all successful
> >              completed, then the ret will be min(32, nb_cpls), and has_error
> >              will be false.
> >           3) If there are already 32 completed ops, and all failed completed,
> >              then the ret will be 0, and has_error will be true.
> +1 for this
> 
> >      uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
> >                                           uint16_t nb_status, uint32_t *status)
> >        -- return value: the number of failed completed operations.
> >      And here I agree with Morten: we should design API which adapts to DPDK
> >      service scenarios. So we don't support some sound-cards DMA, and 2D memory
> >      copy which mainly used in video scenarios.
> 
> Can I suggest a few adjustments here to the semantics of this API. In
> future we may have operations which return a status value, e.g. our
> hardware can support ops like compare equal/not-equal, which means that
> this API would be meaningful even in case of success. Therefore, I suggest
> that the return value be changed to allow success also to be returned in
> the array, and the return value is not the number of failed ops, but the
> number of ops for which status is being returned.
> 
> Also for consideration: when trying to implement this in a prototype in our
> driver, it would be easier if we relax the restriction on the "completed"
> API so that we can flag has_error when an error is detected rather than
> guaranteeing to return all elements right up to the error. For example, if
> we have a burst of packets and one is problematic, it may be easier to flag
> the error at the start of the burst and then have a few successful entries
> at the start of the completed_status array. [Separate from this] We should
> also have a "has_error" or "more_errors" flag on this API too, to indicate
> when the user can switch back to using the regular "completed" API. This
> means that apps switch from one API to the other when "has_error" is true,
> and only switch back when it becomes false again.
> 
> >   6) The dma_cookie_t is signed int type, when <0 it mean error, it's
> >      monotonically increasing base on HW-queue (other than virt-queue). The
> >      driver needs to make sure this because the dmadev framework doesn't manage
> >      the dma_cookie's creation.
> +1 to this.
> I think we also should specify that the cookie is guaranteed to wrap at a
> power of 2 value (UINT16_MAX??). This allows it to be used as an
> index into a circular buffer just by masking.
> 
> >   7) Because data-plane APIs are not thread-safe, and user could determine
> >      virt-queue to HW-queue's map (at the queue-setup stage), so it is user's
> >      duty to ensure thread-safe.
> >   8) One example:
> >      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
> >      if (vq_id < 0) {
> >         // create virt-queue failed
> >         return;
> >      }
> >      // submit memcpy task
> >      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
> >      if (cookie < 0) {
> >         // submit failed
> >         return;
> >      }
> >      // get complete task
> >      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
> >      if (!has_error && ret == 1) {
> >         // the memcpy successful complete
> >      }
> +1

I have two questions on the proposed API:
1) Would it make sense to split the submission API into two stages:
    a) reserve and prepare
    b) actual submit.
Similar to what the DPDK ioat/idxd PMDs have right now:
/* reserve and prepare */
for (i = 0; i < num; i++) { cookie = rte_dmadev_memcpy(...); }
/* submit to HW */
rte_dmadev_issue_pending(...);

For those PMDs that prefer to do the actual submission to HW at rte_dmadev_memcpy(),
issue_pending() will be just a NOP.

As far as I can see, it will make the API more flexible and will help PMD developers
choose the most suitable approach for their HW.
As a side note, the Linux DMA framework uses such an approach too.

2) I wonder what the MT-safety requirements for the submit/completion API would be.
I.e. should all PMDs support the case when one thread does rte_dmadev_memcpy(...)
while another one does rte_dmadev_completed(...) on the same queue simultaneously?
Or should such a combination be ST only?
Or might it be a new capability flag per device?
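
To illustrate the two-stage split from question 1, here is a minimal software model. It is only a sketch: struct mock_dmadev, mock_memcpy() and mock_issue_pending() are hypothetical stand-ins for a driver, and the "doorbell" is just a field update rather than a register write.

```c
#include <stdint.h>

#define MOCK_RING_SIZE 64u   /* hypothetical; power of two */

struct mock_desc {
    const void *src;
    void *dst;
    uint32_t len;
};

struct mock_dmadev {
    struct mock_desc ring[MOCK_RING_SIZE];
    uint16_t sw_tail;   /* next slot to be filled by software */
    uint16_t hw_tail;   /* last tail value made visible to "hardware" */
    uint16_t head;      /* oldest descriptor not yet completed */
};

/* Stage a): reserve a slot and fill in the descriptor. Nothing is visible
 * to the hardware yet. Returns the cookie (>= 0), or -1 if the ring is full. */
static int
mock_memcpy(struct mock_dmadev *d, const void *src, void *dst, uint32_t len)
{
    if ((uint16_t)(d->sw_tail - d->head) >= MOCK_RING_SIZE)
        return -1;

    uint16_t cookie = d->sw_tail++;

    d->ring[cookie & (MOCK_RING_SIZE - 1u)] =
        (struct mock_desc){ .src = src, .dst = dst, .len = len };
    return cookie;
}

/* Stage b): publish everything prepared so far in a single step. A real
 * driver would ring a doorbell here. Returns the number of descriptors
 * newly submitted. */
static uint16_t
mock_issue_pending(struct mock_dmadev *d)
{
    uint16_t n = (uint16_t)(d->sw_tail - d->hw_tail);

    d->hw_tail = d->sw_tail;
    return n;
}
```

A PMD that submits directly inside its memcpy call would simply keep hw_tail equal to sw_tail there, which makes the issue_pending step a NOP, as noted above.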

> 
> >   9) As octeontx2_dma support sg-list which has many valid buffers in
> >      dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
> >   10) As ioat, it could declare support one HW-queue at dev_configure stage, and
> >       only support create one virt-queue.
> +1
> 
> >   11) As dpaa2_qdma, I think it could migrate to new framework, but still wait
> >       for dpaa2_qdma guys feedback.
> >   12) About the prototype src/dst parameters of rte_dmadev_memcpy API, we have
> >       two candidates which are iova and void *, how about introduce dma_addr_t
> >       type which could be va or iova ?
> >
> 
> Many thanks again.
Bruce Richardson June 28, 2021, 12:53 p.m. UTC | #59
On Mon, Jun 28, 2021 at 12:14:31PM +0100, Ananyev, Konstantin wrote:
> 
> Hi everyone,
> 
> > On Sat, Jun 26, 2021 at 11:59:49AM +0800, fengchengwen wrote:
> > > Hi, all
> > >   I analyzed the current DPAM DMA driver and drew this summary in conjunction
> > > with the previous discussion, and this will as a basis for the V2 implementation.
> > >   Feedback is welcome, thanks
> > >
> > Fantastic review and summary, many thanks for the work. Some comments
> > inline in API part below, but nothing too major, I hope.
> >
> > /Bruce
> >
> > <snip>
> > >
> > > Summary:
> > >   1) The dpaa2/octeontx2/Kunpeng are all ARM soc, there may acts as endpoint of
> > >      x86 host (e.g. smart NIC), multiple memory transfer requirements may exist,
> > >      e.g. local-to-host/local-to-host..., from the point of view of API design,
> > >      I think we should adopt a similar 'channel' or 'virt-queue' concept.
> > >   2) Whether to create a separate dmadev for each HW-queue? We previously
> > >      discussed this, and due HW-queue could independent management (like
> > >      Kunpeng_dma and Intel DSA), we prefer create a separate dmadev for each
> > >      HW-queue before. But I'm not sure if that's the case with dpaa. I think
> > >      that can be left to the specific driver, no restriction is imposed on the
> > >      framework API layer.
> > >   3) I think we could setup following abstraction at dmadev device:
> > >       ------------    ------------
> > >       |virt-queue|    |virt-queue|
> > >       ------------    ------------
> > >              \           /
> > >               \         /
> > >                \       /
> > >              ------------     ------------
> > >              | HW-queue |     | HW-queue |
> > >              ------------     ------------
> > >                     \            /
> > >                      \          /
> > >                       \        /
> > >                         dmadev
> > >   4) The driver's ops design (here we only list key points):
> > >      [dev_info_get]: mainly return the number of HW-queues
> > >      [dev_configure]: nothing important
> > >      [queue_setup]: create one virt-queue, has following main parameters:
> > >          HW-queue-index: the HW-queue index used
> > >          nb_desc: the number of HW descriptors
> > >          opaque: driver's specific info
> > >          Note1: this API return virt-queue index which will used in later API.
> > >                 If user want create multiple virt-queue one the same HW-queue,
> > >                 they could achieved by call queue_setup with the same
> > >                 HW-queue-index.
> > >          Note2: I think it's hard to define queue_setup config parameter, and
> > >                 also this is control API, so I think it's OK to use opaque
> > >                 pointer to implement it.
> > I'm not sure opaque pointer will work in practice, so I think we should try
> > and standardize the parameters as much as possible. Since it's a control
> > plane API, using a struct with a superset of parameters may be workable.
> > Let's start with a minimum set and build up from there.
> >
> > >       [dma_copy/memset/sg]: all has vq_id input parameter.
> > >          Note: I notice dpaa can't support single and sg in one virt-queue, and
> > >                I think it's maybe software implement policy other than HW
> > >                restriction because virt-queue could share the same HW-queue.
> > Presumably for queues which support sg, the single-enqueue APIs can use a
> > single sg list internally?
> >
> > >       Here we use vq_id to tackle different scenario, like local-to-local/
> > >       local-to-host and etc.
> > >   5) And the dmadev public data-plane API (just prototype):
> > >      dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
> > >        -- flags: used as an extended parameter, it could be uint32_t
> >
> > Suggest uint64_t rather than uint32_t to ensure we have expansion room?
> > Otherwise +1
> >
> > >      dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)
> > +1
> >
> > >      dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
> > >        -- sg: struct dma_scatterlist array
> > I don't think our drivers will be directly implementing this API, but so
> > long as SG support is listed as a capability flag I'm fine with this as an
> > API. [We can't fudge it as a bunch of single copies, because that would
> > cause us to have multiple cookies rather than one]
> >
> > >      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
> > >                                    uint16_t nb_cpls, bool *has_error)
> > >        -- nb_cpls: indicate max process operations number
> > >        -- has_error: indicate if there is an error
> > >        -- return value: the number of successful completed operations.
> > >        -- example:
> > >           1) If there are already 32 completed ops, and 4th is error, and
> > >              nb_cpls is 32, then the ret will be 3(because 1/2/3th is OK), and
> > >              has_error will be true.
> > >           2) If there are already 32 completed ops, and all successful
> > >              completed, then the ret will be min(32, nb_cpls), and has_error
> > >              will be false.
> > >           3) If there are already 32 completed ops, and all failed completed,
> > >              then the ret will be 0, and has_error will be true.
> > +1 for this
> >
> > >      uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
> > >                                           uint16_t nb_status, uint32_t *status)
> > >        -- return value: the number of failed completed operations.
> > >      And here I agree with Morten: we should design API which adapts to DPDK
> > >      service scenarios. So we don't support some sound-cards DMA, and 2D memory
> > >      copy which mainly used in video scenarios.
> >
> > Can I suggest a few adjustments here to the semantics of this API. In
> > future we may have operations which return a status value, e.g. our
> > hardware can support ops like compare equal/not-equal, which means that
> > this API would be meaningful even in case of success. Therefore, I suggest
> > that the return value be changed to allow success also to be returned in
> > the array, and the return value is not the number of failed ops, but the
> > number of ops for which status is being returned.
> >
> > Also for consideration: when trying to implement this in a prototype in our
> > driver, it would be easier if we relax the restriction on the "completed"
> > API so that we can flag has_error when an error is detected rather than
> > guaranteeing to return all elements right up to the error. For example, if
> > we have a burst of packets and one is problematic, it may be easier to flag
> > the error at the start of the burst and then have a few successful entries
> > at the start of the completed_status array. [Separate from this] We should
> > also have a "has_error" or "more_errors" flag on this API too, to indicate
> > when the user can switch back to using the regular "completed" API. This
> > means that apps switch from one API to the other when "has_error" is true,
> > and only switch back when it becomes false again.
> >
> > >   6) The dma_cookie_t is signed int type, when <0 it mean error, it's
> > >      monotonically increasing base on HW-queue (other than virt-queue). The
> > >      driver needs to make sure this because the dmadev framework doesn't manage
> > >      the dma_cookie's creation.
> > +1 to this.
> > I think we also should specify that the cookie is guaranteed to wrap at a
> > power of 2 value (UINT16_MAX??). This allows it to be used as an
> > index into a circular buffer just by masking.
> >
> > >   7) Because data-plane APIs are not thread-safe, and user could determine
> > >      virt-queue to HW-queue's map (at the queue-setup stage), so it is user's
> > >      duty to ensure thread-safe.
> > >   8) One example:
> > >      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
> > >      if (vq_id < 0) {
> > >         // create virt-queue failed
> > >         return;
> > >      }
> > >      // submit memcpy task
> > >      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
> > >      if (cookie < 0) {
> > >         // submit failed
> > >         return;
> > >      }
> > >      // get complete task
> > >      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
> > >      if (!has_error && ret == 1) {
> > >         // the memcpy successful complete
> > >      }
> > +1
> 
> I have two questions on the proposed API:
> 1) Would it make sense to split the submission API into two stages:
>     a) reserve and prepare
>     b) actual submit.
> Similar to what the DPDK ioat/idxd PMDs have right now:
> /* reserve and prepare */
> for (i = 0; i < num; i++) { cookie = rte_dmadev_memcpy(...); }
> /* submit to HW */
> rte_dmadev_issue_pending(...);
>
> For those PMDs that prefer to do the actual submission to HW at rte_dmadev_memcpy(),
> issue_pending() will be just a NOP.
>
> As far as I can see, it will make the API more flexible and will help PMD developers
> choose the most suitable approach for their HW.
> As a side note, the Linux DMA framework uses such an approach too.
> 

Thanks for pointing out the omission, Konstantin. I understood that to be
part of the original API proposals since we weren't doing burst enqueues,
but it would be good to see it explicitly called out here.

> 2) I wonder what the MT-safety requirements for the submit/completion API would be.
> I.e. should all PMDs support the case when one thread does rte_dmadev_memcpy(...)
> while another one does rte_dmadev_completed(...) on the same queue simultaneously?
> Or should such a combination be ST only?
> Or might it be a new capability flag per device?
> 

I suggest we just add a capability flag for it into our library. It was
something we looked to support with ioat in the past and may do so again in
the future.
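
As a rough illustration of how such a flag might be consumed, assuming the device info carries a 64-bit capability mask. The MOCK_DMA_CAP_* bits and struct mock_dma_info are hypothetical names for this sketch, not proposed API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits; names and bit positions are illustrative. */
#define MOCK_DMA_CAP_SG     (UINT64_C(1) << 0)  /* scatter-gather copies */
#define MOCK_DMA_CAP_MT_VQ  (UINT64_C(1) << 1)  /* submit and completion may
                                                 * run on different threads */

/* Hypothetical subset of what a dev_info_get call might report. */
struct mock_dma_info {
    uint64_t dev_capa;
};

/* True only if every bit in cap is advertised by the device. */
static inline bool
mock_dma_has_cap(const struct mock_dma_info *info, uint64_t cap)
{
    return (info->dev_capa & cap) == cap;
}
```

An application would check the MT bit once at setup time and fall back to a single thread for both submission and completion when it is absent.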

/Bruce
Jerin Jacob July 1, 2021, 3:01 p.m. UTC | #60
On Sat, Jun 26, 2021 at 9:29 AM fengchengwen <fengchengwen@huawei.com> wrote:
>
> Hi, all
>   I analyzed the current DPAM DMA driver and drew this summary in conjunction
> with the previous discussion, and this will as a basis for the V2 implementation.
>   Feedback is welcome, thanks

Thanks for the write-up.

>
> dpaa2_qdma:
>   [probe]: mainly obtains the number of hardware queues.
>   [dev_configure]: has following parameters:
>       max_hw_queues_per_core:
>       max_vqs: max number of virt-queue
>       fle_queue_pool_cnt: the size of FLE pool
>   [queue_setup]: setup up one virt-queue, has following parameters:
>       lcore_id:
>       flags: some control params, e.g. sg-list, longformat desc, exclusive HW
>              queue...
>       rbp: some misc field which impact the descriptor
>       Note: this API return the index of virt-queue which was successful
>             setuped.
>   [enqueue_bufs]: data-plane API, the key fields:
>       vq_id: the index of virt-queue
>           job: the pointer of job array
>           nb_jobs:
>           Note: one job has src/dest/len/flag/cnxt/status/vq_id/use_elem fields,
>             the flag field indicate whether src/dst is PHY addr.
>   [dequeue_bufs]: get the completed jobs's pointer
>
>   [key point]:
>       ------------    ------------
>       |virt-queue|    |virt-queue|
>       ------------    ------------
>              \           /
>               \         /
>                \       /
>              ------------     ------------
>              | HW-queue |     | HW-queue |
>              ------------     ------------
>                     \            /
>                      \          /
>                       \        /
>                       core/rawdev
>       1) In the probe stage, driver tell how many HW-queues could use.
>       2) User could specify the maximum number of HW-queues managed by a single
>          core in the dev_configure stage.
>       3) User could create one virt-queue by queue_setup API, the virt-queue has
>          two types: a) exclusive HW-queue, b) shared HW-queue(as described
>          above), this is achieved by the corresponding bit of flags field.
>       4) In this mode, queue management is simplified. User do not need to
>          specify the HW-queue to be applied for and create a virt-queue on the
>          HW-queue. All you need to do is say on which core I want to create a
>          virt-queue.
>       5) The virt-queue could have different capability, e.g. virt-queue-0
>          support scatter-gather format, and virt-queue-1 don't support sg, this
>          was control by flags and rbp fields in queue_setup stage.
>       6) The data-plane API use the definition similar to rte_mbuf and
>          rte_eth_rx/tx_burst().
>       PS: I still don't understand how sg-list enqueue/dequeue, and user how to
>           use RTE_QDMA_VQ_NO_RESPONSE.
>
>       Overall, I think it's a flexible design with many scalability. Especially
>       the queue resource pool architecture, simplifies user invocations,
>       although the 'core' introduces a bit abruptly.
>
>
> octeontx2_dma:
>   [dev_configure]: has one parameters:
>       chunk_pool: it's strange why it's not managed internally by the driver,
>                   but passed in through the API.
>   [enqueue_bufs]: has three important parameters:
>       context: this is what Jerin referred to 'channel', it could hold the
>                completed ring of the job.
>       buffers: hold the pointer array of dpi_dma_buf_ptr_s
>       count: how many dpi_dma_buf_ptr_s
>           Note: one dpi_dma_buf_ptr_s may has many src and dst pairs (it's scatter-
>             gather list), and has one completed_ptr (when HW complete it will
>             write one value to this ptr), current the completed_ptr pointer
>             struct:
>                 struct dpi_dma_req_compl_s {
>                     uint64_t cdata;  --driver init and HW update result to this.
>                     void (*compl_cb)(void *dev, void *arg);
>                     void *cb_data;
>                 };
>   [dequeue_bufs]: has two important parameters:
>       context: driver will scan it's completed ring to get complete info.
>       buffers: hold the pointer array of completed_ptr.
>
>   [key point]:
>       -----------    -----------
>       | channel |    | channel |
>       -----------    -----------
>              \           /
>               \         /
>                \       /
>              ------------
>              | HW-queue |
>              ------------
>                    |
>                 --------
>                 |rawdev|
>                 --------
>       1) User could create one channel by init context(dpi_dma_queue_ctx_s),
>          this interface is not standardized and needs to be implemented by
>          users.
>       2) Different channels can support different transmissions, e.g. one for
>          inner m2m, and other for inbound copy.
>
>       Overall, I think the 'channel' is similar the 'virt-queue' of dpaa2_qdma.
>       The difference is that dpaa2_qdma supports multiple hardware queues. The
>       'channel' has following

If dpaa2_qdma supports more than one HW queue, I think it is good to have
the queue notion in DPDK, just like other DPDK device classes. It would be
good to have confirmation from the dpaa2 folks, @Hemant Agrawal, on whether
there really is more than one HW queue in the dpaa device.


IMO, channel is a better name than virtual queue. The reason is that
virtual queue is a more implementation-specific notion, and there is no
need to have this in the API specification.


>       1) A channel is an operable unit at the user level. User can create a
>          channel for each transfer type, for example, a local-to-local channel,
>          and a local-to-host channel. User could also get the completed status
>          of one channel.
>       2) Multiple channels can run on the same HW-queue. In terms of API design,
>          this design reduces the number of data-plane API parameters. The
>          channel could has context info which will referred by data-plane APIs
>          execute.
>
>
> ioat:
>   [probe]: create multiple rawdev if it's DSA device and has multiple HW-queues.
>   [dev_configure]: has three parameters:
>       ring_size: the HW descriptor size.
>       hdls_disable: whether ignore user-supplied handle params
>       no_prefetch_completions:
>   [rte_ioat_enqueue_copy]: has dev_id/src/dst/length/src_hdl/dst_hdl parameters.
>   [rte_ioat_completed_ops]: has dev_id/max_copies/status/num_unsuccessful/
>                             src_hdls/dst_hdls parameters.
>
>   Overall, one HW-queue one rawdev, and don't have many 'channel' which similar
>   to octeontx2_dma.
>
>
> Kunpeng_dma:
>   1) The hardware support multiple modes(e.g. local-to-local/local-to-pciehost/
>      pciehost-to-local/immediated-to-local copy).
>      Note: Currently, we only implement local-to-local copy.
>   2) The hardware support multiple HW-queues.
>
>
> Summary:
>   1) The dpaa2/octeontx2/Kunpeng are all ARM soc, there may acts as endpoint of
>      x86 host (e.g. smart NIC), multiple memory transfer requirements may exist,
>      e.g. local-to-host/local-to-host..., from the point of view of API design,
>      I think we should adopt a similar 'channel' or 'virt-queue' concept.

+1 for analysis.

>   2) Whether to create a separate dmadev for each HW-queue? We previously
>      discussed this, and due HW-queue could independent management (like
>      Kunpeng_dma and Intel DSA), we prefer create a separate dmadev for each
>      HW-queue before. But I'm not sure if that's the case with dpaa. I think
>      that can be left to the specific driver, no restriction is imposed on the
>      framework API layer.

+1

>   3) I think we could setup following abstraction at dmadev device:
>       ------------    ------------
>       |virt-queue|    |virt-queue|
>       ------------    ------------
>              \           /
>               \         /
>                \       /
>              ------------     ------------
>              | HW-queue |     | HW-queue |
>              ------------     ------------
>                     \            /
>                      \          /
>                       \        /
>                         dmadev

Other than the name (virt-queue vs channel), +1.

>   4) The driver's ops design (here we only list key points):
>      [dev_info_get]: mainly return the number of HW-queues

max number of channel/virt queues too.

>      [dev_configure]: nothing important

Need to specify the number of channel/virt-queues to configure for this device.

>      [queue_setup]: create one virt-queue, has following main parameters:
>          HW-queue-index: the HW-queue index used
>          nb_desc: the number of HW descriptors
>          opaque: driver's specific info
>          Note1: this API returns the virt-queue index, which will be used in
>                 later API calls. If the user wants to create multiple
>                 virt-queues on the same HW-queue, this can be achieved by
>                 calling queue_setup with the same HW-queue-index.
>          Note2: I think it's hard to define queue_setup config parameter, and
>                 also this is control API, so I think it's OK to use opaque
>                 pointer to implement it.
>       [dma_copy/memset/sg]: all has vq_id input parameter.
>          Note: I notice dpaa can't support single and sg in one virt-queue, and
>                I think it's maybe software implement policy other than HW
>                restriction because virt-queue could share the same HW-queue.
>       Here we use vq_id to tackle different scenario, like local-to-local/
>       local-to-host and etc.

IMO, the index representation has additional overhead, as one needs to
translate it to a memory pointer. I prefer to avoid that by having an
object handle, with a _lookup() API to get the handle so that it works in
multi-process cases, avoiding the additional indirection. Like the mempool
object.
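
A minimal sketch of the handle-plus-lookup idea above, in the spirit of how a mempool is found by name. All names here (demo_dma_vq, demo_dma_vq_create, demo_dma_vq_lookup) are hypothetical and not part of any proposed dmadev API; the table is a plain static array, whereas a real multi-process setup would keep it in shared memory.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_VQ   8
#define DEMO_NAME_LEN 32

/* Hypothetical virt-queue object that the fast path would receive directly. */
struct demo_dma_vq {
	char name[DEMO_NAME_LEN];
	int in_use;
	unsigned int hw_queue;
};

/* Assumption: in a real design this table lives in shared memory. */
static struct demo_dma_vq demo_vq_table[DEMO_MAX_VQ];

/* Control path: create a named virt-queue and return its handle. */
struct demo_dma_vq *
demo_dma_vq_create(const char *name, unsigned int hw_queue)
{
	for (size_t i = 0; i < DEMO_MAX_VQ; i++) {
		struct demo_dma_vq *vq = &demo_vq_table[i];

		if (vq->in_use)
			continue;
		strncpy(vq->name, name, DEMO_NAME_LEN - 1);
		vq->name[DEMO_NAME_LEN - 1] = '\0';
		vq->hw_queue = hw_queue;
		vq->in_use = 1;
		return vq;
	}
	return NULL; /* table full */
}

/* Control path: find an existing virt-queue by name, e.g. from a
 * secondary process. The returned pointer is then used on the fast path
 * without any per-call index translation. */
struct demo_dma_vq *
demo_dma_vq_lookup(const char *name)
{
	for (size_t i = 0; i < DEMO_MAX_VQ; i++) {
		struct demo_dma_vq *vq = &demo_vq_table[i];

		if (vq->in_use && strncmp(vq->name, name, DEMO_NAME_LEN) == 0)
			return vq;
	}
	return NULL;
}
```

The lookup cost is paid once on the control path; the data-plane calls would take the returned pointer as-is.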

>   5) And the dmadev public data-plane API (just prototype):
>      dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
>        -- flags: used as an extended parameter, it could be uint32_t
>      dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)
>      dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
>        -- sg: struct dma_scatterlist array
>      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
>                                    uint16_t nb_cpls, bool *has_error)
>        -- nb_cpls: indicate max process operations number
>        -- has_error: indicate if there is an error
>        -- return value: the number of successful completed operations.
>        -- example:
>           1) If there are already 32 completed ops, and 4th is error, and
>              nb_cpls is 32, then the ret will be 3(because 1/2/3th is OK), and
>              has_error will be true.
>           2) If there are already 32 completed ops, and all successful
>              completed, then the ret will be min(32, nb_cpls), and has_error
>              will be false.
>           3) If there are already 32 completed ops, and all failed completed,
>              then the ret will be 0, and has_error will be true.

+1. IMO, it is better to call it ring_idx instead of a cookie, to enforce
that it is the ring index.
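
The three return-value cases quoted above can be checked against a small software model. This is only a sketch under assumptions: demo_cpl_queue and demo_completed are made-up names, the "hardware" is an array of pass/fail flags, and the index is reported as an unsigned ring_idx rather than a dma_cookie_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue completion state, for illustration only. */
struct demo_cpl_queue {
	const bool *failed; /* failed[i] is true if the i-th finished op hit an error */
	uint16_t nb_done;   /* number of ops the hardware has finished */
	uint16_t next;      /* first finished op not yet reported */
};

/*
 * Report up to nb_cpls successfully completed ops, stopping at the first
 * failed one. *last_idx receives the index of the last op reported as
 * successful (left untouched if there is none); *has_error tells whether
 * the scan stopped on a failed op.
 */
uint16_t
demo_completed(struct demo_cpl_queue *q, uint16_t nb_cpls,
	       uint16_t *last_idx, bool *has_error)
{
	uint16_t count = 0;

	*has_error = false;
	while (count < nb_cpls && q->next < q->nb_done) {
		if (q->failed[q->next]) {
			*has_error = true;
			break;
		}
		*last_idx = q->next++;
		count++;
	}
	return count;
}
```

With 32 finished ops and the 4th failed, a call with nb_cpls = 32 returns 3 and sets has_error; the failed op itself is left for a status-reporting call to consume.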

>      uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
>                                           uint16_t nb_status, uint32_t *status)
>        -- return value: the number of failed completed operations.

See above. Here we are assuming it is an index; otherwise we need to
pass an array of cookies.

>      And here I agree with Morten: we should design API which adapts to DPDK
>      service scenarios. So we don't support some sound-cards DMA, and 2D memory
>      copy which mainly used in video scenarios.
>   6) The dma_cookie_t is signed int type, when <0 it mean error, it's
>      monotonically increasing base on HW-queue (other than virt-queue). The
>      driver needs to ensure this because the dmadev framework doesn't manage
>      the dma_cookie's creation.

+1 and see above.

>   7) Because data-plane APIs are not thread-safe, and user could determine
>      virt-queue to HW-queue's map (at the queue-setup stage), so it is user's
>      duty to ensure thread-safe.

+1. But I am not sure how easy it is for the fast-path application to have
this logic. Instead, I think it is better for the driver to report the
capability per queue, and in channel configuration the application can
state its requirement (whether multiple producers enqueue to the same HW
queue or not). Based on the request, the implementation can pick the
correct function pointer for enqueue (locked vs lockless version, if the
HW does not support lockless).
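
A rough sketch of the capability-driven selection described above. The capability bit DEMO_CAPA_MT_ENQ, the struct layout, and the function names are assumptions for illustration; the lock is a C11 atomic_flag standing in for whatever spinlock a driver would use.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical capability: HW itself tolerates concurrent enqueue. */
#define DEMO_CAPA_MT_ENQ (1u << 0)

struct demo_vq;
typedef int (*demo_enq_fn)(struct demo_vq *vq, uint32_t len);

struct demo_vq {
	demo_enq_fn enqueue; /* chosen once at setup, called on the fast path */
	atomic_flag lock;    /* only used by the locked variant */
	uint32_t nb_enq;
};

int
demo_enq_lockless(struct demo_vq *vq, uint32_t len)
{
	(void)len;     /* a real driver would write a descriptor here */
	vq->nb_enq++;
	return 0;
}

int
demo_enq_locked(struct demo_vq *vq, uint32_t len)
{
	int ret;

	while (atomic_flag_test_and_set_explicit(&vq->lock, memory_order_acquire))
		; /* spin until the lock is free */
	ret = demo_enq_lockless(vq, len);
	atomic_flag_clear_explicit(&vq->lock, memory_order_release);
	return ret;
}

/* Control path: take the lock only if the application asked for multiple
 * producers and the hardware cannot provide that by itself. */
void
demo_vq_setup(struct demo_vq *vq, uint32_t dev_capa, int multi_producer)
{
	atomic_flag_clear(&vq->lock);
	vq->nb_enq = 0;
	if (multi_producer && !(dev_capa & DEMO_CAPA_MT_ENQ))
		vq->enqueue = demo_enq_locked;
	else
		vq->enqueue = demo_enq_lockless;
}
```

Single-producer users never pay for the lock, and the fast path has no branch on the threading mode.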


>   8) One example:
>      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
>      if (vq_id < 0) {
>         // create virt-queue failed
>         return;
>      }
>      // submit memcpy task
>      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
>      if (cookie < 0) {
>         // submit failed
>         return;
>      }
>      // get complete task
>      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
>      if (!has_error && ret == 1) {
>         // the memcpy completed successfully
>      }

+1

>   9) As octeontx2_dma support sg-list which has many valid buffers in
>      dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.

+1

>   10) As ioat, it could declare support for one HW-queue at dev_configure
>       stage, and only support creating one virt-queue.
>   11) As dpaa2_qdma, I think it could migrate to new framework, but still wait
>       for dpaa2_qdma guys feedback.
>   12) About the prototype src/dst parameters of rte_dmadev_memcpy API, we have
>       two candidates which are iova and void *, how about introduce dma_addr_t
>       type which could be va or iova ?

I think the conversion looks ugly; better to have void * and expose the
constraints of void * as a limitation/capability flag, so that the driver
can update it.


>
Bruce Richardson July 1, 2021, 4:33 p.m. UTC | #61
On Thu, Jul 01, 2021 at 08:31:00PM +0530, Jerin Jacob wrote:
> On Sat, Jun 26, 2021 at 9:29 AM fengchengwen <fengchengwen@huawei.com> wrote:
> >
> > Hi, all
> >   I analyzed the current DPAM DMA driver and drew this summary in conjunction
> > with the previous discussion, and this will as a basis for the V2 implementation.
> >   Feedback is welcome, thanks
> 
> Thanks for the write-up.
> 
> >
<snip> 
> >      [queue_setup]: create one virt-queue, has following main parameters:
> >          HW-queue-index: the HW-queue index used
> >          nb_desc: the number of HW descriptors
> >          opaque: driver's specific info
> > >          Note1: this API returns the virt-queue index, which will be used in
> > >                 later API calls. If the user wants to create multiple
> > >                 virt-queues on the same HW-queue, this can be achieved by
> > >                 calling queue_setup with the same HW-queue-index.
> > >          Note2: I think it's hard to define queue_setup config parameter, and
> >                 also this is control API, so I think it's OK to use opaque
> >                 pointer to implement it.
> >       [dma_copy/memset/sg]: all has vq_id input parameter.
> >          Note: I notice dpaa can't support single and sg in one virt-queue, and
> >                I think it's maybe software implement policy other than HW
> >                restriction because virt-queue could share the same HW-queue.
> >       Here we use vq_id to tackle different scenario, like local-to-local/
> >       local-to-host and etc.
> 
> IMO, The index representation has an additional overhead as one needs
> to translate it
> to memory pointer. I prefer to avoid by having object handle and use
> _lookup() API get to make it work
> in multi-process cases to avoid the additional indirection. Like mempool object.

While it doesn't matter to me that much which is chosen, an index seems
cleaner to me and more consistent with other device types in DPDK. Can't
the objects pointed to by the memory pointers you refer to just be stored
in an internal array in your driver and accessed directly by index, saving
that layer of redirection?
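
To illustrate the question above: if the per-queue state is embedded by value in a driver-internal array, an index resolves to it with address arithmetic only. The structures and names below (demo_dev, demo_vq_state, demo_enqueue) are hypothetical, as is the bounds check on the fast path.

```c
#include <stdint.h>

#define DEMO_MAX_DEVS 4
#define DEMO_MAX_VQS  8

/* Hypothetical per-virt-queue state, stored by value (not via pointer). */
struct demo_vq_state {
	uint16_t next_idx;
};

struct demo_dev {
	struct demo_vq_state vqs[DEMO_MAX_VQS];
	uint16_t nb_vqs;
};

static struct demo_dev demo_devs[DEMO_MAX_DEVS];

/* Fast-path lookup: two array offsets from a fixed base, no pointer load. */
static inline struct demo_vq_state *
demo_vq_get(uint16_t dev_id, uint16_t vq_id)
{
	return &demo_devs[dev_id].vqs[vq_id];
}

/* Returns the ring index assigned to the op, or -1 on a bad dev/queue id. */
int
demo_enqueue(uint16_t dev_id, uint16_t vq_id)
{
	if (dev_id >= DEMO_MAX_DEVS || vq_id >= demo_devs[dev_id].nb_vqs)
		return -1;
	return demo_vq_get(dev_id, vq_id)->next_idx++;
}
```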

> >   5) And the dmadev public data-plane API (just prototype):
> >      dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
> >        -- flags: used as an extended parameter, it could be uint32_t
> >      dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)
> >      dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
> >        -- sg: struct dma_scatterlist array
> >      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
> >                                    uint16_t nb_cpls, bool *has_error)
> >        -- nb_cpls: indicate max process operations number
> >        -- has_error: indicate if there is an error
> >        -- return value: the number of successful completed operations.
> >        -- example:
> >           1) If there are already 32 completed ops, and 4th is error, and
> >              nb_cpls is 32, then the ret will be 3(because 1/2/3th is OK), and
> >              has_error will be true.
> >           2) If there are already 32 completed ops, and all successful
> >              completed, then the ret will be min(32, nb_cpls), and has_error
> >              will be false.
> >           3) If there are already 32 completed ops, and all failed completed,
> >              then the ret will be 0, and has_error will be true.
> 
> +1. IMO, it is better to call ring_idx instead of a cookie. To enforce
> that it the ring index.
> 
+1, I like that name better too.

> >      uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
> >                                           uint16_t nb_status, uint32_t *status)
> >        -- return value: the number of failed completed operations.
> 
> See above. Here we are assuming it is an index otherwise we need to
> pass an array
> cookies.
> 
> >      And here I agree with Morten: we should design API which adapts to DPDK
> >      service scenarios. So we don't support some sound-cards DMA, and 2D memory
> >      copy which mainly used in video scenarios.
> >   6) The dma_cookie_t is signed int type, when <0 it mean error, it's
> >      monotonically increasing base on HW-queue (other than virt-queue). The
> > >      driver needs to ensure this because the dmadev framework doesn't manage
> >      the dma_cookie's creation.
> 
> +1 and see above.
> 
> >   7) Because data-plane APIs are not thread-safe, and user could determine
> >      virt-queue to HW-queue's map (at the queue-setup stage), so it is user's
> >      duty to ensure thread-safe.
> 
> +1. But I am not sure how easy for the fast-path application to have this logic,
> Instead, I think, it is better to tell the capa for queue by driver
> and in channel configuration,
> the application can request for requirement (Is multiple producers enq
> to the same HW queue or not).
> Based on the request, the implementation can pick the correct function
> pointer for enq.(lock vs lockless version if HW does not support
> lockless)
> 

Non-thread safety is the norm in DPDK for all other queue resources;
however, having multi-thread safety as a capability sounds reasonable.
> 
> > >   8) One example:
> > >      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
> > >      if (vq_id < 0) {
> > >         // create virt-queue failed
> > >         return;
> > >      }
> > >      // submit memcpy task
> > >      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
> > >      if (cookie < 0) {
> > >         // submit failed
> > >         return;
> > >      }
> > >      // get complete task
> > >      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
> > >      if (!has_error && ret == 1) {
> > >         // the memcpy completed successfully
> > >      }
> 
> +1
> 
> >   9) As octeontx2_dma support sg-list which has many valid buffers in
> >   dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
> 
> +1
> 
> > >   10) As ioat, it could declare support for one HW-queue at dev_configure
> > >       stage, and only support creating one virt-queue.
> > >   11) As dpaa2_qdma, I think it could migrate to new framework, but still
> > >       wait for dpaa2_qdma guys feedback.
> > >   12) About the prototype src/dst parameters of rte_dmadev_memcpy API, we
> > >       have two candidates which are iova and void *, how about introduce
> > >       dma_addr_t type which could be va or iova ?
> 
> I think, conversion looks ugly, better to have void * and share the
> constraints of void * as limitation/capability using flag. So that driver
> can update it.
>
I'm ok with either rte_iova_t or void * as parameter type. Let's not define
a new type though, and +1 to just using capabilities to define what kinds
of addressing are supported by the device instances.
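
As a sketch of what "capabilities for addressing" could look like, assuming made-up flag names (DEMO_DEV_CAPA_VA, DEMO_DEV_CAPA_IOVA, DEMO_OP_ADDR_IOVA) that are not taken from any posted patch: the device advertises which address kinds it accepts, and a per-op flag says how src/dst should be interpreted.

```c
#include <stdint.h>

/* Hypothetical device capabilities reported through dev_info_get. */
#define DEMO_DEV_CAPA_VA   (1u << 0) /* accepts process virtual addresses */
#define DEMO_DEV_CAPA_IOVA (1u << 1) /* accepts IOVA (bus/physical) addresses */

/* Hypothetical per-operation flag: src/dst carry IOVAs rather than VAs. */
#define DEMO_OP_ADDR_IOVA  (1u << 0)

struct demo_dev_info {
	uint32_t capa;
};

/* Returns 0 if the device supports the addressing mode requested by
 * op_flags, -1 otherwise. Intended for setup time, not the fast path. */
int
demo_check_addr_mode(const struct demo_dev_info *info, uint32_t op_flags)
{
	uint32_t need = (op_flags & DEMO_OP_ADDR_IOVA) ?
			DEMO_DEV_CAPA_IOVA : DEMO_DEV_CAPA_VA;

	return (info->capa & need) ? 0 : -1;
}
```

An application would run this check once after querying the device and then use a single parameter type on the data path.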

/Bruce
Liang Ma July 2, 2021, 7:07 a.m. UTC | #62
On Sat, Jun 26, 2021 at 11:59:49AM +0800, fengchengwen wrote:
> Hi, all
>   I analyzed the current DPAM DMA driver and drew this summary in conjunction
> with the previous discussion, and this will as a basis for the V2 implementation.
>   Feedback is welcome, thanks
> 
> 
> dpaa2_qdma:
>   [probe]: mainly obtains the number of hardware queues.
>   [dev_configure]: has following parameters:
>       max_hw_queues_per_core:
>       max_vqs: max number of virt-queue
>       fle_queue_pool_cnt: the size of FLE pool
>   [queue_setup]: setup up one virt-queue, has following parameters:
>       lcore_id:
>       flags: some control params, e.g. sg-list, longformat desc, exclusive HW
>              queue...
>       rbp: some misc field which impact the descriptor
>       Note: this API return the index of virt-queue which was successful
>             setuped.
>   [enqueue_bufs]: data-plane API, the key fields:
>       vq_id: the index of virt-queue
>       job: the pointer of job array
>       nb_jobs:
>       Note: one job has src/dest/len/flag/cnxt/status/vq_id/use_elem fields,
>             the flag field indicate whether src/dst is PHY addr.
>   [dequeue_bufs]: get the completed jobs's pointer
> 
>   [key point]:
>       ------------    ------------
>       |virt-queue|    |virt-queue|
>       ------------    ------------
>              \           /
>               \         /
>                \       /
>              ------------     ------------
>              | HW-queue |     | HW-queue |
>              ------------     ------------
>                     \            /
>                      \          /
>                       \        /
>                       core/rawdev
>       1) In the probe stage, driver tell how many HW-queues could use.
>       2) User could specify the maximum number of HW-queues managed by a single
>          core in the dev_configure stage.
>       3) User could create one virt-queue by queue_setup API, the virt-queue has
>          two types: a) exclusive HW-queue, b) shared HW-queue(as described
>          above), this is achieved by the corresponding bit of flags field.
>       4) In this mode, queue management is simplified. User do not need to
>          specify the HW-queue to be applied for and create a virt-queue on the
>          HW-queue. All you need to do is say on which core I want to create a
>          virt-queue.
>       5) The virt-queue could have different capability, e.g. virt-queue-0
>          support scatter-gather format, and virt-queue-1 don't support sg, this
>          was control by flags and rbp fields in queue_setup stage.
>       6) The data-plane API use the definition similar to rte_mbuf and
>          rte_eth_rx/tx_burst().
>       PS: I still don't understand how sg-list enqueue/dequeue, and user how to
>           use RTE_QDMA_VQ_NO_RESPONSE.
> 
>       Overall, I think it's a flexible design with many scalability. Especially
>       the queue resource pool architecture, simplifies user invocations,
>       although the 'core' introduces a bit abruptly.
> 
> 
> octeontx2_dma:
>   [dev_configure]: has one parameters:
>       chunk_pool: it's strange why it's not managed internally by the driver,
>                   but passed in through the API.
>   [enqueue_bufs]: has three important parameters:
>       context: this is what Jerin referred to 'channel', it could hold the
>                completed ring of the job.
>       buffers: hold the pointer array of dpi_dma_buf_ptr_s
>       count: how many dpi_dma_buf_ptr_s
> 	  Note: one dpi_dma_buf_ptr_s may has many src and dst pairs (it's scatter-
>             gather list), and has one completed_ptr (when HW complete it will
>             write one value to this ptr), current the completed_ptr pointer
>             struct:
>                 struct dpi_dma_req_compl_s {
>                     uint64_t cdata;  --driver init and HW update result to this.
>                     void (*compl_cb)(void *dev, void *arg);
>                     void *cb_data;
>                 };
>   [dequeue_bufs]: has two important parameters:
>       context: driver will scan it's completed ring to get complete info.
>       buffers: hold the pointer array of completed_ptr.
> 
>   [key point]:
>       -----------    -----------
>       | channel |    | channel |
>       -----------    -----------
>              \           /
>               \         /
>                \       /
>              ------------
>              | HW-queue |
>              ------------
>                    |
>                 --------
>                 |rawdev|
>                 --------
>       1) User could create one channel by init context(dpi_dma_queue_ctx_s),
>          this interface is not standardized and needs to be implemented by
>          users.
>       2) Different channels can support different transmissions, e.g. one for
>          inner m2m, and other for inbound copy.
> 
>       Overall, I think the 'channel' is similar to the 'virt-queue' of
>       dpaa2_qdma. The difference is that dpaa2_qdma supports multiple hardware
>       queues. The 'channel' has the following characteristics:
>       1) A channel is an operable unit at the user level. User can create a
>          channel for each transfer type, for example, a local-to-local channel,
>          and a local-to-host channel. User could also get the completed status
>          of one channel.
>       2) Multiple channels can run on the same HW-queue. In terms of API design,
>          this design reduces the number of data-plane API parameters. The
>          channel could has context info which will referred by data-plane APIs
>          execute.
> 
> 
> ioat:
>   [probe]: create multiple rawdev if it's DSA device and has multiple HW-queues.
>   [dev_configure]: has three parameters:
>       ring_size: the HW descriptor size.
>       hdls_disable: whether ignore user-supplied handle params
>       no_prefetch_completions:
>   [rte_ioat_enqueue_copy]: has dev_id/src/dst/length/src_hdl/dst_hdl parameters.
>   [rte_ioat_completed_ops]: has dev_id/max_copies/status/num_unsuccessful/
>                             src_hdls/dst_hdls parameters.
> 
>   Overall, one HW-queue one rawdev, and don't have many 'channel' which similar
>   to octeontx2_dma.
> 
> 
> Kunpeng_dma:
>   1) The hardware supports multiple modes (e.g. local-to-local/local-to-pciehost/
>      pciehost-to-local/immediated-to-local copy).
>      Note: Currently, we only implement local-to-local copy.
>   2) The hardware supports multiple HW-queues.
> 
> 
> Summary:
>   1) The dpaa2/octeontx2/Kunpeng are all ARM soc, there may acts as endpoint of
>      x86 host (e.g. smart NIC), multiple memory transfer requirements may exist,
>      e.g. local-to-host/local-to-host..., from the point of view of API design,
>      I think we should adopt a similar 'channel' or 'virt-queue' concept.
>   2) Whether to create a separate dmadev for each HW-queue? We previously
>      discussed this, and due HW-queue could indepent management (like
>      Kunpeng_dma and Intel DSA), we prefer create a separate dmadev for each
>      HW-queue before. But I'm not sure if that's the case with dpaa. I think
>      that can be left to the specific driver, no restriction is imposed on the
>      framework API layer.
>   3) I think we could setup following abstraction at dmadev device:
>       ------------    ------------
>       |virt-queue|    |virt-queue|
>       ------------    ------------
>              \           /
>               \         /
>                \       /
>              ------------     ------------
>              | HW-queue |     | HW-queue |
>              ------------     ------------
>                     \            /
>                      \          /
>                       \        /
>                         dmadev
>   4) The driver's ops design (here we only list key points):
>      [dev_info_get]: mainly return the number of HW-queues
>      [dev_configure]: nothing important
>      [queue_setup]: create one virt-queue, has following main parameters:
>          HW-queue-index: the HW-queue index used
>          nb_desc: the number of HW descriptors
>          opaque: driver's specific info
>          Note1: this API returns the virt-queue index, which will be used in
>                 later API calls. If the user wants to create multiple
>                 virt-queues on the same HW-queue, this can be achieved by
>                 calling queue_setup with the same HW-queue-index.
>          Note2: I think it's hard to define queue_setup config parameter, and
>                 also this is control API, so I think it's OK to use opaque
>                 pointer to implement it.
>       [dma_copy/memset/sg]: all has vq_id input parameter.
>          Note: I notice dpaa can't support single and sg in one virt-queue, and
>                I think it's maybe software implement policy other than HW
>                restriction because virt-queue could share the same HW-queue.
>       Here we use vq_id to tackle different scenario, like local-to-local/
>       local-to-host and etc.
>   5) And the dmadev public data-plane API (just prototype):
>      dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
>        -- flags: used as an extended parameter, it could be uint32_t
>      dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)
>      dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
>        -- sg: struct dma_scatterlist array
>      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
>                                    uint16_t nb_cpls, bool *has_error)
>        -- nb_cpls: indicate max process operations number
>        -- has_error: indicate if there is an error
>        -- return value: the number of successful completed operations.
>        -- example:
>           1) If there are already 32 completed ops, and 4th is error, and
>              nb_cpls is 32, then the ret will be 3(because 1/2/3th is OK), and
>              has_error will be true.
>           2) If there are already 32 completed ops, and all successful
>              completed, then the ret will be min(32, nb_cpls), and has_error
>              will be false.
>           3) If there are already 32 completed ops, and all failed completed,
>              then the ret will be 0, and has_error will be true.
>      uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
>                                           uint16_t nb_status, uint32_t *status)
>        -- return value: the number of failed completed operations.
>      And here I agree with Morten: we should design API which adapts to DPDK
>      service scenarios. So we don't support some sound-cards DMA, and 2D memory
>      copy which mainly used in video scenarios.
>   6) The dma_cookie_t is signed int type, when <0 it mean error, it's
>      monotonically increasing base on HW-queue (other than virt-queue). The
>      driver needs to ensure this because the dmadev framework doesn't manage
>      the dma_cookie's creation.
>   7) Because data-plane APIs are not thread-safe, and user could determine
>      virt-queue to HW-queue's map (at the queue-setup stage), so it is user's
>      duty to ensure thread-safe.
>   8) One example:
>      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
>      if (vq_id < 0) {
>         // create virt-queue failed
>         return;
>      }
>      // submit memcpy task
>      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
>      if (cookie < 0) {
>         // submit failed
>         return;
>      }
IMO, rte_dmadev_memcpy should return the number of ops successfully
submitted; that makes it easier to re-submit if the previous batch was
not fully submitted.
>      // get complete task
>      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
>      if (!has_error && ret == 1) {
>         // the memcpy successful complete
>      }
>   9) As octeontx2_dma support sg-list which has many valid buffers in
>      dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
>   10) As ioat, it could declare support for one HW-queue at dev_configure
>       stage, and only support creating one virt-queue.
>   11) As dpaa2_qdma, I think it could migrate to new framework, but still wait
>       for dpaa2_qdma guys feedback.
>   12) About the prototype src/dst parameters of rte_dmadev_memcpy API, we have
>       two candidates which are iova and void *, how about introduce dma_addr_t
>       type which could be va or iova ?
>
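
The suggestion earlier in this reply, that the copy call return the number of ops accepted, might look like the sketch below. It assumes a burst-style variant, which is not what the quoted prototype proposes; demo_copy_op, demo_ring, demo_memcpy_burst and demo_submit_all are hypothetical names, and the ring is modelled as a free-slot counter only.

```c
#include <stdint.h>

/* Hypothetical description of one copy request. */
struct demo_copy_op {
	const void *src;
	void *dst;
	uint32_t len;
};

/* Hypothetical descriptor ring, reduced to its free-slot count. */
struct demo_ring {
	uint16_t free_slots;
};

/* Accept as many ops as there are free descriptors and return how many
 * were taken; the caller keeps ownership of the rest. */
uint16_t
demo_memcpy_burst(struct demo_ring *r, const struct demo_copy_op *ops,
		  uint16_t nb_ops)
{
	uint16_t n = nb_ops < r->free_slots ? nb_ops : r->free_slots;

	(void)ops; /* a real driver would write n descriptors here */
	r->free_slots -= n;
	return n;
}

/* Caller side: re-submit only the unsent tail, giving up after max_tries
 * attempts. Returns the total number of ops submitted. */
uint16_t
demo_submit_all(struct demo_ring *r, const struct demo_copy_op *ops,
		uint16_t nb_ops, unsigned int max_tries)
{
	uint16_t sent = 0;

	while (sent < nb_ops && max_tries-- > 0)
		sent += demo_memcpy_burst(r, ops + sent, nb_ops - sent);
	return sent;
}
```

The return value doubles as the offset of the first op that still needs submitting, which is what makes the retry loop a one-liner.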
Morten Brørup July 2, 2021, 7:39 a.m. UTC | #63
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 1 July 2021 18.34
> 
> On Thu, Jul 01, 2021 at 08:31:00PM +0530, Jerin Jacob wrote:
> > On Sat, Jun 26, 2021 at 9:29 AM fengchengwen
> <fengchengwen@huawei.com> wrote:
> > >
> > > Hi, all
> > >   I analyzed the current DPAM DMA driver and drew this summary in
> conjunction
> > > with the previous discussion, and this will as a basis for the V2
> implementation.
> > >   Feedback is welcome, thanks
> >
> > Thanks for the write-up.
> >
> > >
> <snip>
> > >      [queue_setup]: create one virt-queue, has following main
> parameters:
> > >          HW-queue-index: the HW-queue index used
> > >          nb_desc: the number of HW descriptors
> > >          opaque: driver's specific info
> > >          Note1: this API return virt-queue index which will used in
> later API.
> > >                 If user want create multiple virt-queue one the
> same HW-queue,
> > >                 they could achieved by call queue_setup with the
> same
> > >                 HW-queue-index.
> > >          Note2: I think it's hard to define queue_setup config
> paramter, and
> > >                 also this is control API, so I think it's OK to use
> opaque
> > >                 pointer to implement it.
> > >       [dma_copy/memset/sg]: all has vq_id input parameter.
> > >          Note: I notice dpaa can't support single and sg in one
> virt-queue, and
> > >                I think it's maybe software implement policy other
> than HW
> > >                restriction because virt-queue could share the same
> HW-queue.
> > >       Here we use vq_id to tackle different scenario, like local-
> to-local/
> > >       local-to-host and etc.
> >
> > IMO, The index representation has an additional overhead as one needs
> > to translate it
> > to memory pointer. I prefer to avoid by having object handle and use
> > _lookup() API get to make it work
> > in multi-process cases to avoid the additional indirection. Like
> mempool object.
> 
> While it doesn't matter to me that much which is chosen, an index seems
> cleaner to me and more consistent with other device types in DPDK. The
> objects pointed to by the memory pointers you refer too can't just be
> stored in an internal array in your driver and accessed directly by
> index,
> saving that layer of redirection?

The rte_eth_rx/tx_burst() functions use the parameter "uint16_t port_id" to identify the device.

rte_eth_rx/tx_burst() needs to look up the rte_eth_dev pointer in the rte_eth_devices array, which could be avoided by using "struct rte_eth_dev *dev" instead of "uint16_t port_id". It would be faster.

However, in the case of rte_eth_rx/tx_burst(), the port_id is rapidly changing at runtime, and often comes from the mbuf or similar, where it is preferable to store a 16-bit value rather than a 64-bit pointer.

I think that the choice of DMA device (and virt-queue) for DMA fast path operations will be much more constant than the choice of port_id (and queue_id) in Ethdev fast path operations. If you agree with this, we should use "struct rte_dma_dev *dev" rather than "uint16_t dev_id" as parameter to the DMA fast path functions.

Going even further, why do we need to pass both dev_id and virt_queue_id to the DMA fast path functions, instead of just passing a pointer to the virt-queue? That pointer could be returned by the rte_dma_queue_setup() function. And it could be opaque or a well defined structure. Or even more exotic: It could be a structure where the first part is common and well defined, and the rest of the structure is driver specific.
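
A sketch of the last variant mentioned above, a queue structure whose first part is common and whose remainder is driver specific. Everything here is hypothetical (demo_dma_vq, demo_drv_vq, demo_queue_setup, demo_dma_copy), and the "DMA" is a plain memcpy so the example stays self-contained.

```c
#include <stdint.h>
#include <string.h>

struct demo_dma_vq;
typedef int (*demo_copy_fn)(struct demo_dma_vq *vq, const void *src,
			    void *dst, uint32_t len);

/* Common, well-defined part: every driver's queue starts with this. */
struct demo_dma_vq {
	demo_copy_fn copy;
	uint16_t dev_id;
	uint16_t vq_id;
};

/* Driver-specific queue; the common part must be the first member so a
 * pointer to one can be converted to a pointer to the other. */
struct demo_drv_vq {
	struct demo_dma_vq common;
	uint32_t nb_copies;
};

int
demo_drv_copy(struct demo_dma_vq *vq, const void *src, void *dst, uint32_t len)
{
	struct demo_drv_vq *dq = (struct demo_drv_vq *)vq;

	memcpy(dst, src, len); /* software stand-in for the DMA engine */
	dq->nb_copies++;
	return 0;
}

/* Control path: initialise caller-provided storage and hand back the
 * pointer that the fast path will use from then on. */
struct demo_dma_vq *
demo_queue_setup(struct demo_drv_vq *storage, uint16_t dev_id, uint16_t vq_id)
{
	storage->common.copy = demo_drv_copy;
	storage->common.dev_id = dev_id;
	storage->common.vq_id = vq_id;
	storage->nb_copies = 0;
	return &storage->common;
}

/* Fast path: one pointer argument replaces the (dev_id, vq_id) pair. */
static inline int
demo_dma_copy(struct demo_dma_vq *vq, const void *src, void *dst, uint32_t len)
{
	return vq->copy(vq, src, dst, len);
}
```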

> 
> > >   5) And the dmadev public data-plane API (just prototype):
> > >      dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len,
> flags)
> > >        -- flags: used as an extended parameter, it could be
> uint32_t
> > >      dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len,
> flags)
> > >      dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len,
> flags)
> > >        -- sg: struct dma_scatterlist array
> > >      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t
> *cookie,
> > >                                    uint16_t nb_cpls, bool
> *has_error)
> > >        -- nb_cpls: indicate max process operations number
> > >        -- has_error: indicate if there is an error
> > >        -- return value: the number of successful completed
> operations.
> > >        -- example:
> > >           1) If there are already 32 completed ops, and 4th is
> error, and
> > >              nb_cpls is 32, then the ret will be 3(because 1/2/3th
> is OK), and
> > >              has_error will be true.
> > >           2) If there are already 32 completed ops, and all
> successful
> > >              completed, then the ret will be min(32, nb_cpls), and
> has_error
> > >              will be false.
> > >           3) If there are already 32 completed ops, and all failed
> completed,
> > >              then the ret will be 0, and has_error will be true.
> >
> > +1. IMO, it is better to call ring_idx instead of a cookie. To
> enforce
> > that it the ring index.
> >
> +1, I like that name better too.

If it is "ring index" then it probably should be an uintXX_t type, and thus the functions cannot return <0 to indicate error.

By "ring index", do you actually mean "descriptor index" (for the DMA engine's descriptors)? Does the application need to process this value as anything but an opaque handle? Wouldn't an opaque pointer type provide better performance? It would also allow using NULL to indicate error.

Do you expect applications to store the cookie in structures that are instantiated many times (e.g. like the port_id is stored in the mbuf structure), so there is an advantage to using an uint16_t instead of a pointer?
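
One way to reconcile an unsigned ring index with error reporting, sketched under assumptions: the status is returned as an int and the index goes through an out-parameter. The names (demo_ring, demo_enqueue, demo_idx_before) are hypothetical, and the ordering helper is only valid while fewer than 32768 ops separate the two indices.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical ring bookkeeping. */
struct demo_ring {
	uint16_t next_idx;  /* index handed to the next enqueued op */
	uint16_t in_flight; /* ops enqueued but not yet completed */
	uint16_t size;      /* number of descriptors */
};

/* Returns 0 and stores the op's ring index in *ring_idx, or -ENOSPC if
 * the ring is full. The index wraps naturally from 65535 to 0. */
int
demo_enqueue(struct demo_ring *r, uint16_t *ring_idx)
{
	if (r->in_flight == r->size)
		return -ENOSPC;
	*ring_idx = r->next_idx++;
	r->in_flight++;
	return 0;
}

/* Non-zero if index a was issued before index b, tolerating wrap-around. */
int
demo_idx_before(uint16_t a, uint16_t b)
{
	return (uint16_t)(a - b) >= 0x8000;
}
```

Storing a uint16_t per in-flight op is cheap, and the comparison stays correct across the wrap because it looks at the distance rather than the raw values.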

> 
> > >      uint16_t rte_dmadev_completed_status(dev_id, vq_id,
> dma_cookie_t *cookie,
> > >                                           uint16_t nb_status,
> uint32_t *status)
> > >        -- return value: the number of failed completed operations.
> >
> > See above. Here we are assuming it is an index otherwise we need to
> > pass an array
> > cookies.
> >
> > >      And here I agree with Morten: we should design API which
> adapts to DPDK
> > >      service scenarios. So we don't support some sound-cards DMA,
> and 2D memory
> > >      copy which mainly used in video scenarios.
> > >   6) The dma_cookie_t is signed int type, when <0 it mean error,
> it's
> > >      monotonically increasing base on HW-queue (other than virt-
> queue). The
> > >      driver needs to make sure this because the dmadev framework
> don't manage
> > >      the dma_cookie's creation.
> >
> > +1 and see above.
> >
> > >   7) Because data-plane APIs are not thread-safe, and user could
> determine
> > >      virt-queue to HW-queue's map (at the queue-setup stage), so it
> is user's
> > >      duty to ensure thread-safe.
> >
> > +1. But I am not sure how easy for the fast-path application to have
> this logic,
> > Instead, I think, it is better to tell the capa for queue by driver
> > and in channel configuration,
> > the application can request for requirement (Is multiple producers
> enq
> > to the same HW queue or not).
> > Based on the request, the implementation can pick the correct
> function
> > pointer for enq.(lock vs lockless version if HW does not support
> > lockless)
> >
> 
> Non-thread safety is the norm in DPDK for all other queue resources,
> however, having multi-thread safety as a capability sounds reasonable.
> >
> > >   8) One example:
> > >      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
> > >      if (vq_id < 0) {
> > >         // create virt-queue failed
> > >         return;
> > >      }
> > >      // submit memcpy task
> > >      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
> > >      if (cookie < 0) {
> > >         // submit failed
> > >         return;
> > >      }
> > >      // get complete task
> > >      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, has_error);
> > >      if (!has_error && ret == 1) {
> > >         // the memcpy successful complete
> > >      }
> >
> > +1
> >
> > >   9) As octeontx2_dma support sg-list which has many valid buffers
> in
> > >   dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
> >
> > +1
> >
> > >   10) As ioat, it could declare support one HW-queue at dev_configure stage, and
> > >       only support create one virt-queue.
> > >   11) As dpaa2_qdma, I think it could migrate to new framework, but still wait
> > >       for dpaa2_qdma guys feedback.
> > >   12) About the prototype src/dst parameters of rte_dmadev_memcpy API, we have
> > >       two candidates which are iova and void *, how about introduce dma_addr_t
> > >       type which could be va or iova ?
> >
> > I think, conversion looks ugly, better to have void * and share the
> > constraints of void * as limitation/capability using flag. So that
> driver
> > can update it.
> >
> I'm ok with either rte_iova_t or void * as parameter type. Let's not
> define a new type though,

+1

> and +1 to just using capabilities to define what kinds
> of addressing are supported by the device instances.

+1

> 
> /Bruce

And regarding naming: Consider rte_dma_xx() instead of rte_dmadev_xx().

-Morten
Bruce Richardson July 2, 2021, 10:05 a.m. UTC | #64
On Fri, Jul 02, 2021 at 09:39:10AM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com] Sent:
> > Thursday, 1 July 2021 18.34
> > 
> > On Thu, Jul 01, 2021 at 08:31:00PM +0530, Jerin Jacob wrote:
> > > On Sat, Jun 26, 2021 at 9:29 AM fengchengwen
> > <fengchengwen@huawei.com> wrote:
> > > >
> > > > Hi, all I analyzed the current DPAM DMA driver and drew this
> > > > summary in
> > conjunction
> > > > with the previous discussion, and this will as a basis for the V2
> > implementation.
> > > >   Feedback is welcome, thanks
> > >
> > > Thanks for the write-up.
> > >
> > > >
> > <snip>
> > > > >      [queue_setup]: create one virt-queue, has following main parameters:
> > > > >          HW-queue-index: the HW-queue index used
> > > > >          nb_desc: the number of HW descriptors
> > > > >          opaque: driver's specific info
> > > > >          Note1: this API return virt-queue index which will used in later API.
> > > > >                 If user want create multiple virt-queue on the same HW-queue,
> > > > >                 they could achieved by call queue_setup with the same
> > > > >                 HW-queue-index.
> > > > >          Note2: I think it's hard to define queue_setup config parameter, and
> > > > >                 also this is control API, so I think it's OK to use opaque
> > > > >                 pointer to implement it.
> > > > >       [dma_copy/memset/sg]: all has vq_id input parameter.
> > > > >          Note: I notice dpaa can't support single and sg in one virt-queue, and
> > > > >                I think it's maybe software implement policy other than HW
> > > > >                restriction because virt-queue could share the same HW-queue.
> > > > >       Here we use vq_id to tackle different scenario, like local-to-local/
> > > > >       local-to-host and etc.
> > >
> > > IMO, The index representation has an additional overhead as one needs
> > > to translate it to memory pointer. I prefer to avoid by having object
> > > handle and use _lookup() API get to make it work in multi-process
> > > cases to avoid the additional indirection. Like
> > mempool object.
> > 
> > While it doesn't matter to me that much which is chosen, an index seems
> > cleaner to me and more consistent with other device types in DPDK. The
> > objects pointed to by the memory pointers you refer too can't just be
> > stored in an internal array in your driver and accessed directly by
> > index, saving that layer of redirection?
> 
> The rte_eth_rx/tx_burst() functions use the parameter "uint16_t port_id"
> to identify the device.
> 
> rte_eth_rx/tx_burst() needs to look up the rte_eth_dev pointer in the
> rte_eth_devices array, which could be avoided by using "struct
> rte_eth_dev *dev" instead of "uint16_t port_id". It would be faster.
> 
Actually, it looks up the structure directly: the structures are all in one
array and there is no ethdev pointer array, so when passing the structure
pointer to the individual functions there is just a little bit of
arithmetic to convert the index to a pointer.
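
As a rough sketch of that index-to-pointer arithmetic (the names
`sketch_dev`, `sketch_devices` and `sketch_dev_from_id` are made up for
illustration and are not the real ethdev definitions):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_DEVS 32 /* hypothetical limit, for illustration only */

/* Hypothetical per-device structure standing in for a real device struct. */
struct sketch_dev {
	void *private_data;
	uint16_t nb_queues;
};

/* The structures themselves live in one array; there is no pointer table. */
static struct sketch_dev sketch_devices[SKETCH_MAX_DEVS];

/* Index to pointer is just base + dev_id * sizeof(struct sketch_dev). */
static inline struct sketch_dev *
sketch_dev_from_id(uint16_t dev_id)
{
	assert(dev_id < SKETCH_MAX_DEVS);
	return &sketch_devices[dev_id];
}
```

No memory load is needed to get from the id to the structure; the extra
lookup mentioned next only appears once a per-queue pointer stored inside
the structure has to be fetched.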

In the case of RX/TX fastpath functions, yes, there is an additional lookup
because the individual queue on each ethdev needs to have its pointer
looked up.

> However, in the case of rte_eth_rx/tx_burst(), the port_id is rapidly changing at runtime, and often comes from the mbuf or similar, where it is preferable storing a 16 bit value rather than a 64 bit pointer.
> 
> I think that the choice of DMA device (and virt-queue) for DMA fast path operations will be much more constant than the choice of port_id (and queue_id) in Ethdev fast path operations. If you agree with this, we should use "struct rte_dma_dev *dev" rather than "uint16_t dev_id" as parameter to the DMA fast path functions.
> 
> Going even further, why do we need to pass both dev_id and virt_queue_id to the DMA fast path functions, instead of just passing a pointer to the virt-queue? That pointer could be returned by the rte_dma_queue_setup() function. And it could be opaque or a well defined structure. Or even more exotic: It could be a structure where the first part is common and well defined, and the rest of the structure is driver specific.
>
That is an interesting possibility, and I actually quite like the idea. I'm
not sure about having the well-defined common part, so I'd suggest
initially using a queue pointer typedefed to "void *". The "queue_setup"
(and virt-queue/channel setup) functions can return queue pointers which
would then be used as first parameter to the dataplane functions. In the
case of the copy engines on our Intel platforms which don't have this
virt-queue concept that pointer can just point to the dmadev private
structure directly, and make the APIs a little more efficient.
 
> > 
> > > >   5) And the dmadev public data-plane API (just prototype):
> > > >      dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len,
> > flags)
> > > >        -- flags: used as an extended parameter, it could be
> > uint32_t
> > > >      dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len,
> > flags)
> > > >      dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len,
> > flags)
> > > >        -- sg: struct dma_scatterlist array
> > > >      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t
> > *cookie,
> > > >                                    uint16_t nb_cpls, bool
> > *has_error)
> > > >        -- nb_cpls: indicate max process operations number
> > > >        -- has_error: indicate if there is an error
> > > >        -- return value: the number of successful completed
> > operations.
> > > >        -- example:
> > > >           1) If there are already 32 completed ops, and 4th is
> > error, and
> > > >              nb_cpls is 32, then the ret will be 3(because 1/2/3th
> > is OK), and
> > > >              has_error will be true.
> > > >           2) If there are already 32 completed ops, and all
> > successful
> > > >              completed, then the ret will be min(32, nb_cpls), and
> > has_error
> > > >              will be false.
> > > >           3) If there are already 32 completed ops, and all failed
> > completed,
> > > >              then the ret will be 0, and has_error will be true.
> > >
> > > +1. IMO, it is better to call ring_idx instead of a cookie. To
> > enforce
> > > that it the ring index.
> > >
> > +1, I like that name better too.
> 
> If it is "ring index" then it probably should be an uintXX_t type, and thus the functions cannot return <0 to indicate error.
> 

The ring index is defined to be a uint16_t type, allowing the enqueue
functions to return int, i.e. negative on error, or uint16_t index on
success.
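
A minimal sketch of that return convention, assuming a made-up queue
structure and function name (`sketch_queue`, `sketch_enqueue_copy`); the
real prototype is still under discussion:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RING_SIZE 4 /* tiny on purpose so "ring full" is easy to hit */

/* Hypothetical queue state, not taken from any driver. */
struct sketch_queue {
	uint16_t next_idx;  /* index the next job will get; wraps at 65535 */
	uint16_t in_flight; /* jobs submitted but not yet completed */
};

/*
 * Returns the job's ring index (0..UINT16_MAX) on success, or a negative
 * errno value on failure. An int return can carry both without ambiguity.
 */
static int
sketch_enqueue_copy(struct sketch_queue *q)
{
	assert(q != NULL);
	if (q->in_flight == SKETCH_RING_SIZE)
		return -ENOSPC;
	q->in_flight++;
	return q->next_idx++; /* uint16_t promoted to int: never negative */
}
```

The caller tests `ret < 0` for failure and otherwise casts the value back
to uint16_t to use as the job index.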

> By "ring index", do you actually mean "descriptor index" (for the DMA engine's descriptors)? Does the application need to process this value as anything but an opaque handle? Wouldn't an opaque pointer type provide performance? It would also allow using NULL to indicate error.
> 
> Do you expect applications to store the cookie in structures that are instantiated many times (e.g. like the port_id is stored in the mbuf structure), so there is an advantage to using an uint16_t instead of a pointer?
> 

This idea of using indexes rather than pointer handles comes from my
experience with the ioat driver. In the initial prototype versions of that
driver (written longer ago than I like to remember), I had it work as you
describe, managing a single pointer for each job. However, when I tried
integrating it into a simple application for copying packets, I discovered
that for many cases we need two opaque handles - one for the source buffer,
one for the destination buffer. That is what was implemented in the
upstreamed ioat driver.

However, no sooner was that added, than a request came from those looking
to integrate copy offload into vhost to remove the use of those handles as
their design did not require them. Other integrations in that area have
used other types of handles between 8 and 16 bytes.

The other consideration in this is also the format of the metadata. If for
each transaction we have a handle which points to a structure of metadata
for that transaction it may be sub-optimal. Taking again the simple case of
a source mbuf and destination mbuf, if you store a pointer to both of these
in a struct and the pointer to that struct as the opaque handle you pass to
the driver, when a job is completed and you get back the relevant handle(s)
the completed elements are not contiguous in the format you want them. The
source buffers may all need to be freed en masse and the destination buffers
may need to be enqueued to another core or sent to NIC TX. In both these
cases you essentially need to regather the source and destination buffers
into flat arrays to call mempool_put_bulk() or tx_burst() on. Therefore,
a better arrangement is to have source and dest pointers stored in parallel
arrays, so on return of a set of jobs no gathering of handles is needed.
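
To make the parallel-array arrangement concrete, here is a small sketch.
The ring size, the array names and both helper functions are assumptions
for illustration only, not part of any proposed API:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_RING_SIZE 1024 /* hypothetical; must be a power of two */
#define SKETCH_RING_MASK (SKETCH_RING_SIZE - 1)

/* Parallel arrays: slot i of each array describes the job with ring index i. */
static void *src_bufs[SKETCH_RING_SIZE];
static void *dst_bufs[SKETCH_RING_SIZE];

/* Record the buffers for a job, keyed by the index the enqueue returned. */
static void
sketch_track_job(uint16_t ring_idx, void *src, void *dst)
{
	src_bufs[ring_idx & SKETCH_RING_MASK] = src;
	dst_bufs[ring_idx & SKETCH_RING_MASK] = dst;
}

/*
 * Jobs first_idx .. first_idx + n - 1 have completed. Return a pointer to
 * the first slot and, in *run_len, how many slots are contiguous before the
 * array wraps. The caller repeats once from slot 0 if run_len < n.
 */
static void **
sketch_contig_run(void **array, uint16_t first_idx, uint16_t n,
		  uint16_t *run_len)
{
	uint16_t start = first_idx & SKETCH_RING_MASK;
	uint16_t to_end = (uint16_t)(SKETCH_RING_SIZE - start);

	assert(n <= SKETCH_RING_SIZE);
	*run_len = n < to_end ? n : to_end;
	return &array[start];
}
```

The run returned for `src_bufs` can go straight to a bulk free and the run
for `dst_bufs` straight to a burst TX call, with no per-job gathering step.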

So, in summary, using indexes rather than opaque handles allows the end
user/application to choose the data format most relevant for data
management for the job - and ensures a minimal memory footprint. It also
has the advantages of:
* reducing the number of parameters to the functions, meaning fewer
  registers need to be set up for each function call (or that we don't
  overflow the register count per-function and start needing to write
  parameters to stack)
* We save on stores done by the completion call, as well as memory
  allocation needed for the return status parameter. If 32 jobs
  enqueued are completed, we can just return that 32, along with the
  info that the last job-id done was N. We save on 32x8-byte stores
  for returning opaque pointers for those jobs.

All of these should improve performance by reducing our offload cost.
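
A sketch of what such a completion call could look like, assuming made-up
names (`sketch_cq`, `sketch_completed`) and modelling the hardware as a
simple counter:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical completion-side state, not taken from any driver. */
struct sketch_cq {
	uint16_t next_done; /* ring index of the oldest job not yet reported */
	uint16_t hw_done;   /* jobs the hardware finished, not yet reported */
};

/*
 * Report up to max_ops finished jobs. Only two values are written: the
 * count (return value) and the index of the last job done (*last_idx).
 * No per-job handle is stored, whatever the size of the burst.
 */
static uint16_t
sketch_completed(struct sketch_cq *q, uint16_t max_ops, uint16_t *last_idx)
{
	uint16_t n;

	assert(q != NULL && last_idx != NULL);
	n = q->hw_done < max_ops ? q->hw_done : max_ops;
	if (n == 0)
		return 0;
	q->hw_done -= n;
	q->next_done += n; /* uint16_t arithmetic wraps at 65536 */
	*last_idx = (uint16_t)(q->next_done - 1);
	return n;
}
```

The application recovers the completed range as the `n` indexes ending at
`last_idx`, and uses them to address its own per-job arrays.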

> > 
> > > >      uint16_t rte_dmadev_completed_status(dev_id, vq_id,
> > dma_cookie_t *cookie,
> > > >                                           uint16_t nb_status,
> > uint32_t *status)
> > > >        -- return value: the number of failed completed operations.
> > >
> > > See above. Here we are assuming it is an index otherwise we need to
> > > pass an array
> > > cookies.
> > >
> > > >      And here I agree with Morten: we should design API which
> > adapts to DPDK
> > > >      service scenarios. So we don't support some sound-cards DMA,
> > and 2D memory
> > > >      copy which mainly used in video scenarios.
> > > >   6) The dma_cookie_t is signed int type, when <0 it mean error,
> > it's
> > > >      monotonically increasing base on HW-queue (other than virt-
> > queue). The
> > > > >      driver needs to make sure this because the dmadev framework
> > don't manage
> > > >      the dma_cookie's creation.
> > >
> > > +1 and see above.
> > >
> > > >   7) Because data-plane APIs are not thread-safe, and user could
> > determine
> > > >      virt-queue to HW-queue's map (at the queue-setup stage), so it
> > is user's
> > > >      duty to ensure thread-safe.
> > >
> > > +1. But I am not sure how easy for the fast-path application to have
> > this logic,
> > > Instead, I think, it is better to tell the capa for queue by driver
> > > and in channel configuration,
> > > the application can request for requirement (Is multiple producers
> > enq
> > > to the same HW queue or not).
> > > Based on the request, the implementation can pick the correct
> > function
> > > pointer for enq.(lock vs lockless version if HW does not support
> > > lockless)
> > >
> > 
> > Non-thread safety is the norm in DPDK for all other queue resources,
> > however, having multi-thread safety as a capability sounds reasonable.
> > >
> > > > >   8) One example:
> > > > >      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
> > > > >      if (vq_id < 0) {
> > > > >         // create virt-queue failed
> > > > >         return;
> > > > >      }
> > > > >      // submit memcpy task
> > > > >      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
> > > > >      if (cookie < 0) {
> > > > >         // submit failed
> > > > >         return;
> > > > >      }
> > > > >      // get complete task
> > > > >      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, has_error);
> > > > >      if (!has_error && ret == 1) {
> > > > >         // the memcpy successful complete
> > > > >      }
> > >
> > > +1
> > >
> > > >   9) As octeontx2_dma support sg-list which has many valid buffers
> > in
> > > >   dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
> > >
> > > +1
> > >
> > > > >   10) As ioat, it could declare support one HW-queue at dev_configure stage, and
> > > > >       only support create one virt-queue.
> > > > >   11) As dpaa2_qdma, I think it could migrate to new framework, but still wait
> > > > >       for dpaa2_qdma guys feedback.
> > > > >   12) About the prototype src/dst parameters of rte_dmadev_memcpy API, we have
> > > > >       two candidates which are iova and void *, how about introduce dma_addr_t
> > > > >       type which could be va or iova ?
> > >
> > > I think, conversion looks ugly, better to have void * and share the
> > > constraints of void * as limitation/capability using flag. So that
> > driver
> > > can update it.
> > >
> > I'm ok with either rte_iova_t or void * as parameter type. Let's not
> > define a new type though,
> 
> +1
> 
> > and +1 to just using capabilities to define what kinds
> > of addressing are supported by the device instances.
> 
> +1
> 
> > 
> > /Bruce
> 
> And regarding naming: Consider rte_dma_xx() instead of rte_dmadev_xx().

Definite +1.
fengchengwen July 2, 2021, 1:31 p.m. UTC | #65
On 2021/6/28 18:00, Bruce Richardson wrote:
>>   4) The driver's ops design (here we only list key points):
>>      [dev_info_get]: mainly return the number of HW-queues
>>      [dev_configure]: nothing important
>>      [queue_setup]: create one virt-queue, has following main parameters:
>>          HW-queue-index: the HW-queue index used
>>          nb_desc: the number of HW descriptors
>>          opaque: driver's specific info
>>          Note1: this API return virt-queue index which will used in later API.
>>                 If user want create multiple virt-queue one the same HW-queue,
>>                 they could achieved by call queue_setup with the same
>>                 HW-queue-index.
>>          Note2: I think it's hard to define queue_setup config parameter, and
>>                 also this is control API, so I think it's OK to use opaque
>>                 pointer to implement it.
> I'm not sure opaque pointer will work in practice, so I think we should try
> and standardize the parameters as much as possible. Since it's a control
> plane API, using a struct with a superset of parameters may be workable.
> Let's start with a minimum set and build up from there.

I tried to standardize a few parameters, which you can see in the new patch.

>>      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
>>                                    uint16_t nb_cpls, bool *has_error)
>>        -- nb_cpls: indicate max process operations number
>>        -- has_error: indicate if there is an error
>>        -- return value: the number of successful completed operations.
>>        -- example:
>>           1) If there are already 32 completed ops, and 4th is error, and
>>              nb_cpls is 32, then the ret will be 3(because 1/2/3th is OK), and
>>              has_error will be true.
>>           2) If there are already 32 completed ops, and all successful
>>              completed, then the ret will be min(32, nb_cpls), and has_error
>>              will be false.
>>           3) If there are already 32 completed ops, and all failed completed,
>>              then the ret will be 0, and has_error will be true.
> +1 for this
> 
>>      uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
>>                                           uint16_t nb_status, uint32_t *status)
>>        -- return value: the number of failed completed operations.
>>      And here I agree with Morten: we should design API which adapts to DPDK
>>      service scenarios. So we don't support some sound-cards DMA, and 2D memory
>>      copy which mainly used in video scenarios.
> 
> Can I suggest a few adjustments here to the semantics of this API. In
> future we may have operations which return a status value, e.g. our
> hardware can support ops like compare equal/not-equal, which means that
> this API would be meaningful even in case of success. Therefore, I suggest
> that the return value be changed to allow success also to be returned in
> the array, and the return value is not the number of failed ops, but the
> number of ops for which status is being returned.
> 
> Also for consideration: when trying to implement this in a prototype in our
> driver, it would be easier if we relax the restriction on the "completed"
> API so that we can flag has_error when an error is detected rather than
> guaranteeing to return all elements right up to the error. For example, if
> we have a burst of packets and one is problematic, it may be easier to flag
> the error at the start of the burst and then have a few successful entries
> at the start of the completed_status array. [Separate from this] We should
> also have a "has_error" or "more_errors" flag on this API too, to indicate
> when the user can switch back to using the regular "completed" API. This
> means that apps switch from one API to the other when "has_error" is true,
> and only switch back when it becomes false again.
> 

We've discussed this before, and I prefer a relatively straightforward API,
so in the new version I'll explicitly name it as rte_dmadev_completed_fails.

We can continue this on the new patch, and I think that's probably the biggest
difference.
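
For reference, the "completed" semantics from the three examples quoted
above can be modelled as follows. This is only a sketch: the per-op status
array and the function name are assumptions, and it assumes an error beyond
nb_cpls is not reported until a later call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * status[i] == 0 means op i succeeded (hypothetical model of what the
 * hardware reported for nb_done finished ops). Returns the number of
 * successful ops before the first failure, capped at nb_cpls, and sets
 * *has_error if a failure was hit within that window.
 */
static uint16_t
sketch_completed(const uint32_t *status, uint16_t nb_done,
		 uint16_t nb_cpls, bool *has_error)
{
	uint16_t limit = nb_done < nb_cpls ? nb_done : nb_cpls;
	uint16_t i;

	assert(status != NULL && has_error != NULL);
	*has_error = false;
	for (i = 0; i < limit; i++) {
		if (status[i] != 0) {
			*has_error = true; /* stop at the first failed op */
			break;
		}
	}
	return i;
}
```

With 32 finished ops and the 4th failed this returns 3 with has_error set;
with all 32 successful it returns min(32, nb_cpls); with all failed it
returns 0 with has_error set.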
fengchengwen July 2, 2021, 1:45 p.m. UTC | #66
On 2021/7/1 23:01, Jerin Jacob wrote:
>>   [key point]:
>>       -----------    -----------
>>       | channel |    | channel |
>>       -----------    -----------
>>              \           /
>>               \         /
>>                \       /
>>              ------------
>>              | HW-queue |
>>              ------------
>>                    |
>>                 --------
>>                 |rawdev|
>>                 --------
>>       1) User could create one channel by init context(dpi_dma_queue_ctx_s),
>>          this interface is not standardized and needs to be implemented by
>>          users.
>>       2) Different channels can support different transmissions, e.g. one for
>>          inner m2m, and other for inbound copy.
>>
>>       Overall, I think the 'channel' is similar the 'virt-queue' of dpaa2_qdma.
>>       The difference is that dpaa2_qdma supports multiple hardware queues. The
>>       'channel' has following
> 
> If dpaa2_qdma supports more than one HW queue, I think, it is good to
> have the queue notion
> in DPDK just like other DPDK device classes. It will be good to have
> confirmation from dpaa2 folks, @Hemant Agrawal,
> if there are really more than 1 HW queue in dpaa device.
> 
> 
> IMO, Channel is a better name than a virtual queue. The reason is,
> virtual queue is more
> implementation-specific notation. No need to have this in API specification.
> 

In the DPDK framework, many data-plane API names contain "queue", e.g. eventdev/crypto,
so the concept of virt-queues has continuity.

>>       [dma_copy/memset/sg]: all has vq_id input parameter.
>>          Note: I notice dpaa can't support single and sg in one virt-queue, and
>>                I think it's maybe software implement policy other than HW
>>                restriction because virt-queue could share the same HW-queue.
>>       Here we use vq_id to tackle different scenario, like local-to-local/
>>       local-to-host and etc.
> 
> IMO, The index representation has an additional overhead as one needs
> to translate it
> to memory pointer. I prefer to avoid by having object handle and use
> _lookup() API get to make it work
> in multi-process cases to avoid the additional indirection. Like mempool object.
> 

This solution was considered first; it is similar to rte_hash returning a handle.
It is not intuitive and has no obvious performance advantage: the number of
jumps on the data-plane path is not reduced compared with the index-driven
callback function.

>>   5) And the dmadev public data-plane API (just prototype):
>>      dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
>>        -- flags: used as an extended parameter, it could be uint32_t
>>      dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)
>>      dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
>>        -- sg: struct dma_scatterlist array
>>      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
>>                                    uint16_t nb_cpls, bool *has_error)
>>        -- nb_cpls: indicate max process operations number
>>        -- has_error: indicate if there is an error
>>        -- return value: the number of successful completed operations.
>>        -- example:
>>           1) If there are already 32 completed ops, and 4th is error, and
>>              nb_cpls is 32, then the ret will be 3(because 1/2/3th is OK), and
>>              has_error will be true.
>>           2) If there are already 32 completed ops, and all successful
>>              completed, then the ret will be min(32, nb_cpls), and has_error
>>              will be false.
>>           3) If there are already 32 completed ops, and all failed completed,
>>              then the ret will be 0, and has_error will be true.
> 
> +1. IMO, it is better to call ring_idx instead of a cookie. To enforce
> that it the ring index.
> 
>>      uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
>>                                           uint16_t nb_status, uint32_t *status)
>>        -- return value: the number of failed completed operations.
> 
> See above. Here we are assuming it is an index otherwise we need to
> pass an array
> cookies.
> 
>>      And here I agree with Morten: we should design API which adapts to DPDK
>>      service scenarios. So we don't support some sound-cards DMA, and 2D memory
>>      copy which mainly used in video scenarios.
>>   6) The dma_cookie_t is signed int type, when <0 it mean error, it's
>>      monotonically increasing base on HW-queue (other than virt-queue). The
>>      driver needs to make sure this because the dmadev framework don't manage
>>      the dma_cookie's creation.
> 
> +1 and see above.
> 
>>   7) Because data-plane APIs are not thread-safe, and user could determine
>>      virt-queue to HW-queue's map (at the queue-setup stage), so it is user's
>>      duty to ensure thread-safe.
> 
> +1. But I am not sure how easy for the fast-path application to have this logic,
> Instead, I think, it is better to tell the capa for queue by driver
> and in channel configuration,
> the application can request for requirement (Is multiple producers enq
> to the same HW queue or not).
> Based on the request, the implementation can pick the correct function
> pointer for enq.(lock vs lockless version if HW does not support
> lockless)
> 

Already redesigned; please check the latest patch.

> 
>>   8) One example:
>>      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
>>      if (vq_id < 0) {
>>         // create virt-queue failed
>>         return;
>>      }
>>      // submit memcpy task
>>      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
>>      if (cookie < 0) {
>>         // submit failed
>>         return;
>>      }
>>      // get complete task
>>      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, has_error);
>>      if (!has_error && ret == 1) {
>>         // the memcpy successful complete
>>      }
> 
> +1
> 
>>   9) As octeontx2_dma support sg-list which has many valid buffers in
>>      dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
> 
> +1
> 
>>   10) As ioat, it could declare support one HW-queue at dev_configure stage, and
>>       only support create one virt-queue.
>>   11) As dpaa2_qdma, I think it could migrate to new framework, but still wait
>>       for dpaa2_qdma guys feedback.
>>   12) About the prototype src/dst parameters of rte_dmadev_memcpy API, we have
>>       two candidates which are iova and void *, how about introduce dma_addr_t
>>       type which could be va or iova ?
> 
> I think, conversion looks ugly, better to have void * and share the
> constraints of void *
> as limitation/capability using flag. So that driver can update it.
> 

Already changed to void *.

> 
>>
> 
> .
>
fengchengwen July 2, 2021, 1:59 p.m. UTC | #67
On 2021/7/2 15:07, Liang Ma wrote:
>>   8) One example:
>>      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
>>      if (vq_id < 0) {
>>         // create virt-queue failed
>>         return;
>>      }
>>      // submit memcpy task
>>      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
>>      if (cookie < 0) {
>>         // submit failed
>>         return;
>>      }
> IMO
> rte_dmadev_memcpy should return ops number successfully submitted
> that's easier to do re-submit if previous session is not fully
> submitted.

Hmm, I didn't get your point.

You are welcome to review the new patch, thanks.

>>      // get complete task
>>      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, has_error);
>>      if (!has_error && ret == 1) {
>>         // the memcpy successful complete
>>      }
>>   9) As octeontx2_dma support sg-list which has many valid buffers in
>>      dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
>>   10) As ioat, it could declare support one HW-queue at dev_configure stage, and
>>       only support create one virt-queue.
>>   11) As dpaa2_qdma, I think it could migrate to new framework, but still wait
>>       for dpaa2_qdma guys feedback.
>>   12) About the prototype src/dst parameters of rte_dmadev_memcpy API, we have
>>       two candidates which are iova and void *, how about introduce dma_addr_t
>>       type which could be va or iova ?
>>
> 
> .
>
Morten Brørup July 2, 2021, 2:57 p.m. UTC | #68
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of fengchengwen
> Sent: Friday, 2 July 2021 15.45
> 
> On 2021/7/1 23:01, Jerin Jacob wrote:
> >>   [key point]:
> >>       -----------    -----------
> >>       | channel |    | channel |
> >>       -----------    -----------
> >>              \           /
> >>               \         /
> >>                \       /
> >>              ------------
> >>              | HW-queue |
> >>              ------------
> >>                    |
> >>                 --------
> >>                 |rawdev|
> >>                 --------
> >>       1) User could create one channel by init
> context(dpi_dma_queue_ctx_s),
> >>          this interface is not standardized and needs to be
> implemented by
> >>          users.
> >>       2) Different channels can support different transmissions,
> e.g. one for
> >>          inner m2m, and other for inbound copy.
> >>
> >>       Overall, I think the 'channel' is similar the 'virt-queue' of
> dpaa2_qdma.
> >>       The difference is that dpaa2_qdma supports multiple hardware
> queues. The
> >>       'channel' has following
> >
> > If dpaa2_qdma supports more than one HW queue, I think, it is good to
> > have the queue notion
> > in DPDK just like other DPDK device classes. It will be good to have
> > confirmation from dpaa2 folks, @Hemant Agrawal,
> > if there are really more than 1 HW queue in dppa device.
> >
> >
> > IMO, Channel is a better name than a virtual queue. The reason is,
> > virtual queue is more
> > implementation-specific notation. No need to have this in API
> specification.
> >
> 
> In the DPDK framework, many data-plane API names contain queues. e.g.
> eventdev/crypto..
> The concept of virt queues has continuity.

I was also wondering about the name "virtual queue".

Usually, something "virtual" would be an abstraction of something physical, e.g. a software layer on top of something physical.

Back in the day, a "DMA channel" used to mean a DMA engine on a CPU. If a CPU had 2 DMA channels, they could both be set up simultaneously.

The current design has the "dmadev" representing a CPU or other chip, which has one or more "HW-queues" representing DMA channels (of the same type), and then "virt-queue" as a software abstraction on top, for using a DMA channel in different ways through individually configured contexts (virt-queues).

It makes sense to me, although I would consider renaming "HW-queue" to "channel" and perhaps "virt-queue" to "queue".

These names are not important to me. You can keep them or change them; I am happy any way.

But the names used for functions, types and parameters need to be cleaned up and match the names used in the documentation. E.g. rte_dmadev_queue_setup() does not set up a queue, it sets up a virt-queue, so that function name needs to be corrected.

Also, the rte_ prefix is missing in a few places, e.g. struct dma_scatterlist and enum dma_address_type. Obviously not important for this high level discussion based on draft source code, but important for the final implementation.

-Morten
fengchengwen July 3, 2021, 12:32 a.m. UTC | #69
On 2021/7/2 22:57, Morten Brørup wrote:
>> In the DPDK framework, many data-plane API names contain queues. e.g.
>> eventdev/crypto..
>> The concept of virt queues has continuity.
> 
> I was also wondering about the name "virtual queue".
> 
> Usually, something "virtual" would be an abstraction of something physical, e.g. a software layer on top of something physical.
> 
> Back in the days, a "DMA channel" used to mean a DMA engine on a CPU. If a CPU had 2 DMA channels, they could both be set up simultaneously.
> 
> The current design has the "dmadev" representing a CPU or other chip, which has one or more "HW-queues" representing DMA channels (of the same type), and then "virt-queue" as a software abstraction on top, for using a DMA channel in different ways through individually configured contexts (virt-queues).
> 
> It makes sense to me, although I would consider renaming "HW-queue" to "channel" and perhaps "virt-queue" to "queue".

'DMA channel' is a more widely used term than 'DMA queue'; Google shows at least 20 times more hits for it.

It's a good idea to build the abstraction layers as: queue <> channel <> dma-controller.
That way, the meaning of each layer is relatively easy to tell from its name.

Will fix in V2.

> 
> These names are not important to me. You can keep them or change them; I am happy any way.
> 
> But the names used for functions, types and parameters need to be cleaned up and match the names used in the documentation. E.g. rte_dmadev_queue_setup() does not set up a queue, it sets up a virt-queue, so that function name needs to be corrected.
> 
> Also, the rte_ prefix is missing in a few places, e.g. struct dma_scatterlist and enum dma_address_type. Obviously not important for this high level discussion based on draft source code, but important for the final implementation.

will fix in V2

> 
> -Morten
>
Morten Brørup July 3, 2021, 8:53 a.m. UTC | #70
> From: fengchengwen [mailto:fengchengwen@huawei.com]
> Sent: Saturday, 3 July 2021 02.32
> 
> On 2021/7/2 22:57, Morten Brørup wrote:
> >> In the DPDK framework, many data-plane API names contain queues.
> e.g.
> >> eventdev/crypto..
> >> The concept of virt queues has continuity.
> >
> > I was also wondering about the name "virtual queue".
> >
> > Usually, something "virtual" would be an abstraction of something
> physical, e.g. a software layer on top of something physical.
> >
> > Back in the days, a "DMA channel" used to mean a DMA engine on a CPU.
> If a CPU had 2 DMA channels, they could both be set up simultaneously.
> >
> > The current design has the "dmadev" representing a CPU or other chip,
> which has one or more "HW-queues" representing DMA channels (of the
> same type), and then "virt-queue" as a software abstraction on top, for
> using a DMA channel in different ways through individually configured
> contexts (virt-queues).
> >
> > It makes sense to me, although I would consider renaming "HW-queue"
> to "channel" and perhaps "virt-queue" to "queue".
> 
> The 'DMA channel' is more used than 'DMA queue', at least google show
> that there are at least 20+ times more.
> 
> It's a good idea build the abstraction layer: queue <> channel <> dma-
> controller.
> In this way, the meaning of each layer is relatively easy to
> distinguish literally.
> 
> will fix in V2
> 

After re-reading all the mails in this thread, I have found one more important high level detail still not decided:

Bruce had suggested flattening the DMA channels, so each dmadev represents a DMA channel. And DMA controllers with multiple DMA channels will have to instantiate multiple dmadevs, one for each DMA channel.

Just like a four port NIC instantiates four ethdevs.

Then, like ethdevs, there would only be two abstraction layers: dmadev <> queue, where a dmadev is a DMA channel on a DMA controller.

However, this assumes that the fast path functions on the individual DMA channels of a DMA controller can be accessed completely independently and simultaneously by multiple threads. (Otherwise, the driver would need to implement critical regions or locking around accessing the common registers in the DMA controller shared by the DMA channels.)

Unless any of the DMA controller vendors claim that this assumption about independence of the DMA channels is wrong, I strongly support Bruce's flattening suggestion.
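
To make the flattening idea concrete, here is a minimal sketch in C of a controller registering one dmadev per DMA channel, the way a four-port NIC shows up as four ethdevs. Every name in it (demo_dmadev, demo_dmadev_table, demo_probe_controller, DEMO_MAX_DMADEVS) is hypothetical and made up for illustration; none of it is taken from the RFC or from any driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_DMADEVS 16u /* hypothetical table size */

/* One flattened dmadev: exactly one DMA channel of one controller. */
struct demo_dmadev {
    uint16_t ctrl_id; /* DMA controller that owns the channel */
    uint16_t chan_id; /* channel index inside that controller */
};

/* The index into devs[] plays the role of the dev_id. */
struct demo_dmadev_table {
    struct demo_dmadev devs[DEMO_MAX_DMADEVS];
    uint16_t count;
};

/*
 * Register every channel of a controller as its own dmadev.
 * Returns the dev_id of the first dmadev created, or -1 if the
 * table cannot hold all of the controller's channels.
 */
static int demo_probe_controller(struct demo_dmadev_table *t,
                                 uint16_t ctrl_id, uint16_t nb_channels)
{
    assert(t != NULL);
    if (nb_channels > DEMO_MAX_DMADEVS - t->count)
        return -1;

    int first = t->count;
    for (uint16_t c = 0; c < nb_channels; c++) {
        t->devs[t->count].ctrl_id = ctrl_id;
        t->devs[t->count].chan_id = c;
        t->count++;
    }
    return first;
}
```

With this layout, a controller with four channels occupies four consecutive dev_ids. Whether the fast path of those four dmadevs can really run without a shared lock is exactly the hardware question raised above; the sketch does not answer it.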

-Morten
Jerin Jacob July 3, 2021, 9:08 a.m. UTC | #71
On Sat, Jul 3, 2021 at 2:23 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: fengchengwen [mailto:fengchengwen@huawei.com]
> > Sent: Saturday, 3 July 2021 02.32
> >
> > On 2021/7/2 22:57, Morten Brørup wrote:
> > >> In the DPDK framework, many data-plane API names contain queues.
> > e.g.
> > >> eventdev/crypto..
> > >> The concept of virt queues has continuity.
> > >
> > > I was also wondering about the name "virtual queue".
> > >
> > > Usually, something "virtual" would be an abstraction of something
> > physical, e.g. a software layer on top of something physical.
> > >
> > > Back in the days, a "DMA channel" used to mean a DMA engine on a CPU.
> > If a CPU had 2 DMA channels, they could both be set up simultaneously.
> > >
> > > The current design has the "dmadev" representing a CPU or other chip,
> > which has one or more "HW-queues" representing DMA channels (of the
> > same type), and then "virt-queue" as a software abstraction on top, for
> > using a DMA channel in different ways through individually configured
> > contexts (virt-queues).
> > >
> > > It makes sense to me, although I would consider renaming "HW-queue"
> > to "channel" and perhaps "virt-queue" to "queue".
> >
> > The 'DMA channel' is more used than 'DMA queue', at least google show
> > that there are at least 20+ times more.
> >
> > It's a good idea build the abstraction layer: queue <> channel <> dma-
> > controller.
> > In this way, the meaning of each layer is relatively easy to
> > distinguish literally.
> >
> > will fix in V2
> >
>
> After re-reading all the mails in this thread, I have found one more important high level detail still not decided:
>
> Bruce had suggested flattening the DMA channels, so each dmadev represents a DMA channel. And DMA controllers with multiple DMA channels will have to instantiate multiple dmadevs, one for each DMA channel.
>
> Just like a four port NIC instantiates four ethdevs.
>
> Then, like ethdevs, there would only be two abstraction layers: dmadev <> queue, where a dmadev is a DMA channel on a DMA controller.
>
> However, this assumes that the fast path functions on the individual DMA channels of a DMA controller can be accessed completely independently and simultaneously by multiple threads. (Otherwise, the driver would need to implement critical regions or locking around accessing the common registers in the DMA controller shared by the DMA channels.)
>
> Unless any of the DMA controller vendors claim that this assumption about independence of the DMA channels is wrong, I strongly support Bruce's flattening suggestion.

It is wrong, at least from the octeontx2_dma PoV.

# The PCI device is the DMA controller to which the driver/device is
mapped. (As the device driver is based on the PCI bus, we don't want to
have a vdev for this.)
# The PCI device has HW queue(s).
# Each HW queue has different channels.

In the current configuration, we have only one queue per device and it
has 4 channels. The 4 channels are not thread safe, as they are based on
a single queue.

If we need to flatten it, I think it makes sense to have
dmadev <> channel (and each channel can have a thread-safe capability,
depending on how it is mapped onto the HW queues and on what the device
driver supports).

>
> -Morten
>
fengchengwen July 3, 2021, 9:45 a.m. UTC | #72
On 2021/7/3 16:53, Morten Brørup wrote:
>> From: fengchengwen [mailto:fengchengwen@huawei.com]
>> Sent: Saturday, 3 July 2021 02.32
>>
>> On 2021/7/2 22:57, Morten Brørup wrote:
>>>> In the DPDK framework, many data-plane API names contain queues.
>> e.g.
>>>> eventdev/crypto..
>>>> The concept of virt queues has continuity.
>>>
>>> I was also wondering about the name "virtual queue".
>>>
>>> Usually, something "virtual" would be an abstraction of something
>> physical, e.g. a software layer on top of something physical.
>>>
>>> Back in the days, a "DMA channel" used to mean a DMA engine on a CPU.
>> If a CPU had 2 DMA channels, they could both be set up simultaneously.
>>>
>>> The current design has the "dmadev" representing a CPU or other chip,
>> which has one or more "HW-queues" representing DMA channels (of the
>> same type), and then "virt-queue" as a software abstraction on top, for
>> using a DMA channel in different ways through individually configured
>> contexts (virt-queues).
>>>
>>> It makes sense to me, although I would consider renaming "HW-queue"
>> to "channel" and perhaps "virt-queue" to "queue".
>>
>> The 'DMA channel' is more used than 'DMA queue', at least google show
>> that there are at least 20+ times more.
>>
>> It's a good idea build the abstraction layer: queue <> channel <> dma-
>> controller.
>> In this way, the meaning of each layer is relatively easy to
>> distinguish literally.
>>
>> will fix in V2
>>
> 
> After re-reading all the mails in this thread, I have found one more important high level detail still not decided:
> 
> Bruce had suggested flattening the DMA channels, so each dmadev represents a DMA channel. And DMA controllers with multiple DMA channels will have to instantiate multiple dmadevs, one for each DMA channel.

dpaa2_qdma supports multiple DMA channels. I looked into dpaa2_qdma
and found that its control plane interacts with the kernel, so if we use
the flattening model, there may be interactions between dmadevs.

> 
> Just like a four port NIC instantiates four ethdevs.
> 
> Then, like ethdevs, there would only be two abstraction layers: dmadev <> queue, where a dmadev is a DMA channel on a DMA controller.

With the dmadev <> channel <> queue model there are three abstraction
layers; with flattening there are two.

> 
> However, this assumes that the fast path functions on the individual DMA channels of a DMA controller can be accessed completely independently and simultaneously by multiple threads. (Otherwise, the driver would need to implement critical regions or locking around accessing the common registers in the DMA controller shared by the DMA channels.)

Yes, this scheme has a big implicit dependency, that is, the channels are
independent of each other.

> 
> Unless any of the DMA controller vendors claim that this assumption about independence of the DMA channels is wrong, I strongly support Bruce's flattening suggestion.
> 
> -Morten
>
Morten Brørup July 3, 2021, noon UTC | #73
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of fengchengwen
> Sent: Saturday, 3 July 2021 11.45
> 
> On 2021/7/3 16:53, Morten Brørup wrote:
> >> From: fengchengwen [mailto:fengchengwen@huawei.com]
> >> Sent: Saturday, 3 July 2021 02.32
> >>
> >> On 2021/7/2 22:57, Morten Brørup wrote:
> >>>> In the DPDK framework, many data-plane API names contain queues.
> >> e.g.
> >>>> eventdev/crypto..
> >>>> The concept of virt queues has continuity.
> >>>
> >>> I was also wondering about the name "virtual queue".
> >>>
> >>> Usually, something "virtual" would be an abstraction of something
> >> physical, e.g. a software layer on top of something physical.
> >>>
> >>> Back in the days, a "DMA channel" used to mean a DMA engine on a
> CPU.
> >> If a CPU had 2 DMA channels, they could both be set up
> simultaneously.
> >>>
> >>> The current design has the "dmadev" representing a CPU or other
> chip,
> >> which has one or more "HW-queues" representing DMA channels (of the
> >> same type), and then "virt-queue" as a software abstraction on top,
> for
> >> using a DMA channel in different ways through individually
> configured
> >> contexts (virt-queues).
> >>>
> >>> It makes sense to me, although I would consider renaming "HW-queue"
> >> to "channel" and perhaps "virt-queue" to "queue".
> >>
> >> The 'DMA channel' is more used than 'DMA queue', at least google
> show
> >> that there are at least 20+ times more.
> >>
> >> It's a good idea build the abstraction layer: queue <> channel <>
> dma-
> >> controller.
> >> In this way, the meaning of each layer is relatively easy to
> >> distinguish literally.
> >>
> >> will fix in V2
> >>
> >
> > After re-reading all the mails in this thread, I have found one more
> important high level detail still not decided:
> >
> > Bruce had suggested flattening the DMA channels, so each dmadev
> represents a DMA channel. And DMA controllers with multiple DMA
> channels will have to instantiate multiple dmadevs, one for each DMA
> channel.
> 
> The dpaa2_qdma support multiple DMA channels, I looked into the
> dpaa2_qdma
> and found the control-plane interacts with the kernel, so if we use the
> flattening model, there maybe interactions between dmadevs.

It is perfectly acceptable for the control-plane DMA controller functions to interact across multiple dmadevs, not being thread safe and using locks etc. to protect critical regions accessing shared registers.

The key question is: Can the data-plane dmadev functions (rte_dma_copy() etc.) be implemented to be thread safe, so multiple threads can use data-plane dmadev functions simultaneously?

> 
> >
> > Just like a four port NIC instantiates four ethdevs.
> >
> > Then, like ethdevs, there would only be two abstraction layers:
> dmadev <> queue, where a dmadev is a DMA channel on a DMA controller.
> 
> the dmadev <> channel <> queue model, there are three abstraction
> layers,
> and two abstraction layers.
> 
> >
> > However, this assumes that the fast path functions on the individual
> DMA channels of a DMA controller can be accessed completely
> independently and simultaneously by multiple threads. (Otherwise, the
> driver would need to implement critical regions or locking around
> accessing the common registers in the DMA controller shared by the DMA
> channels.)
> 
> Yes, this scheme has a big implicit dependency, that is, the channels
> are
> independent of each other.
> 
> >
> > Unless any of the DMA controller vendors claim that this assumption
> about independence of the DMA channels is wrong, I strongly support
> Bruce's flattening suggestion.
> >
> > -Morten
> >
>
Morten Brørup July 3, 2021, 12:24 p.m. UTC | #74
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> Sent: Saturday, 3 July 2021 11.09
> 
> On Sat, Jul 3, 2021 at 2:23 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > > From: fengchengwen [mailto:fengchengwen@huawei.com]
> > > Sent: Saturday, 3 July 2021 02.32
> > >
> > > On 2021/7/2 22:57, Morten Brørup wrote:
> > > >> In the DPDK framework, many data-plane API names contain queues.
> > > e.g.
> > > >> eventdev/crypto..
> > > >> The concept of virt queues has continuity.
> > > >
> > > > I was also wondering about the name "virtual queue".
> > > >
> > > > Usually, something "virtual" would be an abstraction of something
> > > physical, e.g. a software layer on top of something physical.
> > > >
> > > > Back in the days, a "DMA channel" used to mean a DMA engine on a
> CPU.
> > > If a CPU had 2 DMA channels, they could both be set up
> simultaneously.
> > > >
> > > > The current design has the "dmadev" representing a CPU or other
> chip,
> > > which has one or more "HW-queues" representing DMA channels (of the
> > > same type), and then "virt-queue" as a software abstraction on top,
> for
> > > using a DMA channel in different ways through individually
> configured
> > > contexts (virt-queues).
> > > >
> > > > It makes sense to me, although I would consider renaming "HW-
> queue"
> > > to "channel" and perhaps "virt-queue" to "queue".
> > >
> > > The 'DMA channel' is more used than 'DMA queue', at least google
> show
> > > that there are at least 20+ times more.
> > >
> > > It's a good idea build the abstraction layer: queue <> channel <>
> dma-
> > > controller.
> > > In this way, the meaning of each layer is relatively easy to
> > > distinguish literally.
> > >
> > > will fix in V2
> > >
> >
> > After re-reading all the mails in this thread, I have found one more
> important high level detail still not decided:
> >
> > Bruce had suggested flattening the DMA channels, so each dmadev
> represents a DMA channel. And DMA controllers with multiple DMA
> channels will have to instantiate multiple dmadevs, one for each DMA
> channel.
> >
> > Just like a four port NIC instantiates four ethdevs.
> >
> > Then, like ethdevs, there would only be two abstraction layers:
> dmadev <> queue, where a dmadev is a DMA channel on a DMA controller.
> >
> > However, this assumes that the fast path functions on the individual
> DMA channels of a DMA controller can be accessed completely
> independently and simultaneously by multiple threads. (Otherwise, the
> driver would need to implement critical regions or locking around
> accessing the common registers in the DMA controller shared by the DMA
> channels.)
> >
> > Unless any of the DMA controller vendors claim that this assumption
> about independence of the DMA channels is wrong, I strongly support
> Bruce's flattening suggestion.
> 
> It is wrong from alteast octeontx2_dma PoV.
> 
> # The PCI device is DMA controller where the driver/device is
> mapped.(As device driver is based on PCI bus, We dont want to have
> vdev for this)
> # The PCI device has HW queue(s)
> # Each HW queue has different channels.
> 
> In the current configuration, we have only one queue per device and it
> has 4 channels. 4 channels are not threaded safe as it is based on
> single queue.

Please clarify "current configuration": Is that a configuration modifiable by changing some software/driver, or is it the chip that was built that way in the RTL code?

> 
> I think, if we need to flatten it, I think, it makes sense to have
> dmadev <> channel (and each channel can have thread-safe capability
> based on how it mapped on HW queues based on the device driver
> capability).

The key question is how many threads can independently call data-plane dmadev functions (rte_dma_copy() etc.) simultaneously. If I understand your explanation correctly, only one - because you only have one DMA device, and all access to it goes through a single hardware queue.

I just realized that although you only have one DMA controller with only one HW queue, your four DMA channels allow four sequentially initiated transactions to be running simultaneously. Does the application benefit from knowing that the dmadev can have multiple ongoing transactions, or can the fast-path dmadev API hide that ability?
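
For illustration, "hiding" could look like the sketch below, in C: the application calls a single copy function, and the driver side spreads consecutive jobs over the four channels of the one HW queue. This is only a guess at one possible shape. demo_dmadev, demo_dma_copy and DEMO_NB_CHANNELS are hypothetical names, the round-robin policy is an assumption, and memcpy() stands in for the hardware transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NB_CHANNELS 4u /* channels behind the single HW queue */

struct demo_dmadev {
    uint32_t next_chan;                   /* channel for the next job */
    uint32_t submitted[DEMO_NB_CHANNELS]; /* jobs handed to each channel */
};

/*
 * Application-facing copy: no channel argument. The channel is picked
 * internally, round-robin. The chosen channel is returned only so the
 * sketch can be inspected; a real API would more likely return a job id.
 * Not thread safe: one caller at a time, as with a single HW queue.
 */
static int demo_dma_copy(struct demo_dmadev *dev, void *dst,
                         const void *src, size_t len)
{
    assert(dev != NULL);
    uint32_t chan = dev->next_chan;
    dev->next_chan = (chan + 1u) % DEMO_NB_CHANNELS;

    memcpy(dst, src, len); /* stand-in for the hardware transfer */
    dev->submitted[chan]++;
    return (int)chan;
}
```

Four back-to-back calls land on channels 0 to 3 and the fifth wraps to channel 0, so up to four transfers could be in flight without the application knowing that channels exist.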
Jerin Jacob July 4, 2021, 7:34 a.m. UTC | #75
On Sat, Jul 3, 2021 at 5:30 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of fengchengwen
> > Sent: Saturday, 3 July 2021 11.45
> >
> > On 2021/7/3 16:53, Morten Brørup wrote:
> > >> From: fengchengwen [mailto:fengchengwen@huawei.com]
> > >> Sent: Saturday, 3 July 2021 02.32
> > >>
> > >> On 2021/7/2 22:57, Morten Brørup wrote:
> > >>>> In the DPDK framework, many data-plane API names contain queues.
> > >> e.g.
> > >>>> eventdev/crypto..
> > >>>> The concept of virt queues has continuity.
> > >>>
> > >>> I was also wondering about the name "virtual queue".
> > >>>
> > >>> Usually, something "virtual" would be an abstraction of something
> > >> physical, e.g. a software layer on top of something physical.
> > >>>
> > >>> Back in the days, a "DMA channel" used to mean a DMA engine on a
> > CPU.
> > >> If a CPU had 2 DMA channels, they could both be set up
> > simultaneously.
> > >>>
> > >>> The current design has the "dmadev" representing a CPU or other
> > chip,
> > >> which has one or more "HW-queues" representing DMA channels (of the
> > >> same type), and then "virt-queue" as a software abstraction on top,
> > for
> > >> using a DMA channel in different ways through individually
> > configured
> > >> contexts (virt-queues).
> > >>>
> > >>> It makes sense to me, although I would consider renaming "HW-queue"
> > >> to "channel" and perhaps "virt-queue" to "queue".
> > >>
> > >> The 'DMA channel' is more used than 'DMA queue', at least google
> > show
> > >> that there are at least 20+ times more.
> > >>
> > >> It's a good idea build the abstraction layer: queue <> channel <>
> > dma-
> > >> controller.
> > >> In this way, the meaning of each layer is relatively easy to
> > >> distinguish literally.
> > >>
> > >> will fix in V2
> > >>
> > >
> > > After re-reading all the mails in this thread, I have found one more
> > important high level detail still not decided:
> > >
> > > Bruce had suggested flattening the DMA channels, so each dmadev
> > represents a DMA channel. And DMA controllers with multiple DMA
> > channels will have to instantiate multiple dmadevs, one for each DMA
> > channel.
> >
> > The dpaa2_qdma support multiple DMA channels, I looked into the
> > dpaa2_qdma
> > and found the control-plane interacts with the kernel, so if we use the
> > flattening model, there maybe interactions between dmadevs.
>
> It is perfectly acceptable for the control-plane DMA controller functions to interact across multiple dmadevs, not being thread safe and using locks etc. to protect critical regions accessing shared registers.
>
> The key question is: Can the data-plane dmadev functions (rte_dma_copy() etc.) be implemented to be thread safe, so multiple threads can use data-plane dmadev functions simultaneously?

It can, if we need it that way.

>
> >
> > >
> > > Just like a four port NIC instantiates four ethdevs.
> > >
> > > Then, like ethdevs, there would only be two abstraction layers:
> > dmadev <> queue, where a dmadev is a DMA channel on a DMA controller.
> >
> > the dmadev <> channel <> queue model, there are three abstraction
> > layers,
> > and two abstraction layers.
> >
> > >
> > > However, this assumes that the fast path functions on the individual
> > DMA channels of a DMA controller can be accessed completely
> > independently and simultaneously by multiple threads. (Otherwise, the
> > driver would need to implement critical regions or locking around
> > accessing the common registers in the DMA controller shared by the DMA
> > channels.)
> >
> > Yes, this scheme has a big implicit dependency, that is, the channels
> > are
> > independent of each other.
> >
> > >
> > > Unless any of the DMA controller vendors claim that this assumption
> > about independence of the DMA channels is wrong, I strongly support
> > Bruce's flattening suggestion.
> > >
> > > -Morten
> > >
> >
>
Jerin Jacob July 4, 2021, 7:43 a.m. UTC | #76
On Sat, Jul 3, 2021 at 5:54 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> > Sent: Saturday, 3 July 2021 11.09
> >
> > On Sat, Jul 3, 2021 at 2:23 PM Morten Brørup <mb@smartsharesystems.com>
> > wrote:
> > >
> > > > From: fengchengwen [mailto:fengchengwen@huawei.com]
> > > > Sent: Saturday, 3 July 2021 02.32
> > > >
> > > > On 2021/7/2 22:57, Morten Brørup wrote:
> > > > >> In the DPDK framework, many data-plane API names contain queues.
> > > > e.g.
> > > > >> eventdev/crypto..
> > > > >> The concept of virt queues has continuity.
> > > > >
> > > > > I was also wondering about the name "virtual queue".
> > > > >
> > > > > Usually, something "virtual" would be an abstraction of something
> > > > physical, e.g. a software layer on top of something physical.
> > > > >
> > > > > Back in the days, a "DMA channel" used to mean a DMA engine on a
> > CPU.
> > > > If a CPU had 2 DMA channels, they could both be set up
> > simultaneously.
> > > > >
> > > > > The current design has the "dmadev" representing a CPU or other
> > chip,
> > > > which has one or more "HW-queues" representing DMA channels (of the
> > > > same type), and then "virt-queue" as a software abstraction on top,
> > for
> > > > using a DMA channel in different ways through individually
> > configured
> > > > contexts (virt-queues).
> > > > >
> > > > > It makes sense to me, although I would consider renaming "HW-
> > queue"
> > > > to "channel" and perhaps "virt-queue" to "queue".
> > > >
> > > > The 'DMA channel' is more used than 'DMA queue', at least google
> > show
> > > > that there are at least 20+ times more.
> > > >
> > > > It's a good idea build the abstraction layer: queue <> channel <>
> > dma-
> > > > controller.
> > > > In this way, the meaning of each layer is relatively easy to
> > > > distinguish literally.
> > > >
> > > > will fix in V2
> > > >
> > >
> > > After re-reading all the mails in this thread, I have found one more
> > important high level detail still not decided:
> > >
> > > Bruce had suggested flattening the DMA channels, so each dmadev
> > represents a DMA channel. And DMA controllers with multiple DMA
> > channels will have to instantiate multiple dmadevs, one for each DMA
> > channel.
> > >
> > > Just like a four port NIC instantiates four ethdevs.
> > >
> > > Then, like ethdevs, there would only be two abstraction layers:
> > dmadev <> queue, where a dmadev is a DMA channel on a DMA controller.
> > >
> > > However, this assumes that the fast path functions on the individual
> > DMA channels of a DMA controller can be accessed completely
> > independently and simultaneously by multiple threads. (Otherwise, the
> > driver would need to implement critical regions or locking around
> > accessing the common registers in the DMA controller shared by the DMA
> > channels.)
> > >
> > > Unless any of the DMA controller vendors claim that this assumption
> > about independence of the DMA channels is wrong, I strongly support
> > Bruce's flattening suggestion.
> >
> > It is wrong from alteast octeontx2_dma PoV.
> >
> > # The PCI device is DMA controller where the driver/device is
> > mapped.(As device driver is based on PCI bus, We dont want to have
> > vdev for this)
> > # The PCI device has HW queue(s)
> > # Each HW queue has different channels.
> >
> > In the current configuration, we have only one queue per device and it
> > has 4 channels. 4 channels are not threaded safe as it is based on
> > single queue.
>
> Please clarify "current configuration": Is that a configuration modifiable by changing some software/driver, or is it the chip that was built that way in the RTL code?

We have 8 queues per SoC. On some HW versions, it can be
configured as (a) or (b) using FW settings:
a) One PCI device with 8 queues
b) 8 PCI devices, each with one queue

Everyone is using mode (b), as it lets 8 different applications use
DMA: once one application binds a PCI device, other applications cannot
use that same PCI device.
If one application needs 8 queues, all 8 dmadevs can be bound to that
single application in mode (b).

I think that, in the above way, we can flatten to <device> <> <channel/queue>

>
> >
> > I think, if we need to flatten it, I think, it makes sense to have
> > dmadev <> channel (and each channel can have thread-safe capability
> > based on how it mapped on HW queues based on the device driver
> > capability).
>
> The key question is how many threads can independently call data-plane dmadev functions (rte_dma_copy() etc.) simultaneously. If I understand your explanation correctly, only one - because you only have one DMA device, and all access to it goes through a single hardware queue.
>
> I just realized that although you only have one DMA Controller with only one HW queue, your four DMA channels allows four sequentially initiated transactions to be running simultaneously. Does the application have any benefit by knowing that the dmadev can have multiple ongoing transactions, or can the fast-path dmadev API hide that ability?

In my view it is better to hide it, and I have a similar proposal at
http://mails.dpdk.org/archives/dev/2021-July/213141.html
--------------
>   7) Because data-plane APIs are not thread-safe, and user could determine
>      virt-queue to HW-queue's map (at the queue-setup stage), so it is user's
>      duty to ensure thread-safe.

+1. But I am not sure how easy it is for a fast-path application to have
this logic. Instead, I think it is better for the driver to report the
capability per queue, and for the application to state its requirement
in the channel configuration (whether or not multiple producers enqueue
to the same HW queue).
Based on the request, the implementation can pick the correct function
pointer for enqueue (locked vs. lockless version, if the HW does not
support lockless).

------------------------
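
As a rough sketch of that capability-plus-request idea, in C: the driver states whether the HW queue tolerates concurrent producers, the application states whether it will enqueue from several threads, and setup picks the enqueue function pointer once. All names here (demo_queue, demo_queue_setup, hw_mt_safe, want_multi_producer, DEMO_RING_SIZE) are hypothetical and not from the RFC. The "lockless" path is a placeholder for whatever the hardware needs, and the lock is a plain C11 atomic_flag spinlock chosen only to keep the example self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u /* hypothetical descriptor ring depth */

struct demo_queue;
typedef int (*demo_enq_fn)(struct demo_queue *q, uint64_t job);

struct demo_queue {
    uint64_t ring[DEMO_RING_SIZE];
    uint32_t tail;    /* next free slot */
    atomic_flag lock; /* taken only on the locked path */
    demo_enq_fn enq;  /* chosen once, at setup time */
};

/* Single producer, or HW that arbitrates concurrent producers itself. */
static int demo_enq_lockless(struct demo_queue *q, uint64_t job)
{
    if (q->tail == DEMO_RING_SIZE)
        return -1; /* ring full */
    q->ring[q->tail++] = job;
    return 0;
}

/* Multiple producers on HW that cannot arbitrate: serialize in software. */
static int demo_enq_locked(struct demo_queue *q, uint64_t job)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ; /* spin until the lock is free */
    int ret = demo_enq_lockless(q, job);
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
    return ret;
}

/* hw_mt_safe: driver capability. want_multi_producer: application request. */
static void demo_queue_setup(struct demo_queue *q, bool hw_mt_safe,
                             bool want_multi_producer)
{
    assert(q != NULL);
    q->tail = 0;
    atomic_flag_clear(&q->lock);
    q->enq = (want_multi_producer && !hw_mt_safe) ? demo_enq_locked
                                                  : demo_enq_lockless;
}
```

The fast path then always calls q->enq(q, job), so an application that does not ask for multi-producer access never pays for the lock.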
>
Morten Brørup July 5, 2021, 10:28 a.m. UTC | #77
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> Sent: Sunday, 4 July 2021 09.43
> 
> On Sat, Jul 3, 2021 at 5:54 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> > > Sent: Saturday, 3 July 2021 11.09
> > >
> > > On Sat, Jul 3, 2021 at 2:23 PM Morten Brørup
> <mb@smartsharesystems.com>
> > > wrote:
> > > >
> > > > > From: fengchengwen [mailto:fengchengwen@huawei.com]
> > > > > Sent: Saturday, 3 July 2021 02.32
> > > > >
> > > > > On 2021/7/2 22:57, Morten Brørup wrote:
> > > > > >> In the DPDK framework, many data-plane API names contain
> queues.
> > > > > e.g.
> > > > > >> eventdev/crypto..
> > > > > >> The concept of virt queues has continuity.
> > > > > >
> > > > > > I was also wondering about the name "virtual queue".
> > > > > >
> > > > > > Usually, something "virtual" would be an abstraction of
> something
> > > > > physical, e.g. a software layer on top of something physical.
> > > > > >
> > > > > > Back in the days, a "DMA channel" used to mean a DMA engine
> on a
> > > CPU.
> > > > > If a CPU had 2 DMA channels, they could both be set up
> > > simultaneously.
> > > > > >
> > > > > > The current design has the "dmadev" representing a CPU or
> other
> > > chip,
> > > > > which has one or more "HW-queues" representing DMA channels (of
> the
> > > > > same type), and then "virt-queue" as a software abstraction on
> top,
> > > for
> > > > > using a DMA channel in different ways through individually
> > > configured
> > > > > contexts (virt-queues).
> > > > > >
> > > > > > It makes sense to me, although I would consider renaming "HW-
> > > queue"
> > > > > to "channel" and perhaps "virt-queue" to "queue".
> > > > >
> > > > > The 'DMA channel' is more used than 'DMA queue', at least
> google
> > > show
> > > > > that there are at least 20+ times more.
> > > > >
> > > > > It's a good idea build the abstraction layer: queue <> channel
> <>
> > > dma-
> > > > > controller.
> > > > > In this way, the meaning of each layer is relatively easy to
> > > > > distinguish literally.
> > > > >
> > > > > will fix in V2
> > > > >
> > > >
> > > > After re-reading all the mails in this thread, I have found one
> more
> > > important high level detail still not decided:
> > > >
> > > > Bruce had suggested flattening the DMA channels, so each dmadev
> > > represents a DMA channel. And DMA controllers with multiple DMA
> > > channels will have to instantiate multiple dmadevs, one for each
> DMA
> > > channel.
> > > >
> > > > Just like a four port NIC instantiates four ethdevs.
> > > >
> > > > Then, like ethdevs, there would only be two abstraction layers:
> > > dmadev <> queue, where a dmadev is a DMA channel on a DMA
> controller.
> > > >
> > > > However, this assumes that the fast path functions on the
> individual
> > > DMA channels of a DMA controller can be accessed completely
> > > independently and simultaneously by multiple threads. (Otherwise,
> the
> > > driver would need to implement critical regions or locking around
> > > accessing the common registers in the DMA controller shared by the
> DMA
> > > channels.)
> > > >
> > > > Unless any of the DMA controller vendors claim that this
> assumption
> > > about independence of the DMA channels is wrong, I strongly support
> > > Bruce's flattening suggestion.
> > >
> > > It is wrong from alteast octeontx2_dma PoV.
> > >
> > > # The PCI device is DMA controller where the driver/device is
> > > mapped.(As device driver is based on PCI bus, We dont want to have
> > > vdev for this)
> > > # The PCI device has HW queue(s)
> > > # Each HW queue has different channels.
> > >
> > > In the current configuration, we have only one queue per device and
> it
> > > has 4 channels. 4 channels are not threaded safe as it is based on
> > > single queue.
> >
> > Please clarify "current configuration": Is that a configuration
> modifiable by changing some software/driver, or is it the chip that was
> built that way in the RTL code?
> 
> We have 8 queues per SoC, Based on some of HW versions it can be
> configured as (a) or (b) using FW settings.
> a) One PCI devices with 8 Queues
> b) 8 PCI devices with each one has one queue.
> 
> Everyone is using mode (b) as it helps 8 different applications to use
> DMA as if one application binds the PCI device other applications can
> not use the same PCI device.
> If one application needs 8 queues, it is possible that 8 dmadevice can
> be bound to a single application with mode (b).
> 
> 
> I think, in above way we can flatten to <device> <> <channel/queue>
> 
> >
> > >
> > > I think, if we need to flatten it, I think, it makes sense to have
> > > dmadev <> channel (and each channel can have thread-safe capability
> > > based on how it mapped on HW queues based on the device driver
> > > capability).
> >
> > The key question is how many threads can independently call data-
> plane dmadev functions (rte_dma_copy() etc.) simultaneously. If I
> understand your explanation correctly, only one - because you only have
> one DMA device, and all access to it goes through a single hardware
> queue.
> >
> > I just realized that although you only have one DMA Controller with
> only one HW queue, your four DMA channels allows four sequentially
> initiated transactions to be running simultaneously. Does the
> application have any benefit by knowing that the dmadev can have
> multiple ongoing transactions, or can the fast-path dmadev API hide
> that ability?
> 
> In my view it is better to hide and I have similar proposal at
> http://mails.dpdk.org/archives/dev/2021-July/213141.html
> --------------
> >   7) Because data-plane APIs are not thread-safe, and user could
> determine
> >      virt-queue to HW-queue's map (at the queue-setup stage), so it
> is user's
> >      duty to ensure thread-safe.
> 
> +1. But I am not sure how easy for the fast-path application to have
> this logic,
> Instead, I think, it is better to tell the capa for queue by driver
> and in channel configuration,
> the application can request for requirement (Is multiple producers enq
> to the same HW queue or not).
> Based on the request, the implementation can pick the correct function
> pointer for enq.(lock vs lockless version if HW does not support
> lockless)

+1 to that!

> 
> ------------------------
> >
fengchengwen July 6, 2021, 7:11 a.m. UTC | #78
On 2021/7/5 18:28, Morten Brørup wrote:
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
>> Sent: Sunday, 4 July 2021 09.43
>>
>> On Sat, Jul 3, 2021 at 5:54 PM Morten Brørup <mb@smartsharesystems.com>
>> wrote:
>>>
>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
>>>> Sent: Saturday, 3 July 2021 11.09
>>>>
>>>> On Sat, Jul 3, 2021 at 2:23 PM Morten Brørup
>> <mb@smartsharesystems.com>
>>>> wrote:
>>>>>
>>>>>> From: fengchengwen [mailto:fengchengwen@huawei.com]
>>>>>> Sent: Saturday, 3 July 2021 02.32
>>>>>>
>>>>>> On 2021/7/2 22:57, Morten Brørup wrote:
>>>>>>>> In the DPDK framework, many data-plane API names contain
>> queues.
>>>>>> e.g.
>>>>>>>> eventdev/crypto..
>>>>>>>> The concept of virt queues has continuity.
>>>>>>>
>>>>>>> I was also wondering about the name "virtual queue".
>>>>>>>
>>>>>>> Usually, something "virtual" would be an abstraction of
>> something
>>>>>> physical, e.g. a software layer on top of something physical.
>>>>>>>
>>>>>>> Back in the days, a "DMA channel" used to mean a DMA engine
>> on a
>>>> CPU.
>>>>>> If a CPU had 2 DMA channels, they could both be set up
>>>> simultaneously.
>>>>>>>
>>>>>>> The current design has the "dmadev" representing a CPU or
>> other
>>>> chip,
>>>>>> which has one or more "HW-queues" representing DMA channels (of
>> the
>>>>>> same type), and then "virt-queue" as a software abstraction on
>> top,
>>>> for
>>>>>> using a DMA channel in different ways through individually
>>>> configured
>>>>>> contexts (virt-queues).
>>>>>>>
>>>>>>> It makes sense to me, although I would consider renaming "HW-
>>>> queue"
>>>>>> to "channel" and perhaps "virt-queue" to "queue".
>>>>>>
>>>>>> The 'DMA channel' is more used than 'DMA queue', at least
>> google
>>>> show
>>>>>> that there are at least 20+ times more.
>>>>>>
>>>>>> It's a good idea build the abstraction layer: queue <> channel
>> <>
>>>> dma-
>>>>>> controller.
>>>>>> In this way, the meaning of each layer is relatively easy to
>>>>>> distinguish literally.
>>>>>>
>>>>>> will fix in V2
>>>>>>
>>>>>
>>>>> After re-reading all the mails in this thread, I have found one
>> more
>>>> important high level detail still not decided:
>>>>>
>>>>> Bruce had suggested flattening the DMA channels, so each dmadev
>>>> represents a DMA channel. And DMA controllers with multiple DMA
>>>> channels will have to instantiate multiple dmadevs, one for each
>> DMA
>>>> channel.
>>>>>
>>>>> Just like a four port NIC instantiates four ethdevs.
>>>>>
>>>>> Then, like ethdevs, there would only be two abstraction layers:
>>>> dmadev <> queue, where a dmadev is a DMA channel on a DMA
>> controller.
>>>>>
>>>>> However, this assumes that the fast path functions on the
>> individual
>>>> DMA channels of a DMA controller can be accessed completely
>>>> independently and simultaneously by multiple threads. (Otherwise,
>> the
>>>> driver would need to implement critical regions or locking around
>>>> accessing the common registers in the DMA controller shared by the
>> DMA
>>>> channels.)
>>>>>
>>>>> Unless any of the DMA controller vendors claim that this
>> assumption
>>>> about independence of the DMA channels is wrong, I strongly support
>>>> Bruce's flattening suggestion.
>>>>
>>>> It is wrong from alteast octeontx2_dma PoV.
>>>>
>>>> # The PCI device is DMA controller where the driver/device is
>>>> mapped.(As device driver is based on PCI bus, We dont want to have
>>>> vdev for this)
>>>> # The PCI device has HW queue(s)
>>>> # Each HW queue has different channels.
>>>>
>>>> In the current configuration, we have only one queue per device and
>> it
>>>> has 4 channels. 4 channels are not threaded safe as it is based on
>>>> single queue.
>>>
>>> Please clarify "current configuration": Is that a configuration
>> modifiable by changing some software/driver, or is it the chip that was
>> built that way in the RTL code?
>>
>> We have 8 queues per SoC, Based on some of HW versions it can be
>> configured as (a) or (b) using FW settings.
>> a) One PCI devices with 8 Queues
>> b) 8 PCI devices with each one has one queue.
>>
>> Everyone is using mode (b) as it helps 8 different applications to use
>> DMA as if one application binds the PCI device other applications can
>> not use the same PCI device.
>> If one application needs 8 queues, it is possible that 8 dmadevice can
>> be bound to a single application with mode (b).
>>
>>
>> I think, in above way we can flatten to <device> <> <channel/queue>

I just looked at the dpaa2_qdma driver code, and it seems OK to set up
one xxxdev per queue.
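
To make the flattened model concrete, here is a rough standalone sketch of a probe routine that registers one device per HW queue. Every sketch_* name is hypothetical; none of this comes from the RFC or from the dpaa2_qdma driver, and a real PMD would allocate its devices through the library rather than a static table.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_MAX_DEVS 16
#define SKETCH_NAME_LEN 64

/* Hypothetical flattened device: exactly one HW queue per device. */
struct sketch_dmadev {
	uint16_t dev_id;   /* index in the global device table */
	uint16_t hw_queue; /* HW queue index on the parent controller */
	char name[SKETCH_NAME_LEN];
};

static struct sketch_dmadev sketch_devs[SKETCH_MAX_DEVS];
static uint16_t sketch_nb_devs;

/*
 * Probe one controller and register one device per HW queue.
 * Returns the number of devices registered, or -1 if the table
 * cannot hold them all (in which case nothing is registered).
 */
static int
sketch_probe_controller(const char *ctrl_name, uint16_t nb_hw_queues)
{
	uint16_t q;

	assert(ctrl_name != NULL);
	if (nb_hw_queues > SKETCH_MAX_DEVS - sketch_nb_devs)
		return -1;

	for (q = 0; q < nb_hw_queues; q++) {
		struct sketch_dmadev *dev = &sketch_devs[sketch_nb_devs];

		dev->dev_id = sketch_nb_devs;
		dev->hw_queue = q;
		snprintf(dev->name, sizeof(dev->name), "%s_q%u",
			 ctrl_name, (unsigned int)q);
		sketch_nb_devs++;
	}
	return nb_hw_queues;
}
```

With this layout an application that needs 8 queues simply opens 8 devices, and each device can be handed to a different thread without sharing per-device state.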

>>
>>>
>>>>
>>>> I think, if we need to flatten it, I think, it makes sense to have
>>>> dmadev <> channel (and each channel can have thread-safe capability
>>>> based on how it mapped on HW queues based on the device driver
>>>> capability).
>>>
>>> The key question is how many threads can independently call data-
>> plane dmadev functions (rte_dma_copy() etc.) simultaneously. If I
>> understand your explanation correctly, only one - because you only have
>> one DMA device, and all access to it goes through a single hardware
>> queue.
>>>
>>> I just realized that although you only have one DMA Controller with
>> only one HW queue, your four DMA channels allows four sequentially
>> initiated transactions to be running simultaneously. Does the
>> application have any benefit by knowing that the dmadev can have
>> multiple ongoing transactions, or can the fast-path dmadev API hide
>> that ability?
>>
>> In my view it is better to hide and I have similar proposal at
>> http://mails.dpdk.org/archives/dev/2021-July/213141.html
>> --------------
>>>   7) Because data-plane APIs are not thread-safe, and user could
>> determine
>>>      virt-queue to HW-queue's map (at the queue-setup stage), so it
>> is user's
>>>      duty to ensure thread-safe.
>>
>> +1. But I am not sure how easy for the fast-path application to have
>> this logic,
>> Instead, I think, it is better to tell the capa for queue by driver
>> and in channel configuration,
>> the application can request for requirement (Is multiple producers enq
>> to the same HW queue or not).
>> Based on the request, the implementation can pick the correct function
>> pointer for enq.(lock vs lockless version if HW does not support
>> lockless)
> 
> +1 to that!
> 

Adding it to the channel configuration sounds good.
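
A minimal sketch of how that could look, assuming a per-channel config flag and a driver-reported capability. All sketch_* names are hypothetical and not part of this RFC; a C11 atomic_flag stands in for whatever lock the driver would really use (rte_spinlock would be the natural choice in DPDK).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_chan;
typedef int (*sketch_enq_t)(struct sketch_chan *ch, uint64_t src,
			    uint64_t dst, uint32_t len);

/* Hypothetical channel configuration supplied by the application. */
struct sketch_chan_conf {
	bool multi_producer; /* several threads enqueue to this channel */
};

struct sketch_chan {
	atomic_flag lock;  /* only used by the locked enqueue variant */
	uint32_t nb_enq;   /* stand-in for the descriptor ring state */
	sketch_enq_t enq;  /* selected once, at configuration time */
};

/* Lockless enqueue: the caller guarantees a single producer. */
static int
sketch_enq_lockless(struct sketch_chan *ch, uint64_t src, uint64_t dst,
		    uint32_t len)
{
	(void)src; (void)dst; (void)len;
	ch->nb_enq++; /* a real driver would write a descriptor here */
	return 1;
}

/* Locked enqueue: serialises producers around the lockless path. */
static int
sketch_enq_locked(struct sketch_chan *ch, uint64_t src, uint64_t dst,
		  uint32_t len)
{
	int ret;

	while (atomic_flag_test_and_set_explicit(&ch->lock,
						 memory_order_acquire))
		; /* spin until the lock is free */
	ret = sketch_enq_lockless(ch, src, dst, len);
	atomic_flag_clear_explicit(&ch->lock, memory_order_release);
	return ret;
}

/*
 * Pick the enqueue function from the application's request and the
 * (hypothetical) hardware capability hw_mt_safe.
 */
static int
sketch_chan_configure(struct sketch_chan *ch,
		      const struct sketch_chan_conf *conf, bool hw_mt_safe)
{
	assert(ch != NULL && conf != NULL);
	atomic_flag_clear(&ch->lock);
	ch->nb_enq = 0;
	ch->enq = (conf->multi_producer && !hw_mt_safe) ?
		sketch_enq_locked : sketch_enq_lockless;
	return 0;
}
```

The fast path then calls ch->enq() unconditionally, so single-producer applications pay nothing for the lock.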

>>
>> ------------------------
>>>
>

Patch

diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
new file mode 100644
index 0000000..ca7c8a8
--- /dev/null
+++ b/lib/dmadev/rte_dmadev.h
@@ -0,0 +1,531 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 HiSilicon Limited.
+ */
+
+#ifndef _RTE_DMADEV_H_
+#define _RTE_DMADEV_H_
+
+/**
+ * @file rte_dmadev.h
+ *
+ * DMA (Direct Memory Access) device APIs.
+ *
+ * Defines RTE DMA Device APIs for DMA operations and its provisioning.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_memory.h>
+#include <rte_errno.h>
+#include <rte_compat.h>
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the total number of DMA devices that have been successfully
+ * initialised.
+ *
+ * @return
+ *   The total number of usable DMA devices.
+ */
+__rte_experimental
+uint16_t
+rte_dmadev_count(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the device identifier for the named DMA device.
+ *
+ * @param name
+ *   DMA device name to select the DMA device identifier.
+ *
+ * @return
+ *   Returns DMA device identifier on success.
+ *   - <0: Failure to find named DMA device.
+ */
+__rte_experimental
+int
+rte_dmadev_get_dev_id(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Return the NUMA socket to which a device is connected.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @return
+ *   The NUMA socket id to which the device is connected or
+ *   a default of zero if the socket could not be determined.
+ *   - -EINVAL: dev_id value is out of range.
+ */
+__rte_experimental
+int
+rte_dmadev_socket_id(uint16_t dev_id);
+
+/**
+ * The capabilities of a DMA device.
+ */
+#define RTE_DMA_DEV_CAPA_FILL	(1ull << 0) /**< Support fill ops */
+#define RTE_DMA_DEV_CAPA_FENCE	(1ull << 1) /**< Support fence ops */
+#define RTE_DMA_DEV_CAPA_HANDLE	(1ull << 2) /**< Support opaque handle */
+
+/**
+ * DMA device information
+ */
+struct rte_dmadev_info {
+	const char *driver_name; /**< DMA driver name. */
+	struct rte_device *device; /**< Device information. */
+	uint64_t dev_capa; /**< Device capabilities (RTE_DMA_DEV_CAPA_). */
+	uint16_t nb_max_desc; /**< Max allowed number of descriptors. */
+	uint16_t nb_min_desc; /**< Min allowed number of descriptors. */
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Retrieve the contextual information of a DMA device.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @param[out] dev_info
+ *   A pointer to a structure of type *rte_dmadev_info* to be filled with the
+ *   contextual information of the device.
+ * @return
+ *   - 0: Success, driver updates the contextual information of the DMA device
+ *   - <0: Error code returned by the driver info get function.
+ *
+ */
+__rte_experimental
+int
+rte_dmadev_info_get(uint16_t dev_id, struct rte_dmadev_info *dev_info);
+
+/**
+ * A structure used to configure a DMA device.
+ */
+struct rte_dmadev_conf {
+	uint16_t nb_desc; /**< Number of descriptors in the submission ring */
+	bool handle_enable; /**< if set, process user-supplied opaque handle */
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Configure a DMA device.
+ *
+ * This function must be invoked first before any other function in the
+ * API. This function can also be re-invoked when a device is in the
+ * stopped state.
+ *
+ * The caller may use rte_dmadev_info_get() to get the capabilities of the
+ * resources available for this DMA device.
+ *
+ * @param dev_id
+ *   The identifier of the device to configure.
+ * @param dev_conf
+ *   The DMA device configuration structure encapsulated into rte_dmadev_conf
+ *   object.
+ *
+ * @return
+ *   - 0: Success, device configured.
+ *   - <0: Error code returned by the driver configuration function.
+ */
+__rte_experimental
+int
+rte_dmadev_configure(uint16_t dev_id, struct rte_dmadev_conf *dev_conf);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Start a DMA device.
+ *
+ * The device start step is the last one and consists of setting the DMA
+ * to start accepting jobs (e.g. fill or copy).
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @return
+ *   - 0: Success, device started.
+ *   - <0: Error code returned by the driver start function.
+ */
+__rte_experimental
+int
+rte_dmadev_start(uint16_t dev_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Stop a DMA device. The device can be restarted with a call to
+ * rte_dmadev_start()
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @return
+ *   - 0: Success, device stopped.
+ *   - <0: Error code returned by the driver stop function.
+ */
+__rte_experimental
+int
+rte_dmadev_stop(uint16_t dev_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Close a DMA device. The device cannot be restarted after this call.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @return
+ *  - 0 on successfully closing device
+ *  - <0 on failure to close device
+ */
+__rte_experimental
+int
+rte_dmadev_close(uint16_t dev_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Reset a DMA device.
+ * Unlike a rte_dmadev_stop()/rte_dmadev_start() cycle, this is similar to a
+ * hard or soft reset of the device.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @return
+ *   - 0 on successfully resetting device.
+ *   - <0 on failure to reset device.
+ *   - (-ENOTSUP) if the device doesn't support this function.
+ */
+__rte_experimental
+int
+rte_dmadev_reset(uint16_t dev_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enqueue a fill operation onto the DMA device
+ *
+ * This queues up a fill operation to be performed by hardware, but does not
+ * trigger hardware to begin that operation.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ * @param pattern
+ *   The pattern to populate the destination buffer with.
+ * @param dst
+ *   The address of the destination buffer.
+ * @param len
+ *   The length of the destination buffer.
+ * @param op_handle
+ *   An opaque handle for this operation, may be returned when completed or
+ *   completed_error.
+ *
+ * @return
+ *   Number of operations enqueued, either 0 or 1
+ */
+__rte_experimental
+static inline int
+rte_dmadev_fill(uint16_t dev_id, uint64_t pattern, rte_iova_t dst,
+		uint32_t len, uintptr_t op_handle);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enqueue a copy operation onto the DMA device.
+ *
+ * This queues up a copy operation to be performed by hardware, but does not
+ * trigger hardware to begin that operation.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ * @param src
+ *   The address of the source buffer.
+ * @param dst
+ *   The address of the destination buffer.
+ * @param len
+ *   The length of the data to be copied.
+ * @param op_handle
+ *   An opaque handle for this operation, may be returned when completed or
+ *   completed_error.
+ *
+ * @return
+ *   Number of operations enqueued, either 0 or 1.
+ */
+__rte_experimental
+static inline int
+rte_dmadev_copy(uint16_t dev_id, rte_iova_t src, rte_iova_t dst,
+		uint32_t len, uintptr_t op_handle);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Add a fence to force ordering between operations
+ *
+ * This adds a fence to a sequence of operations to enforce ordering, such that
+ * all operations enqueued before the fence must be completed before operations
+ * after the fence.
+ * NOTE: Since this fence may be added as a flag to the last operation enqueued,
+ * this API may not function correctly when called immediately after an
+ * rte_dmadev_perform() call, i.e. before any new operations are enqueued.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @return
+ *   - 0: on successfully adding the fence.
+ *   - <0: on failure to add the fence.
+ */
+__rte_experimental
+static inline int
+rte_dmadev_fence(uint16_t dev_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Trigger hardware to begin performing enqueued operations
+ *
+ * This API writes the hardware "doorbell" to begin the operations previously
+ * enqueued by rte_dmadev_fill() or rte_dmadev_copy().
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @return
+ *   - 0: on successfully triggering the hardware.
+ *   - <0: on failure to trigger the hardware.
+ */
+__rte_experimental
+static inline int
+rte_dmadev_perform(uint16_t dev_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Returns the number of operations that have been successfully completed.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ * @param op_handle
+ *   Returns the opaque handle of the latest completed operation, as passed
+ *   to the fill or copy op.
+ *   NOTE: If handle_enable configuration option for the device was not set,
+ *   this parameter is ignored, and may be NULL.
+ *
+ * @return
+ *   -1 on device error, with rte_errno set appropriately and parameters
+ *   unmodified.
+ *   Otherwise number of successfully completed operations.
+ */
+__rte_experimental
+static inline int
+rte_dmadev_completed(uint16_t dev_id, uintptr_t *op_handle);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Returns the number of operations that failed to complete.
+ * NOTE: This API is meant to be used when rte_dmadev_completed() returns -1.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ * @param op_handle
+ *   Returns the opaque handle of the latest failed operation, as passed to
+ *   the fill or copy op.
+ *   NOTE: If handle_enable configuration option for the device was not set,
+ *   this parameter is ignored, and may be NULL.
+ *
+ * @return
+ *   The number of operations that failed to complete (due to some error,
+ *   e.g. hardware errors).
+ */
+__rte_experimental
+static inline uint16_t
+rte_dmadev_completed_error(uint16_t dev_id, uintptr_t *op_handle);
+
+/** Maximum name length for extended statistics counters */
+#define RTE_DMA_DEV_XSTATS_NAME_SIZE 64
+
+/**
+ * A name-key lookup element for extended statistics.
+ *
+ * This structure is used to map between names and ID numbers
+ * for extended dmadev statistics.
+ */
+struct rte_dmadev_xstats_name {
+	char name[RTE_DMA_DEV_XSTATS_NAME_SIZE];
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Retrieve names of extended statistics of a DMA device.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ * @param[out] xstats_names
+ *   Block of memory to insert names into. Must be at least size in capacity.
+ *   If set to NULL, function returns required capacity.
+ * @param size
+ *   Capacity of xstats_names (number of names).
+ * @return
+ *   - positive value lower or equal to size: success. The return value
+ *     is the number of entries filled in the stats table.
+ *   - positive value higher than size: error, the given statistics table
+ *     is too small. The return value corresponds to the size that should
+ *     be given to succeed. The entries in the table are not valid and
+ *     shall not be used by the caller.
+ *   - negative value on error:
+ *        -ENODEV for invalid *dev_id*
+ *        -ENOTSUP if the device doesn't support this function.
+ */
+int
+rte_dmadev_xstats_names_get(uint16_t dev_id,
+			    struct rte_dmadev_xstats_name *xstats_names,
+			    unsigned int size);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Retrieve extended statistics of a DMA device.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ * @param ids
+ *   The ID numbers of the stats to get. An ID is the position of the stat in
+ *   the list returned by rte_dmadev_xstats_names_get(), or can be obtained
+ *   by using rte_dmadev_get_xstats_by_name()
+ * @param[out] values
+ *   The values for each stat requested by ID.
+ * @param n
+ *   The number of stats requested
+ * @return
+ *   - positive value: number of stat entries filled into the values array
+ *   - negative value on error:
+ *        -ENODEV for invalid *dev_id*
+ *        -ENOTSUP if the device doesn't support this function.
+ */
+int
+rte_dmadev_xstats_get(uint16_t dev_id,
+		      const unsigned int ids[],
+		      uint64_t values[],
+		      unsigned int n);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Reset the values of the xstats of the selected component in the device.
+ *
+ * @param dev_id
+ *   The identifier of the device
+ * @param ids
+ *   Selects specific statistics to be reset. When NULL, all statistics
+ *   will be reset. If non-NULL, must point to array of at least
+ *   *nb_ids* size.
+ * @param nb_ids
+ *   The number of ids available from the *ids* array. Ignored when ids is NULL.
+ * @return
+ *   - zero: successfully reset the statistics to zero
+ *   - negative value on error:
+ *        -EINVAL invalid parameters
+ *        -ENOTSUP if not supported.
+ */
+int
+rte_dmadev_xstats_reset(uint16_t dev_id,
+			const uint32_t ids[],
+			uint32_t nb_ids);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Dump internal information about *dev_id* to the FILE* provided in *f*.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @param f
+ *   A pointer to a file for output.
+ *
+ * @return
+ *   - 0: on successful dump device.
+ *   - <0: on failure to dump device.
+ *   - (-ENOTSUP) if the device doesn't support this function.
+ */
+__rte_experimental
+int
+rte_dmadev_dump(uint16_t dev_id, FILE *f);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Trigger the dmadev self test.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @return
+ *   - 0: Selftest successful.
+ *   - -ENOTSUP if the device doesn't support selftest
+ *   - other values < 0 on failure.
+ */
+int
+rte_dmadev_selftest(uint16_t dev_id);
+
+
+struct rte_dmadev_ops;
+
+#define RTE_DMADEV_NAME_MAX_LEN	(64)
+/**< @internal Max length of name of DMA PMD */
+
+/** @internal
+ * The data structure associated with each DMA device.
+ */
+struct rte_dmadev {
+	/** Device ID for this instance */
+	uint16_t dev_id;
+	/** Functions exported by PMD */
+	const struct rte_dmadev_ops *dev_ops;
+	/** Device info. supplied during device initialization */
+	struct rte_device *device;
+	/** Driver info. supplied by probing */
+	const char *driver_name;
+
+	/** Device name */
+	char name[RTE_DMADEV_NAME_MAX_LEN];
+} __rte_cache_aligned;
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_DMADEV_H_ */
diff --git a/lib/dmadev/rte_dmadev_pmd.h b/lib/dmadev/rte_dmadev_pmd.h
new file mode 100644
index 0000000..faa3909
--- /dev/null
+++ b/lib/dmadev/rte_dmadev_pmd.h
@@ -0,0 +1,384 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 HiSilicon Limited.
+ */
+
+#ifndef _RTE_DMADEV_PMD_H_
+#define _RTE_DMADEV_PMD_H_
+
+/** @file
+ * RTE DMA PMD APIs
+ *
+ * @note
+ * Driver facing APIs for a DMA device. These are not to be called directly by
+ * any application.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <string.h>
+
+#include <rte_dev.h>
+#include <rte_common.h>
+
+#include "rte_dmadev.h"
+
+/**
+ * Get the rte_dmadev structure device pointer for the named device.
+ *
+ * @param name
+ *   device name to select the device structure.
+ *
+ * @return
+ *   - The rte_dmadev structure pointer for the given device name.
+ */
+struct rte_dmadev *
+rte_dmadev_pmd_get_named_dev(const char *name);
+
+/**
+ * Definitions of all functions exported by a driver through the
+ * generic structure of type *dmadev_ops* supplied in the *rte_dmadev*
+ * structure associated with a device.
+ */
+
+/**
+ * Get device information of a device.
+ *
+ * @param dev
+ *   DMA device pointer
+ * @param dev_info
+ *   DMA device information structure
+ *
+ * @return
+ *   Returns 0 on success, negative error code on failure
+ */
+typedef int (*dmadev_info_get_t)(struct rte_dmadev *dev,
+				 struct rte_dmadev_info *dev_info);
+
+/**
+ * Configure a device.
+ *
+ * @param dev
+ *   DMA device pointer
+ * @param config
+ *   DMA device configuration structure
+ *
+ * @return
+ *   Returns 0 on success
+ */
+typedef int (*dmadev_configure_t)(const struct rte_dmadev *dev,
+				  struct rte_dmadev_conf *config);
+
+/**
+ * Start a configured device.
+ *
+ * @param dev
+ *   DMA device pointer
+ *
+ * @return
+ *   Returns 0 on success
+ */
+typedef int (*dmadev_start_t)(struct rte_dmadev *dev);
+
+/**
+ * Stop a configured device.
+ *
+ * @param dev
+ *   DMA device pointer
+ *
+ * @return
+ *   Return 0 on success
+ */
+typedef int (*dmadev_stop_t)(struct rte_dmadev *dev);
+
+/**
+ * Close a configured device.
+ *
+ * @param dev
+ *   DMA device pointer
+ *
+ * @return
+ *   Return 0 on success
+ */
+typedef int (*dmadev_close_t)(struct rte_dmadev *dev);
+
+/**
+ * Reset a configured device.
+ *
+ * @param dev
+ *   DMA device pointer
+ *
+ * @return
+ *   0 for success
+ *   !0 for failure
+ */
+typedef int (*dmadev_reset_t)(struct rte_dmadev *dev);
+
+/**
+ * Enqueue a fill operation onto the DMA device
+ *
+ * This queues up a fill operation to be performed by hardware, but does not
+ * trigger hardware to begin that operation.
+ *
+ * @param dev
+ *   DMA device pointer.
+ * @param pattern
+ *   The pattern to populate the destination buffer with.
+ * @param dst
+ *   The address of the destination buffer.
+ * @param len
+ *   The length of the destination buffer.
+ * @param op_handle
+ *   An opaque handle for this operation, may be returned when completed or
+ *   completed_error.
+ *
+ * @return
+ *   Number of operations enqueued, either 0 or 1
+ */
+typedef int (*dmadev_fill_t)(struct rte_dmadev *dev,
+			     uint64_t pattern, rte_iova_t dst,
+			     uint32_t len, uintptr_t op_handle);
+
+/**
+ * Enqueue a copy operation onto the DMA device.
+ *
+ * This queues up a copy operation to be performed by hardware, but does not
+ * trigger hardware to begin that operation.
+ *
+ * @param dev
+ *   DMA device pointer.
+ * @param src
+ *   The address of the source buffer.
+ * @param dst
+ *   The address of the destination buffer.
+ * @param len
+ *   The length of the data to be copied.
+ * @param op_handle
+ *   An opaque handle for this operation, may be returned when completed or
+ *   completed_error.
+ *
+ * @return
+ *   Number of operations enqueued, either 0 or 1.
+ */
+typedef int (*dmadev_copy_t)(struct rte_dmadev *dev,
+			     rte_iova_t src, rte_iova_t dst,
+			     uint32_t len, uintptr_t op_handle);
+
+/**
+ * Add a fence to force ordering between operations
+ *
+ * This adds a fence to a sequence of operations to enforce ordering, such that
+ * all operations enqueued before the fence must be completed before operations
+ * after the fence.
+ * NOTE: Since this fence may be added as a flag to the last operation enqueued,
+ * this API may not function correctly when called immediately after an
+ * rte_dmadev_perform() call, i.e. before any new operations are enqueued.
+ *
+ * @param dev
+ *   DMA device pointer.
+ *
+ * @return
+ *   - 0: on successfully adding the fence.
+ *   - <0: on failure to add the fence.
+ */
+typedef int (*dmadev_fence_t)(struct rte_dmadev *dev);
+
+/**
+ * Trigger hardware to begin performing enqueued operations
+ *
+ * This API writes the hardware "doorbell" to begin the operations previously
+ * enqueued by rte_dmadev_fill() or rte_dmadev_copy().
+ *
+ * @param dev
+ *   DMA device pointer.
+ *
+ * @return
+ *   - 0: on successfully triggering the hardware.
+ *   - <0: on failure to trigger the hardware.
+ */
+typedef int (*dmadev_perform_t)(struct rte_dmadev *dev);
+
+/**
+ * Returns the number of operations that have been successfully completed.
+ *
+ * @param dev
+ *   DMA device pointer.
+ * @param op_handle
+ *   Returns the opaque handle of the latest completed operation, as passed
+ *   to the fill or copy op.
+ *   NOTE: If handle_enable configuration option for the device was not set,
+ *   this parameter is ignored, and may be NULL.
+ *
+ * @return
+ *   -1 on device error, with rte_errno set appropriately and parameters
+ *   unmodified.
+ *   Otherwise number of successfully completed operations.
+ */
+typedef int (*dmadev_completed_t)(struct rte_dmadev *dev, uintptr_t *op_handle);
+
+/**
+ * Returns the number of operations that failed to complete.
+ *
+ * @param dev
+ *   DMA device pointer.
+ * @param op_handle
+ *   Returns the opaque handle of the latest failed operation, as passed to
+ *   the fill or copy op.
+ *   NOTE: If handle_enable configuration option for the device was not set,
+ *   this parameter is ignored, and may be NULL.
+ *
+ * @return
+ *   The number of operations that failed to complete (due to some error,
+ *   e.g. hardware errors).
+ */
+typedef int (*dmadev_completed_error_t)(struct rte_dmadev *dev,
+					uintptr_t *op_handle);
+
+/**
+ * Retrieve a set of statistics from device.
+ * Note: Being a DMA device, the stats are specific to the device being
+ * implemented thus represented as xstats.
+ *
+ * @param dev
+ *   DMA device pointer
+ * @param ids
+ *   The stat ids to retrieve
+ * @param values
+ *   The returned stat values
+ * @param n
+ *   The number of id values and entries in the values array
+ *
+ * @return
+ *   The number of stat values successfully filled into the values array
+ */
+typedef int (*dmadev_xstats_get_t)(const struct rte_dmadev *dev,
+		const unsigned int ids[], uint64_t values[], unsigned int n);
+
+/**
+ * Resets the statistic values in xstats for the device.
+ */
+typedef int (*dmadev_xstats_reset_t)(struct rte_dmadev *dev,
+		const uint32_t ids[],
+		uint32_t nb_ids);
+
+/**
+ * Get names of extended stats of a DMA device
+ *
+ * @param dev
+ *   DMA device pointer
+ * @param xstats_names
+ *   Array of name values to be filled in
+ * @param size
+ *   Number of values in the xstats_names array
+ *
+ * @return
+ *   When size >= the number of stats, return the number of stat values filled
+ *   into the array.
+ *   When size < the number of available stats, return the number of stats
+ *   values, and do not fill in any data into xstats_names.
+ */
+typedef int (*dmadev_xstats_get_names_t)(const struct rte_dmadev *dev,
+		struct rte_dmadev_xstats_name *xstats_names,
+		unsigned int size);
+
+/**
+ * Dump internal information
+ *
+ * @param dev
+ *   DMA device pointer
+ * @param f
+ *   A pointer to a file for output
+ *
+ * @return
+ *   0 on success,
+ *   non-zero on error.
+ *
+ */
+typedef int (*dmadev_dump_t)(struct rte_dmadev *dev, FILE *f);
+
+/**
+ * Start dmadev selftest
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ *
+ * @return
+ *   Return 0 on success
+ */
+typedef int (*dmadev_selftest_t)(uint16_t dev_id);
+
+/** DMA device operations function pointer table */
+struct rte_dmadev_ops {
+	/** Get device info. */
+	dmadev_info_get_t dev_info_get;
+	/** Configure device. */
+	dmadev_configure_t dev_configure;
+	/** Start device. */
+	dmadev_start_t dev_start;
+	/** Stop device. */
+	dmadev_stop_t dev_stop;
+	/** Close device. */
+	dmadev_close_t dev_close;
+	/** Reset device. */
+	dmadev_reset_t dev_reset;
+
+	/** Enqueue a fill operation onto the DMA device. */
+	dmadev_fill_t fill;
+	/** Enqueue a copy operation onto the DMA device. */
+	dmadev_copy_t copy;
+	/** Add a fence to force ordering between operations. */
+	dmadev_fence_t fence;
+	/** Trigger hardware to begin performing enqueued operations. */
+	dmadev_perform_t perform;
+	/** Return the number of operations that completed successfully. */
+	dmadev_completed_t completed;
+	/** Return the number of operations that failed to complete. */
+	dmadev_completed_error_t completed_error;
+
+	/** Get extended device statistics. */
+	dmadev_xstats_get_t xstats_get;
+	/** Get names of extended stats. */
+	dmadev_xstats_get_names_t xstats_get_names;
+	/** Reset the statistics values in xstats. */
+	dmadev_xstats_reset_t xstats_reset;
+
+	/** Dump internal information. */
+	dmadev_dump_t dump;
+
+	/** Device selftest function. */
+	dmadev_selftest_t dev_selftest;
+};
+
+/**
+ * Allocates a new dmadev slot for a DMA device and returns the pointer
+ * to that slot for the driver to use.
+ *
+ * @param name
+ *   Unique identifier name for each device
+ * @param socket_id
+ *   Socket to allocate resources on.
+ *
+ * @return
+ *   - Pointer to the slot allocated for the new device.
+ */
+struct rte_dmadev *
+rte_dmadev_pmd_allocate(const char *name, int socket_id);
+
+/**
+ * Release the specified dmadev device.
+ *
+ * @param dev
+ *   Pointer to the *rte_dmadev* structure to release.
+ *
+ * @return
+ *   - 0 on success, negative on error
+ */
+int
+rte_dmadev_pmd_release(struct rte_dmadev *dev);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_DMADEV_PMD_H_ */
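
For illustration only, here is a rough sketch of how a driver might populate a subset of the ops table above and how a caller could dispatch through it. It is not part of the patch. The layout of struct rte_dmadev is not shown in this header, so the two fields used below (ops, dev_private) are assumptions, as are all demo_* names and the trimmed-down struct demo_dmadev_ops. Only the three typedef signatures are taken from the header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct rte_dmadev;

/* Signatures copied from the header in this patch. */
typedef int (*dmadev_completed_t)(struct rte_dmadev *dev, uintptr_t *op_handle);
typedef int (*dmadev_completed_error_t)(struct rte_dmadev *dev,
					uintptr_t *op_handle);
typedef int (*dmadev_dump_t)(struct rte_dmadev *dev, FILE *f);

/* Hypothetical, trimmed-down subset of struct rte_dmadev_ops. */
struct demo_dmadev_ops {
	dmadev_completed_t completed;
	dmadev_completed_error_t completed_error;
	dmadev_dump_t dump;
};

/* Assumed device layout: an ops pointer plus driver-private data. */
struct rte_dmadev {
	const struct demo_dmadev_ops *ops;
	void *dev_private;
};

/* Hypothetical private state of a toy software driver. */
struct demo_priv {
	int nb_ok;          /* operations completed since the last poll */
	int nb_err;         /* operations failed since the last poll */
	uintptr_t last_ok;  /* handle of the latest completed operation */
	uintptr_t last_err; /* handle of the latest failed operation */
};

static int
demo_completed(struct rte_dmadev *dev, uintptr_t *op_handle)
{
	struct demo_priv *p = dev->dev_private;
	int n = p->nb_ok;

	/* op_handle may be NULL when handles are not enabled. */
	if (op_handle != NULL && n > 0)
		*op_handle = p->last_ok;
	p->nb_ok = 0;
	return n;
}

static int
demo_completed_error(struct rte_dmadev *dev, uintptr_t *op_handle)
{
	struct demo_priv *p = dev->dev_private;
	int n = p->nb_err;

	if (op_handle != NULL && n > 0)
		*op_handle = p->last_err;
	p->nb_err = 0;
	return n;
}

static int
demo_dump(struct rte_dmadev *dev, FILE *f)
{
	const struct demo_priv *p = dev->dev_private;

	return fprintf(f, "ok=%d err=%d\n", p->nb_ok, p->nb_err) < 0 ? -1 : 0;
}

/* The driver fills in only the callbacks it supports. */
static const struct demo_dmadev_ops demo_ops = {
	.completed = demo_completed,
	.completed_error = demo_completed_error,
	.dump = demo_dump,
};

/* Library-side wrapper: validate, then dispatch through the ops table. */
static int
demo_dev_completed(struct rte_dmadev *dev, uintptr_t *op_handle)
{
	assert(dev != NULL && dev->ops != NULL);
	if (dev->ops->completed == NULL)
		return -1; /* callback not implemented by this driver */
	return dev->ops->completed(dev, op_handle);
}
```

A caller would point dev->ops at demo_ops, poll demo_dev_completed() for successful operations, and then poll completed_error separately. Whether the real library keeps the fast-path callbacks behind a pointer like this or inlines them into the device structure is a design question this sketch does not settle.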