[v5,09/16] dma/idxd: add data-path job submission functions

Message ID 20210917152437.3270330-10-kevin.laatz@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series: add dmadev driver for idxd devices

Checks

Context         Check     Description
ci/checkpatch   success   coding style OK

Commit Message

Kevin Laatz Sept. 17, 2021, 3:24 p.m. UTC
  Add data path functions for enqueuing and submitting operations to DSA
devices.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Reviewed-by: Conor Walsh <conor.walsh@intel.com>
---
 doc/guides/dmadevs/idxd.rst      |  64 +++++++++++++++
 drivers/dma/idxd/idxd_common.c   | 136 +++++++++++++++++++++++++++++++
 drivers/dma/idxd/idxd_internal.h |   5 ++
 drivers/dma/idxd/meson.build     |   1 +
 4 files changed, 206 insertions(+)
  

Comments

Bruce Richardson Sept. 20, 2021, 10:30 a.m. UTC | #1
On Fri, Sep 17, 2021 at 03:24:30PM +0000, Kevin Laatz wrote:
> Add data path functions for enqueuing and submitting operations to DSA
> devices.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Reviewed-by: Conor Walsh <conor.walsh@intel.com>
> ---
>  doc/guides/dmadevs/idxd.rst      |  64 +++++++++++++++
>  drivers/dma/idxd/idxd_common.c   | 136 +++++++++++++++++++++++++++++++
>  drivers/dma/idxd/idxd_internal.h |   5 ++
>  drivers/dma/idxd/meson.build     |   1 +
>  4 files changed, 206 insertions(+)
> 
> diff --git a/doc/guides/dmadevs/idxd.rst b/doc/guides/dmadevs/idxd.rst
> index a603c5dd22..7835461a22 100644
> --- a/doc/guides/dmadevs/idxd.rst
> +++ b/doc/guides/dmadevs/idxd.rst
> @@ -153,3 +153,67 @@ The following code shows how the device is configured in
>  
>  Once configured, the device can then be made ready for use by calling the
>  ``rte_dma_start()`` API.
> +
> +Performing Data Copies
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +To perform data copies using IDXD dmadev devices, descriptors should be enqueued
> +using the ``rte_dma_copy()`` API. The HW can be triggered to perform the copy
> +in two ways, either via a ``RTE_DMA_OP_FLAG_SUBMIT`` flag or by calling
> +``rte_dma_submit()``. Once copies have been completed, the completion will
> +be reported back when the application calls ``rte_dma_completed()`` or
> +``rte_dma_completed_status()``. The latter will also report the status of each
> +completed operation.
> +
> +The ``rte_dma_copy()`` function enqueues a single copy to the device ring for
> +copying at a later point. The parameters to that function include the IOVA addresses
> +of both the source and destination buffers, as well as the length of the copy.
> +
> +If the ``RTE_DMA_OP_FLAG_SUBMIT`` flag is set when calling ``rte_dma_copy()``,
> +the device hardware will be informed of the enqueued elements. Alternatively,
> +if the flag is not set, the application needs to call the ``rte_dma_submit()``
> +function to notify the device hardware. Once the device hardware is informed
> +of the elements enqueued on the ring, the device will begin to process them.
> +It is expected that, for efficiency reasons, a burst of operations will be
> +enqueued to the device via multiple enqueue calls between calls to the
> +``rte_dma_submit()`` function.
> +
> +The following code demonstrates how to enqueue a burst of copies to the
> +device and start the hardware processing of them:
> +
> +.. code-block:: C
> +
> +   struct rte_mbuf *srcs[COMP_BURST_SZ], *dsts[COMP_BURST_SZ];
> +   unsigned int i, j;
> +
> +   for (i = 0; i < RTE_DIM(srcs); i++) {
> +      uint64_t *src_data;
> +
> +      srcs[i] = rte_pktmbuf_alloc(pool);
> +      dsts[i] = rte_pktmbuf_alloc(pool);
> +      if (srcs[i] == NULL || dsts[i] == NULL) {
> +         PRINT_ERR("Error allocating buffers\n");
> +         return -1;
> +      }
> +      src_data = rte_pktmbuf_mtod(srcs[i], uint64_t *);
> +
> +      for (j = 0; j < COPY_LEN/sizeof(uint64_t); j++)
> +         src_data[j] = rte_rand();
> +
> +      if (rte_dma_copy(dev_id, vchan, srcs[i]->buf_iova + srcs[i]->data_off,
> +            dsts[i]->buf_iova + dsts[i]->data_off, COPY_LEN, 0) < 0) {
> +         PRINT_ERR("Error with rte_dma_copy for buffer %u\n", i);
> +         return -1;
> +      }
> +   }
> +   rte_dma_submit(dev_id, vchan);
> +

I think this code block is larger than necessary, because it shows buffer
allocation and initialization rather than just the basics of copy() and
submit() APIs. Furthermore, rather than calling out the generic API use in
the idxd-specific docs, can we just include a reference to the dmadev
documentation?

/Bruce
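
As a rough illustration of the trimmed-down usage being suggested here, the
generic dmadev sequence reduces to enqueue, submit and completion polling.
The snippet below is only a sketch against the public dmadev API (declared in
``rte_dmadev.h``); ``dev_id``, ``vchan``, the ``src_iova``/``dst_iova`` arrays,
``BURST_SZ``, ``COPY_LEN`` and the ``PRINT_ERR()`` macro are illustrative
placeholders, not code taken from the patch:

   unsigned int i;
   uint16_t last_idx;
   bool has_error = false;

   /* enqueue a burst of copies without submitting each one individually */
   for (i = 0; i < BURST_SZ; i++) {
      if (rte_dma_copy(dev_id, vchan, src_iova[i], dst_iova[i], COPY_LEN, 0) < 0) {
         PRINT_ERR("Error enqueuing copy %u\n", i);
         return -1;
      }
   }

   /* one doorbell write covers the whole burst */
   rte_dma_submit(dev_id, vchan);

   /* later: poll for up to BURST_SZ completions; rte_dma_completed_status()
    * could be used instead to also retrieve a per-operation status code */
   uint16_t completed = rte_dma_completed(dev_id, vchan, BURST_SZ,
         &last_idx, &has_error);

Passing ``RTE_DMA_OP_FLAG_SUBMIT`` on the final enqueue would remove the need
for the separate ``rte_dma_submit()`` call.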
  
Chengwen Feng Sept. 22, 2021, 3:22 a.m. UTC | #2
On 2021/9/17 23:24, Kevin Laatz wrote:
> Add data path functions for enqueuing and submitting operations to DSA
> devices.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Reviewed-by: Conor Walsh <conor.walsh@intel.com>
> ---
>  doc/guides/dmadevs/idxd.rst      |  64 +++++++++++++++
>  drivers/dma/idxd/idxd_common.c   | 136 +++++++++++++++++++++++++++++++
>  drivers/dma/idxd/idxd_internal.h |   5 ++
>  drivers/dma/idxd/meson.build     |   1 +
>  4 files changed, 206 insertions(+)
> 

[snip]

> +
> +static __rte_always_inline int
> +__idxd_write_desc(struct rte_dma_dev *dev,
> +		const uint32_t op_flags,
> +		const rte_iova_t src,
> +		const rte_iova_t dst,
> +		const uint32_t size,
> +		const uint32_t flags)
> +{
> +	struct idxd_dmadev *idxd = dev->dev_private;
> +	uint16_t mask = idxd->desc_ring_mask;
> +	uint16_t job_id = idxd->batch_start + idxd->batch_size;
> +	/* we never wrap batches, so we only mask the start and allow start+size to overflow */
> +	uint16_t write_idx = (idxd->batch_start & mask) + idxd->batch_size;
> +
> +	/* first check batch ring space then desc ring space */
> +	if ((idxd->batch_idx_read == 0 && idxd->batch_idx_write == idxd->max_batches) ||
> +			idxd->batch_idx_write + 1 == idxd->batch_idx_read)
> +		return -1;
> +	if (((write_idx + 1) & mask) == (idxd->ids_returned & mask))
> +		return -1;

Please return -ENOSPC when the ring is full.

> +
> +	/* write desc. Note: descriptors don't wrap, but the completion address does */
> +	const uint64_t op_flags64 = (uint64_t)(op_flags | IDXD_FLAG_COMPLETION_ADDR_VALID) << 32;
> +	const uint64_t comp_addr = __desc_idx_to_iova(idxd, write_idx & mask);
> +	_mm256_store_si256((void *)&idxd->desc_ring[write_idx],
> +			_mm256_set_epi64x(dst, src, comp_addr, op_flags64));
> +	_mm256_store_si256((void *)&idxd->desc_ring[write_idx].size,
> +			_mm256_set_epi64x(0, 0, 0, size));
> +
> +	idxd->batch_size++;
> +
> +	rte_prefetch0_write(&idxd->desc_ring[write_idx + 1]);
> +
> +	if (flags & RTE_DMA_OP_FLAG_SUBMIT)
> +		__submit(idxd);
> +
> +	return job_id;
> +}
> +
> +int
> +idxd_enqueue_copy(struct rte_dma_dev *dev, uint16_t qid __rte_unused, rte_iova_t src,
> +		rte_iova_t dst, unsigned int length, uint64_t flags)
> +{
> +	/* we can take advantage of the fact that the fence flag in dmadev and DSA are the same,
> +	 * but check it at compile time to be sure.
> +	 */
> +	RTE_BUILD_BUG_ON(RTE_DMA_OP_FLAG_FENCE != IDXD_FLAG_FENCE);
> +	uint32_t memmove = (idxd_op_memmove << IDXD_CMD_OP_SHIFT) |
> +			IDXD_FLAG_CACHE_CONTROL | (flags & IDXD_FLAG_FENCE);
> +	return __idxd_write_desc(dev, memmove, src, dst, length, flags);
> +}
> +

[snip]
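
For reference, the change requested above would amount to returning ``-ENOSPC``
(from ``<errno.h>``) instead of ``-1`` from the two ring-full checks in
``__idxd_write_desc()``. A sketch of that adjustment only, not code from any
particular revision of the series:

	/* first check batch ring space then desc ring space */
	if ((idxd->batch_idx_read == 0 && idxd->batch_idx_write == idxd->max_batches) ||
			idxd->batch_idx_write + 1 == idxd->batch_idx_read)
		return -ENOSPC;
	if (((write_idx + 1) & mask) == (idxd->ids_returned & mask))
		return -ENOSPC;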
  

Patch

diff --git a/doc/guides/dmadevs/idxd.rst b/doc/guides/dmadevs/idxd.rst
index a603c5dd22..7835461a22 100644
--- a/doc/guides/dmadevs/idxd.rst
+++ b/doc/guides/dmadevs/idxd.rst
@@ -153,3 +153,67 @@  The following code shows how the device is configured in
 
 Once configured, the device can then be made ready for use by calling the
 ``rte_dma_start()`` API.
+
+Performing Data Copies
+~~~~~~~~~~~~~~~~~~~~~~~
+
+To perform data copies using IDXD dmadev devices, descriptors should be enqueued
+using the ``rte_dma_copy()`` API. The HW can be triggered to perform the copy
+in two ways, either via a ``RTE_DMA_OP_FLAG_SUBMIT`` flag or by calling
+``rte_dma_submit()``. Once copies have been completed, the completion will
+be reported back when the application calls ``rte_dma_completed()`` or
+``rte_dma_completed_status()``. The latter will also report the status of each
+completed operation.
+
+The ``rte_dma_copy()`` function enqueues a single copy to the device ring for
+copying at a later point. The parameters to that function include the IOVA addresses
+of both the source and destination buffers, as well as the length of the copy.
+
+If the ``RTE_DMA_OP_FLAG_SUBMIT`` flag is set when calling ``rte_dma_copy()``,
+the device hardware will be informed of the enqueued elements. Alternatively,
+if the flag is not set, the application needs to call the ``rte_dma_submit()``
+function to notify the device hardware. Once the device hardware is informed
+of the elements enqueued on the ring, the device will begin to process them.
+It is expected that, for efficiency reasons, a burst of operations will be
+enqueued to the device via multiple enqueue calls between calls to the
+``rte_dma_submit()`` function.
+
+The following code demonstrates how to enqueue a burst of copies to the
+device and start the hardware processing of them:
+
+.. code-block:: C
+
+   struct rte_mbuf *srcs[COMP_BURST_SZ], *dsts[COMP_BURST_SZ];
+   unsigned int i, j;
+
+   for (i = 0; i < RTE_DIM(srcs); i++) {
+      uint64_t *src_data;
+
+      srcs[i] = rte_pktmbuf_alloc(pool);
+      dsts[i] = rte_pktmbuf_alloc(pool);
+      if (srcs[i] == NULL || dsts[i] == NULL) {
+         PRINT_ERR("Error allocating buffers\n");
+         return -1;
+      }
+      src_data = rte_pktmbuf_mtod(srcs[i], uint64_t *);
+
+      for (j = 0; j < COPY_LEN/sizeof(uint64_t); j++)
+         src_data[j] = rte_rand();
+
+      if (rte_dma_copy(dev_id, vchan, srcs[i]->buf_iova + srcs[i]->data_off,
+            dsts[i]->buf_iova + dsts[i]->data_off, COPY_LEN, 0) < 0) {
+         PRINT_ERR("Error with rte_dma_copy for buffer %u\n", i);
+         return -1;
+      }
+   }
+   rte_dma_submit(dev_id, vchan);
+
+Filling an Area of Memory
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The IDXD driver also has support for the ``fill`` operation, where an area
+of memory is overwritten, or filled, with a short pattern of data.
+Fill operations can be performed in much the same way as copy operations
+described above, just using the ``rte_dma_fill()`` function rather than the
+``rte_dma_copy()`` function.
diff --git a/drivers/dma/idxd/idxd_common.c b/drivers/dma/idxd/idxd_common.c
index 2c222708cf..b01edeab07 100644
--- a/drivers/dma/idxd/idxd_common.c
+++ b/drivers/dma/idxd/idxd_common.c
@@ -2,14 +2,144 @@ 
  * Copyright 2021 Intel Corporation
  */
 
+#include <x86intrin.h>
+
 #include <rte_dmadev_pmd.h>
 #include <rte_malloc.h>
 #include <rte_common.h>
+#include <rte_prefetch.h>
 
 #include "idxd_internal.h"
 
 #define IDXD_PMD_NAME_STR "dmadev_idxd"
 
+static __rte_always_inline rte_iova_t
+__desc_idx_to_iova(struct idxd_dmadev *idxd, uint16_t n)
+{
+	return idxd->desc_iova + (n * sizeof(struct idxd_hw_desc));
+}
+
+static __rte_always_inline void
+__idxd_movdir64b(volatile void *dst, const struct idxd_hw_desc *src)
+{
+	asm volatile (".byte 0x66, 0x0f, 0x38, 0xf8, 0x02"
+			:
+			: "a" (dst), "d" (src)
+			: "memory");
+}
+
+static __rte_always_inline void
+__submit(struct idxd_dmadev *idxd)
+{
+	rte_prefetch1(&idxd->batch_comp_ring[idxd->batch_idx_read]);
+
+	if (idxd->batch_size == 0)
+		return;
+
+	/* write completion to batch comp ring */
+	rte_iova_t comp_addr = idxd->batch_iova +
+			(idxd->batch_idx_write * sizeof(struct idxd_completion));
+
+	if (idxd->batch_size == 1) {
+		/* submit batch directly */
+		struct idxd_hw_desc desc =
+				idxd->desc_ring[idxd->batch_start & idxd->desc_ring_mask];
+		desc.completion = comp_addr;
+		desc.op_flags |= IDXD_FLAG_REQUEST_COMPLETION;
+		_mm_sfence(); /* fence before writing desc to device */
+		__idxd_movdir64b(idxd->portal, &desc);
+	} else {
+		const struct idxd_hw_desc batch_desc = {
+				.op_flags = (idxd_op_batch << IDXD_CMD_OP_SHIFT) |
+				IDXD_FLAG_COMPLETION_ADDR_VALID |
+				IDXD_FLAG_REQUEST_COMPLETION,
+				.desc_addr = __desc_idx_to_iova(idxd,
+						idxd->batch_start & idxd->desc_ring_mask),
+				.completion = comp_addr,
+				.size = idxd->batch_size,
+		};
+		_mm_sfence(); /* fence before writing desc to device */
+		__idxd_movdir64b(idxd->portal, &batch_desc);
+	}
+
+	if (++idxd->batch_idx_write > idxd->max_batches)
+		idxd->batch_idx_write = 0;
+
+	idxd->batch_start += idxd->batch_size;
+	idxd->batch_size = 0;
+	idxd->batch_idx_ring[idxd->batch_idx_write] = idxd->batch_start;
+	_mm256_store_si256((void *)&idxd->batch_comp_ring[idxd->batch_idx_write],
+			_mm256_setzero_si256());
+}
+
+static __rte_always_inline int
+__idxd_write_desc(struct rte_dma_dev *dev,
+		const uint32_t op_flags,
+		const rte_iova_t src,
+		const rte_iova_t dst,
+		const uint32_t size,
+		const uint32_t flags)
+{
+	struct idxd_dmadev *idxd = dev->dev_private;
+	uint16_t mask = idxd->desc_ring_mask;
+	uint16_t job_id = idxd->batch_start + idxd->batch_size;
+	/* we never wrap batches, so we only mask the start and allow start+size to overflow */
+	uint16_t write_idx = (idxd->batch_start & mask) + idxd->batch_size;
+
+	/* first check batch ring space then desc ring space */
+	if ((idxd->batch_idx_read == 0 && idxd->batch_idx_write == idxd->max_batches) ||
+			idxd->batch_idx_write + 1 == idxd->batch_idx_read)
+		return -1;
+	if (((write_idx + 1) & mask) == (idxd->ids_returned & mask))
+		return -1;
+
+	/* write desc. Note: descriptors don't wrap, but the completion address does */
+	const uint64_t op_flags64 = (uint64_t)(op_flags | IDXD_FLAG_COMPLETION_ADDR_VALID) << 32;
+	const uint64_t comp_addr = __desc_idx_to_iova(idxd, write_idx & mask);
+	_mm256_store_si256((void *)&idxd->desc_ring[write_idx],
+			_mm256_set_epi64x(dst, src, comp_addr, op_flags64));
+	_mm256_store_si256((void *)&idxd->desc_ring[write_idx].size,
+			_mm256_set_epi64x(0, 0, 0, size));
+
+	idxd->batch_size++;
+
+	rte_prefetch0_write(&idxd->desc_ring[write_idx + 1]);
+
+	if (flags & RTE_DMA_OP_FLAG_SUBMIT)
+		__submit(idxd);
+
+	return job_id;
+}
+
+int
+idxd_enqueue_copy(struct rte_dma_dev *dev, uint16_t qid __rte_unused, rte_iova_t src,
+		rte_iova_t dst, unsigned int length, uint64_t flags)
+{
+	/* we can take advantage of the fact that the fence flag in dmadev and DSA are the same,
+	 * but check it at compile time to be sure.
+	 */
+	RTE_BUILD_BUG_ON(RTE_DMA_OP_FLAG_FENCE != IDXD_FLAG_FENCE);
+	uint32_t memmove = (idxd_op_memmove << IDXD_CMD_OP_SHIFT) |
+			IDXD_FLAG_CACHE_CONTROL | (flags & IDXD_FLAG_FENCE);
+	return __idxd_write_desc(dev, memmove, src, dst, length, flags);
+}
+
+int
+idxd_enqueue_fill(struct rte_dma_dev *dev, uint16_t qid __rte_unused, uint64_t pattern,
+		rte_iova_t dst, unsigned int length, uint64_t flags)
+{
+	uint32_t fill = (idxd_op_fill << IDXD_CMD_OP_SHIFT) |
+			IDXD_FLAG_CACHE_CONTROL | (flags & IDXD_FLAG_FENCE);
+	return __idxd_write_desc(dev, fill, pattern, dst, length, flags);
+}
+
+int
+idxd_submit(struct rte_dma_dev *dev, uint16_t qid __rte_unused)
+{
+	__submit(dev->dev_private);
+	return 0;
+}
+
 int
 idxd_dump(const struct rte_dma_dev *dev, FILE *f)
 {
@@ -141,6 +271,12 @@  idxd_dmadev_create(const char *name, struct rte_device *dev,
 	dmadev->dev_ops = ops;
 	dmadev->device = dev;
 
+	dmadev->copy = idxd_enqueue_copy;
+	dmadev->fill = idxd_enqueue_fill;
+	dmadev->submit = idxd_submit;
+	dmadev->completed = idxd_completed;
+	dmadev->completed_status = idxd_completed_status;
+
 	idxd = rte_malloc_socket(NULL, sizeof(struct idxd_dmadev), 0, dev->numa_node);
 	if (idxd == NULL) {
 		IDXD_PMD_ERR("Unable to allocate memory for device");
diff --git a/drivers/dma/idxd/idxd_internal.h b/drivers/dma/idxd/idxd_internal.h
index fdd018ca35..b66c2d0182 100644
--- a/drivers/dma/idxd/idxd_internal.h
+++ b/drivers/dma/idxd/idxd_internal.h
@@ -88,5 +88,10 @@  int idxd_vchan_setup(struct rte_dma_dev *dev, uint16_t vchan,
 		const struct rte_dma_vchan_conf *qconf, uint32_t qconf_sz);
 int idxd_info_get(const struct rte_dma_dev *dev, struct rte_dma_info *dev_info,
 		uint32_t size);
+int idxd_enqueue_copy(struct rte_dma_dev *dev, uint16_t qid, rte_iova_t src,
+		rte_iova_t dst, unsigned int length, uint64_t flags);
+int idxd_enqueue_fill(struct rte_dma_dev *dev, uint16_t qid, uint64_t pattern,
+		rte_iova_t dst, unsigned int length, uint64_t flags);
+int idxd_submit(struct rte_dma_dev *dev, uint16_t qid);
 
 #endif /* _IDXD_INTERNAL_H_ */
diff --git a/drivers/dma/idxd/meson.build b/drivers/dma/idxd/meson.build
index 36dbd3e518..acb1b10618 100644
--- a/drivers/dma/idxd/meson.build
+++ b/drivers/dma/idxd/meson.build
@@ -6,6 +6,7 @@  if is_windows
 endif
 
 deps += ['bus_pci']
+cflags += '-mavx2' # all platforms with idxd HW support AVX
 sources = files(
         'idxd_bus.c',
         'idxd_common.c',
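
To complement the "Filling an Area of Memory" section added by the patch above,
the following is a minimal usage sketch of ``rte_dma_fill()`` against the
generic dmadev API; ``dev_id``, ``vchan``, ``dst_iova``, ``FILL_LEN`` and
``PRINT_ERR()`` are illustrative placeholders, not values from the patch:

   /* fill FILL_LEN bytes at dst_iova with a repeating 8-byte pattern,
    * submitting to the hardware immediately via the op flag */
   uint64_t pattern = 0xA5A5A5A5A5A5A5A5ULL;

   if (rte_dma_fill(dev_id, vchan, pattern, dst_iova, FILL_LEN,
         RTE_DMA_OP_FLAG_SUBMIT) < 0) {
      PRINT_ERR("Error with rte_dma_fill\n");
      return -1;
   }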