[v5,1/6] ethdev: introduce Rx buffer split

Message ID 1602616917-22193-2-git-send-email-viacheslavo@nvidia.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series ethdev: introduce Rx buffer split |

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Slava Ovsiienko Oct. 13, 2020, 7:21 p.m. UTC
  The DPDK datapath in the transmit direction is very flexible.
An application can build the multi-segment packet and manages
almost all data aspects - the memory pools where segments
are allocated from, the segment lengths, the memory attributes
like external buffers, registered for DMA, etc.

In the receiving direction, the datapath is much less flexible,
an application can only specify the memory pool to configure the
receiving queue and nothing more. In order to extend receiving
datapath capabilities it is proposed to add the way to provide
extended information how to split the packets being received.

The following structure is introduced to specify the Rx packet
segment:

struct rte_eth_rxseg {
    struct rte_mempool *mp; /* memory pools to allocate segment from */
    uint16_t length; /* segment maximal data length,
		       	configures "split point" */
    uint16_t offset; /* data offset from beginning
		       	of mbuf data buffer */
    uint32_t reserved; /* reserved field */
};

The segment descriptions are added to the rte_eth_rxconf structure:
   rx_seg - pointer the array of segment descriptions, each element
             describes the memory pool, maximal data length, initial
             data offset from the beginning of data buffer in mbuf.
	     This array allows to specify the different settings for
	     each segment in individual fashion.
   n_seg - number of elements in the array

If the extended segment descriptions is provided with these new
fields the mp parameter of the rte_eth_rx_queue_setup must be
specified as NULL to avoid ambiguity.

The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
capabilities is introduced to present the way for PMD to report to
application about supporting Rx packet split to configurable
segments. Prior invoking the rte_eth_rx_queue_setup() routine
application should check RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.

If the Rx queue is configured with new routine the packets being
received will be split into multiple segments pushed to the mbufs
with specified attributes. The PMD will split the received packets
into multiple segments according to the specification in the
description array:

- the first network buffer will be allocated from the memory pool,
  specified in the first segment description element, the second
  network buffer - from the pool in the second segment description
  element and so on. If there is no enough elements to describe
  the buffer for entire packet of maximal length the pool from the
  last valid element will be used to allocate the buffers from for the
  rest of segments

- the offsets from the segment description elements will provide
  the data offset from the buffer beginning except the first mbuf -
  for this one the offset is added to the RTE_PKTMBUF_HEADROOM to get
  actual offset from the buffer beginning. If there is no enough
  elements to describe the buffer for entire packet of maximal length
  the offsets for the rest of segment will be supposed to be zero.

- the data length being received to each segment is limited  by the
  length specified in the segment description element. The data
  receiving starts with filling up the first mbuf data buffer, if the
  specified maximal segment length is reached and there are data
  remaining (packet is longer than buffer in the first mbuf) the
  following data will be pushed to the next segment up to its own
  maximal length. If the first two segments is not enough to store
  all the packet remaining data  the next (third) segment will
  be engaged and so on. If the length in the segment description
  element is zero the actual buffer size will be deduced from
  the appropriate memory pool properties. If there is no enough
  elements to describe the buffer for entire packet of maximal
  length the buffer size will be deduced from the pool of the last
  valid element for the remaining segments.

For example, let's suppose we configured the Rx queue with the
following segments:
    seg0 - pool0, len0=14B, off0=2
    seg1 - pool1, len1=20B, off1=128B
    seg2 - pool2, len2=20B, off2=0B
    seg3 - pool3, len3=512B, off3=0B

The packet 46 bytes long will look like the following:
    seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
    seg1 - 20B long @ 128 in mbuf from pool1
    seg2 - 12B long @ 0 in mbuf from pool2

The packet 1500 bytes long will look like the following:
    seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
    seg1 - 20B @ 128 in mbuf from pool1
    seg2 - 20B @ 0 in mbuf from pool2
    seg3 - 512B @ 0 in mbuf from pool3
    seg4 - 512B @ 0 in mbuf from pool3
    seg5 - 422B @ 0 in mbuf from pool3

The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
configured to support new buffer split feature (if n_seg
is greater than one).

The new approach would allow splitting the ingress packets into
multiple parts pushed to the memory with different attributes.
For example, the packet headers can be pushed to the embedded
data buffers within mbufs and the application data into
the external buffers attached to mbufs allocated from the
different memory pools. The memory attributes for the split
parts may differ either - for example the application data
may be pushed into the external memory located on the dedicated
physical device, say GPU or NVMe. This would improve the DPDK
receiving datapath flexibility with preserving compatibility
with existing API.

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/nics/features.rst             | 15 +++++
 doc/guides/rel_notes/release_20_11.rst   |  9 +++
 lib/librte_ethdev/rte_ethdev.c           | 95 ++++++++++++++++++++++++--------
 lib/librte_ethdev/rte_ethdev.h           | 58 ++++++++++++++++++-
 lib/librte_ethdev/rte_ethdev_version.map |  1 +
 5 files changed, 155 insertions(+), 23 deletions(-)
  

Comments

Ferruh Yigit Oct. 13, 2020, 10:34 p.m. UTC | #1
On 10/13/2020 8:21 PM, Viacheslav Ovsiienko wrote:
> The DPDK datapath in the transmit direction is very flexible.
> An application can build the multi-segment packet and manages
> almost all data aspects - the memory pools where segments
> are allocated from, the segment lengths, the memory attributes
> like external buffers, registered for DMA, etc.
> 
> In the receiving direction, the datapath is much less flexible,
> an application can only specify the memory pool to configure the
> receiving queue and nothing more. In order to extend receiving
> datapath capabilities it is proposed to add the way to provide
> extended information how to split the packets being received.
> 
> The following structure is introduced to specify the Rx packet
> segment:
> 
> struct rte_eth_rxseg {
>      struct rte_mempool *mp; /* memory pools to allocate segment from */
>      uint16_t length; /* segment maximal data length,
> 		       	configures "split point" */
>      uint16_t offset; /* data offset from beginning
> 		       	of mbuf data buffer */
>      uint32_t reserved; /* reserved field */
> };
> 
> The segment descriptions are added to the rte_eth_rxconf structure:
>     rx_seg - pointer the array of segment descriptions, each element
>               describes the memory pool, maximal data length, initial
>               data offset from the beginning of data buffer in mbuf.
> 	     This array allows to specify the different settings for
> 	     each segment in individual fashion.
>     n_seg - number of elements in the array
> 
> If the extended segment descriptions is provided with these new
> fields the mp parameter of the rte_eth_rx_queue_setup must be
> specified as NULL to avoid ambiguity.
> 
> The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
> capabilities is introduced to present the way for PMD to report to
> application about supporting Rx packet split to configurable
> segments. Prior invoking the rte_eth_rx_queue_setup() routine
> application should check RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.
> 
> If the Rx queue is configured with new routine the packets being
> received will be split into multiple segments pushed to the mbufs
> with specified attributes. The PMD will split the received packets
> into multiple segments according to the specification in the
> description array:
> 
> - the first network buffer will be allocated from the memory pool,
>    specified in the first segment description element, the second
>    network buffer - from the pool in the second segment description
>    element and so on. If there is no enough elements to describe
>    the buffer for entire packet of maximal length the pool from the
>    last valid element will be used to allocate the buffers from for the
>    rest of segments
> 
> - the offsets from the segment description elements will provide
>    the data offset from the buffer beginning except the first mbuf -
>    for this one the offset is added to the RTE_PKTMBUF_HEADROOM to get
>    actual offset from the buffer beginning. If there is no enough
>    elements to describe the buffer for entire packet of maximal length
>    the offsets for the rest of segment will be supposed to be zero.
> 
> - the data length being received to each segment is limited  by the
>    length specified in the segment description element. The data
>    receiving starts with filling up the first mbuf data buffer, if the
>    specified maximal segment length is reached and there are data
>    remaining (packet is longer than buffer in the first mbuf) the
>    following data will be pushed to the next segment up to its own
>    maximal length. If the first two segments is not enough to store
>    all the packet remaining data  the next (third) segment will
>    be engaged and so on. If the length in the segment description
>    element is zero the actual buffer size will be deduced from
>    the appropriate memory pool properties. If there is no enough
>    elements to describe the buffer for entire packet of maximal
>    length the buffer size will be deduced from the pool of the last
>    valid element for the remaining segments.
> 
> For example, let's suppose we configured the Rx queue with the
> following segments:
>      seg0 - pool0, len0=14B, off0=2
>      seg1 - pool1, len1=20B, off1=128B
>      seg2 - pool2, len2=20B, off2=0B
>      seg3 - pool3, len3=512B, off3=0B
> 
> The packet 46 bytes long will look like the following:
>      seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>      seg1 - 20B long @ 128 in mbuf from pool1
>      seg2 - 12B long @ 0 in mbuf from pool2
> 
> The packet 1500 bytes long will look like the following:
>      seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>      seg1 - 20B @ 128 in mbuf from pool1
>      seg2 - 20B @ 0 in mbuf from pool2
>      seg3 - 512B @ 0 in mbuf from pool3
>      seg4 - 512B @ 0 in mbuf from pool3
>      seg5 - 422B @ 0 in mbuf from pool3
> 
> The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
> configured to support new buffer split feature (if n_seg
> is greater than one).
> 
> The new approach would allow splitting the ingress packets into
> multiple parts pushed to the memory with different attributes.
> For example, the packet headers can be pushed to the embedded
> data buffers within mbufs and the application data into
> the external buffers attached to mbufs allocated from the
> different memory pools. The memory attributes for the split
> parts may differ either - for example the application data
> may be pushed into the external memory located on the dedicated
> physical device, say GPU or NVMe. This would improve the DPDK
> receiving datapath flexibility with preserving compatibility
> with existing API.
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> ---
>   doc/guides/nics/features.rst             | 15 +++++
>   doc/guides/rel_notes/release_20_11.rst   |  9 +++
>   lib/librte_ethdev/rte_ethdev.c           | 95 ++++++++++++++++++++++++--------
>   lib/librte_ethdev/rte_ethdev.h           | 58 ++++++++++++++++++-
>   lib/librte_ethdev/rte_ethdev_version.map |  1 +

Can you please update deprecation notice too, to remove the notice?

>   5 files changed, 155 insertions(+), 23 deletions(-)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index dd8c955..a45a9e8 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -185,6 +185,21 @@ Supports receiving segmented mbufs.
>   * **[related]    eth_dev_ops**: ``rx_pkt_burst``.
>   
>   
> +.. _nic_features_buffer_split:
> +
> +Buffer Split on Rx
> +------------------
> +
> +Scatters the packets being received on specified boundaries to segmented mbufs.
> +
> +* **[uses]       rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.

uses 'rx_seg and rx_nseg' from 'rte_eth_rxconf', perhaps it can be another line.

> +* **[implements] datapath**: ``Buffer Split functionality``.
> +* **[implements] rte_eth_dev_data**: ``buffer_split``.

What is implemented here?

> +* **[provides]   rte_eth_dev_info**: ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> +* **[provides]   eth_dev_ops**: ``rxq_info_get:buffer_split``.

Is this correct?

> +* **[related] API**: ``rte_eth_rx_queue_setup()``.
> +
> +
>   .. _nic_features_lro:
>   
>   LRO
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index bcc0fc2..f67ec62 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -60,6 +60,12 @@ New Features
>     Added the FEC API which provides functions for query FEC capabilities and
>     current FEC mode from device. Also, API for configuring FEC mode is also provided.
>   
> +* **Introduced extended buffer description for receiving.**
> +
> +  Added the extended Rx queue setup routine providing the individual

This looks wrong with last version.

> +  descriptions for each Rx segment with maximal size, buffer offset and memory
> +  pool to allocate data buffers from.
> +
>   * **Updated Broadcom bnxt driver.**
>   
>     Updated the Broadcom bnxt driver with new features and improvements, including:
> @@ -253,6 +259,9 @@ API Changes
>     As the data of ``uint8_t`` will be truncated when queue number under
>     a TC is greater than 256.
>   
> +* ethdev: Added fields rx_seg and rx_nseg to rte_eth_rxconf structure
> +  to provide extended description of the receiving buffer.
> +
>   * vhost: Moved vDPA APIs from experimental to stable.
>   
>   * rawdev: Added a structure size parameter to the functions
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 892c246..7b64a6e 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -105,6 +105,9 @@ struct rte_eth_xstats_name_off {
>   #define RTE_RX_OFFLOAD_BIT2STR(_name)	\
>   	{ DEV_RX_OFFLOAD_##_name, #_name }
>   
> +#define RTE_ETH_RX_OFFLOAD_BIT2STR(_name)	\
> +	{ RTE_ETH_RX_OFFLOAD_##_name, #_name }
> +
>   static const struct {
>   	uint64_t offload;
>   	const char *name;
> @@ -128,9 +131,11 @@ struct rte_eth_xstats_name_off {
>   	RTE_RX_OFFLOAD_BIT2STR(SCTP_CKSUM),
>   	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>   	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> +	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
>   };
>   
>   #undef RTE_RX_OFFLOAD_BIT2STR
> +#undef RTE_ETH_RX_OFFLOAD_BIT2STR
>   
>   #define RTE_TX_OFFLOAD_BIT2STR(_name)	\
>   	{ DEV_TX_OFFLOAD_##_name, #_name }
> @@ -1770,10 +1775,14 @@ struct rte_eth_dev *
>   		       struct rte_mempool *mp)
>   {
>   	int ret;
> +	uint16_t seg_idx;
>   	uint32_t mbp_buf_size;
>   	struct rte_eth_dev *dev;
>   	struct rte_eth_dev_info dev_info;
>   	struct rte_eth_rxconf local_conf;
> +	const struct rte_eth_rxseg *rx_seg;
> +	struct rte_eth_rxseg seg_single = { .mp = mp};
> +	uint16_t n_seg;
>   	void **rxq;
>   
>   	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> @@ -1784,13 +1793,32 @@ struct rte_eth_dev *
>   		return -EINVAL;
>   	}
>   
> -	if (mp == NULL) {
> -		RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> -		return -EINVAL;
> -	}
> -
>   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup, -ENOTSUP);
>   
> +	rx_seg = rx_conf->rx_seg;
> +	n_seg = rx_conf->rx_nseg;
> +	if (rx_seg == NULL) {
> +		/* Exclude ambiguities about segment descrtiptions. */
> +		if (n_seg) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "Non empty array with null pointer\n");
> +			return -EINVAL;
> +		}
> +		rx_seg = &seg_single;
> +		n_seg = 1;

Why setting 'rx_seg' & 'n_seg'? Why not leaving them NULL and 0 when not used?
This was PMD can do NULL/0 check and can know they are not used.


I think better to do a "if (mp == NULL)" check here, both 'rx_seg' & 'mp' should 
not be NULL.

> +	} else {
> +		if (n_seg == 0) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "Invalid zero descriptions number\n");
> +			return -EINVAL;
> +		}
> +		if (mp) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "Memory pool duplicated definition\n");
> +			return -EINVAL;
> +		}
> +	}
> +
>   	/*
>   	 * Check the size of the mbuf data buffer.
>   	 * This value must be provided in the private data of the memory pool.
> @@ -1800,23 +1828,48 @@ struct rte_eth_dev *
>   	if (ret != 0)
>   		return ret;
>   
> -	if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
> -		RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> -			mp->name, (int)mp->private_data_size,
> -			(int)sizeof(struct rte_pktmbuf_pool_private));
> -		return -ENOSPC;
> -	}
> -	mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> +	for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
> +		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
> +		uint32_t length = rx_seg[seg_idx].length;
> +		uint32_t offset = rx_seg[seg_idx].offset;
> +		uint32_t head_room = seg_idx ? 0 : RTE_PKTMBUF_HEADROOM;
>   
> -	if (mbp_buf_size < dev_info.min_rx_bufsize + RTE_PKTMBUF_HEADROOM) {
> -		RTE_ETHDEV_LOG(ERR,
> -			"%s mbuf_data_room_size %d < %d (RTE_PKTMBUF_HEADROOM=%d + min_rx_bufsize(dev)=%d)\n",
> -			mp->name, (int)mbp_buf_size,
> -			(int)(RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize),
> -			(int)RTE_PKTMBUF_HEADROOM,
> -			(int)dev_info.min_rx_bufsize);
> -		return -EINVAL;
> +		if (mpl == NULL) {
> +			RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> +			return -EINVAL;
> +		}
> +
> +		if (mpl->private_data_size <
> +				sizeof(struct rte_pktmbuf_pool_private)) {
> +			RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> +				mpl->name, (int)mpl->private_data_size,
> +				(int)sizeof(struct rte_pktmbuf_pool_private));
> +			return -ENOSPC;
> +		}
> +
> +		mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> +		length = length ? length : (mbp_buf_size - head_room);
> +		if (mbp_buf_size < length + offset + head_room) {
> +			RTE_ETHDEV_LOG(ERR,
> +				"%s mbuf_data_room_size %u < %u"
> +				" (segment length=%u + segment offset=%u)\n",
> +				mpl->name, mbp_buf_size,
> +				length + offset, length, offset);
> +			return -EINVAL;
> +		}
>   	}
> +	/* Check the minimal buffer size for the single segment only. */

This is the main branch, what do you think moving the comment to the beggining 
of above for loop and add a comment about testing the multiple segment.

Btw, I have a concern that this single/multi segment can cause a confusion with 
multi segment packets. Can something else, like "split package" can be used 
instead of segment?

> +	if (mp && (mbp_buf_size < dev_info.min_rx_bufsize +
> +				  RTE_PKTMBUF_HEADROOM)) {
> +		RTE_ETHDEV_LOG(ERR,
> +			       "%s mbuf_data_room_size %u < %u "
> +			       "(RTE_PKTMBUF_HEADROOM=%u + "
> +			       "min_rx_bufsize(dev)=%u)\n",
> +			       mp->name, mbp_buf_size,
> +			       RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize,
> +			       RTE_PKTMBUF_HEADROOM, dev_info.min_rx_bufsize);
> +			return -EINVAL;
> +		}
>   
>   	/* Use default specified by driver, if nb_rx_desc is zero */
>   	if (nb_rx_desc == 0) {
> @@ -1914,8 +1967,6 @@ struct rte_eth_dev *
>   			dev->data->min_rx_buf_size = mbp_buf_size;
>   	}
>   
> -	rte_ethdev_trace_rxq_setup(port_id, rx_queue_id, nb_rx_desc, mp,
> -		rx_conf, ret);

Is this removed intentionally?

>   	return eth_err(port_id, ret);
>   }
>   
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 5bcfbb8..9cf0a03 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -970,6 +970,16 @@ struct rte_eth_txmode {
>   };
>   
>   /**
> + * A structure used to configure an RX packet segment to split.
> + */
> +struct rte_eth_rxseg {
> +	struct rte_mempool *mp; /**< Memory pools to allocate segment from */
> +	uint16_t length; /**< Segment data length, configures split point. */
> +	uint16_t offset; /**< Data offset from beginning of mbuf data buffer */
> +	uint32_t reserved; /**< Reserved field */
> +};
> +
> +/**
>    * A structure used to configure an RX ring of an Ethernet port.
>    */
>   struct rte_eth_rxconf {
> @@ -977,13 +987,23 @@ struct rte_eth_rxconf {
>   	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
>   	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>   	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> +	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	/**
> +	 * The pointer to the array of segment descriptions, each element
> +	 * describes the memory pool, maximal segment data length, initial
> +	 * data offset from the beginning of data buffer in mbuf. This allow
> +	 * to specify the dedicated properties for each segment in the receiving
> +	 * buffer - pool, buffer offset, maximal segment size. The number of
> +	 * segment descriptions in the array is specified by the rx_nseg
> +	 * field.
> +	 */

What do you think providing a short description here, and move above comment to 
abice "struct rte_eth_rxseg" struct?

> +	struct rte_eth_rxseg *rx_seg;
>   	/**
>   	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>   	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
>   	 * fields on rte_eth_dev_info structure are allowed to be set.
>   	 */
>   	uint64_t offloads;
> -

unrelated

>   	uint64_t reserved_64s[2]; /**< Reserved for future fields */
>   	void *reserved_ptrs[2];   /**< Reserved for future fields */
>   };
> @@ -1260,6 +1280,7 @@ struct rte_eth_conf {
>   #define DEV_RX_OFFLOAD_SCTP_CKSUM	0x00020000
>   #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>   #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
> +#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
>   
>   #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>   				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> @@ -2027,6 +2048,41 @@ int rte_eth_dev_configure(uint16_t port_id, uint16_t nb_rx_queue,
>    *   No need to repeat any bit in rx_conf->offloads which has already been
>    *   enabled in rte_eth_dev_configure() at port level. An offloading enabled
>    *   at port level can't be disabled at queue level.

Can it be possible to put a kind of marker here, like "@rx_seg & @rx_nseg", to 
clarify what are you talking about.

> + *   The configuration structure also contains the pointer to the array
> + *   of the receiving buffer segment descriptions, each element describes
> + *   the memory pool, maximal segment data length, initial data offset from
> + *   the beginning of data buffer in mbuf. This allow to specify the dedicated
> + *   properties for each segment in the receiving buffer - pool, buffer
> + *   offset, maximal segment size. If RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT offload
> + *   flag is configured the PMD will split the received packets into multiple
> + *   segments according to the specification in the description array:
> + *   - the first network buffer will be allocated from the memory pool,
> + *     specified in the first segment description element, the second
> + *     network buffer - from the pool in the second segment description
> + *     element and so on. If there is no enough elements to describe
> + *     the buffer for entire packet of maximal length the pool from the last
> + *     valid element will be used to allocate the buffers from for the rest
> + *     of segments.
> + *   - the offsets from the segment description elements will provide the
> + *     data offset from the buffer beginning except the first mbuf - for this
> + *     one the offset is added to the RTE_PKTMBUF_HEADROOM to get actual
> + *     offset from the buffer beginning. If there is no enough elements
> + *     to describe the buffer for entire packet of maximal length the offsets
> + *     for the rest of segment will be supposed to be zero.
> + *   - the data length being received to each segment is limited by the
> + *     length specified in the segment description element. The data receiving
> + *     starts with filling up the first mbuf data buffer, if the specified
> + *     maximal segment length is reached and there are data remaining
> + *     (packet is longer than buffer in the first mbuf) the following data
> + *     will be pushed to the next segment up to its own length. If the first
> + *     two segments is not enough to store all the packet data the next
> + *     (third) segment will be engaged and so on. If the length in the segment
> + *     description element is zero the actual buffer size will be deduced
> + *     from the appropriate memory pool properties. If there is no enough
> + *     elements to describe the buffer for entire packet of maximal length
> + *     the buffer size will be deduced from the pool of the last valid
> + *     element for the all remaining segments.
> + *

I think as a first thing the comment should clarify that if @rx_seg provided 
'mb_pool' should be NULL, and if split Rx feature is not used "@rx_seg & 
@rx_nseg" should be NULL and 0.

Also above is too wordy, it is hard to follow. Like "@rx_seg & @rx_nseg" are 
only taken into account if application provides 
'RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT' offload should be clearer to see, etc.
Can you try to simplify it, perhpas moving some of above comments to the "struct 
rte_eth_rxseg" can work?

>    * @param mb_pool
>    *   The pointer to the memory pool from which to allocate *rte_mbuf* network
>    *   memory buffers to populate each descriptor of the receive ring.
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index f8a0945..25f7cee 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -232,6 +232,7 @@ EXPERIMENTAL {
>   	rte_eth_fec_get_capability;
>   	rte_eth_fec_get;
>   	rte_eth_fec_set;
> +

unrelated
  
Olivier Matz Oct. 14, 2020, 1:31 p.m. UTC | #2
Hi Slava,

After reviewing this version and comparing with v4, I think having the
configuration in rxconf (like in this patch) is a better choice.

Few comments below.

On Tue, Oct 13, 2020 at 11:34:16PM +0100, Ferruh Yigit wrote:
> On 10/13/2020 8:21 PM, Viacheslav Ovsiienko wrote:
> > The DPDK datapath in the transmit direction is very flexible.
> > An application can build the multi-segment packet and manages
> > almost all data aspects - the memory pools where segments
> > are allocated from, the segment lengths, the memory attributes
> > like external buffers, registered for DMA, etc.
> > 
> > In the receiving direction, the datapath is much less flexible,
> > an application can only specify the memory pool to configure the
> > receiving queue and nothing more. In order to extend receiving
> > datapath capabilities it is proposed to add the way to provide
> > extended information how to split the packets being received.
> > 
> > The following structure is introduced to specify the Rx packet
> > segment:
> > 
> > struct rte_eth_rxseg {
> >      struct rte_mempool *mp; /* memory pools to allocate segment from */
> >      uint16_t length; /* segment maximal data length,
> > 		       	configures "split point" */
> >      uint16_t offset; /* data offset from beginning
> > 		       	of mbuf data buffer */
> >      uint32_t reserved; /* reserved field */
> > };
> > 
> > The segment descriptions are added to the rte_eth_rxconf structure:
> >     rx_seg - pointer the array of segment descriptions, each element
> >               describes the memory pool, maximal data length, initial
> >               data offset from the beginning of data buffer in mbuf.
> > 	     This array allows to specify the different settings for
> > 	     each segment in individual fashion.
> >     n_seg - number of elements in the array
> > 
> > If the extended segment descriptions is provided with these new
> > fields the mp parameter of the rte_eth_rx_queue_setup must be
> > specified as NULL to avoid ambiguity.
> > 
> > The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
> > capabilities is introduced to present the way for PMD to report to
> > application about supporting Rx packet split to configurable
> > segments. Prior invoking the rte_eth_rx_queue_setup() routine
> > application should check RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.
> > 
> > If the Rx queue is configured with new routine the packets being
> > received will be split into multiple segments pushed to the mbufs
> > with specified attributes. The PMD will split the received packets
> > into multiple segments according to the specification in the
> > description array:
> > 
> > - the first network buffer will be allocated from the memory pool,
> >    specified in the first segment description element, the second
> >    network buffer - from the pool in the second segment description
> >    element and so on. If there is no enough elements to describe
> >    the buffer for entire packet of maximal length the pool from the
> >    last valid element will be used to allocate the buffers from for the
> >    rest of segments
> > 
> > - the offsets from the segment description elements will provide
> >    the data offset from the buffer beginning except the first mbuf -
> >    for this one the offset is added to the RTE_PKTMBUF_HEADROOM to get
> >    actual offset from the buffer beginning. If there is no enough
> >    elements to describe the buffer for entire packet of maximal length
> >    the offsets for the rest of segment will be supposed to be zero.
> > 
> > - the data length being received to each segment is limited  by the
> >    length specified in the segment description element. The data
> >    receiving starts with filling up the first mbuf data buffer, if the
> >    specified maximal segment length is reached and there are data
> >    remaining (packet is longer than buffer in the first mbuf) the
> >    following data will be pushed to the next segment up to its own
> >    maximal length. If the first two segments is not enough to store
> >    all the packet remaining data  the next (third) segment will
> >    be engaged and so on. If the length in the segment description
> >    element is zero the actual buffer size will be deduced from
> >    the appropriate memory pool properties. If there is no enough
> >    elements to describe the buffer for entire packet of maximal
> >    length the buffer size will be deduced from the pool of the last
> >    valid element for the remaining segments.
> > 
> > For example, let's suppose we configured the Rx queue with the
> > following segments:
> >      seg0 - pool0, len0=14B, off0=2
> >      seg1 - pool1, len1=20B, off1=128B
> >      seg2 - pool2, len2=20B, off2=0B
> >      seg3 - pool3, len3=512B, off3=0B
> > 
> > The packet 46 bytes long will look like the following:
> >      seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
> >      seg1 - 20B long @ 128 in mbuf from pool1
> >      seg2 - 12B long @ 0 in mbuf from pool2
> > 
> > The packet 1500 bytes long will look like the following:
> >      seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
> >      seg1 - 20B @ 128 in mbuf from pool1
> >      seg2 - 20B @ 0 in mbuf from pool2
> >      seg3 - 512B @ 0 in mbuf from pool3
> >      seg4 - 512B @ 0 in mbuf from pool3
> >      seg5 - 422B @ 0 in mbuf from pool3
> > 
> > The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
> > configured to support new buffer split feature (if n_seg
> > is greater than one).
> > 
> > The new approach would allow splitting the ingress packets into
> > multiple parts pushed to the memory with different attributes.
> > For example, the packet headers can be pushed to the embedded
> > data buffers within mbufs and the application data into
> > the external buffers attached to mbufs allocated from the
> > different memory pools. The memory attributes for the split
> > parts may differ either - for example the application data
> > may be pushed into the external memory located on the dedicated
> > physical device, say GPU or NVMe. This would improve the DPDK
> > receiving datapath flexibility with preserving compatibility
> > with existing API.
> > 
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> > ---
> >   doc/guides/nics/features.rst             | 15 +++++
> >   doc/guides/rel_notes/release_20_11.rst   |  9 +++
> >   lib/librte_ethdev/rte_ethdev.c           | 95 ++++++++++++++++++++++++--------
> >   lib/librte_ethdev/rte_ethdev.h           | 58 ++++++++++++++++++-
> >   lib/librte_ethdev/rte_ethdev_version.map |  1 +
> 
> Can you please update deprecation notice too, to remove the notice?
> 
> >   5 files changed, 155 insertions(+), 23 deletions(-)
> > 
> > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > index dd8c955..a45a9e8 100644
> > --- a/doc/guides/nics/features.rst
> > +++ b/doc/guides/nics/features.rst
> > @@ -185,6 +185,21 @@ Supports receiving segmented mbufs.
> >   * **[related]    eth_dev_ops**: ``rx_pkt_burst``.
> > +.. _nic_features_buffer_split:
> > +
> > +Buffer Split on Rx
> > +------------------
> > +
> > +Scatters the packets being received on specified boundaries to segmented mbufs.
> > +
> > +* **[uses]       rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> 
> uses 'rx_seg and rx_nseg' from 'rte_eth_rxconf', perhaps it can be another line.
> 
> > +* **[implements] datapath**: ``Buffer Split functionality``.
> > +* **[implements] rte_eth_dev_data**: ``buffer_split``.
> 
> What is implemented here?
> 
> > +* **[provides]   rte_eth_dev_info**: ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> > +* **[provides]   eth_dev_ops**: ``rxq_info_get:buffer_split``.
> 
> Is this correct?
> 
> > +* **[related] API**: ``rte_eth_rx_queue_setup()``.
> > +
> > +
> >   .. _nic_features_lro:
> >   LRO
> > diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> > index bcc0fc2..f67ec62 100644
> > --- a/doc/guides/rel_notes/release_20_11.rst
> > +++ b/doc/guides/rel_notes/release_20_11.rst
> > @@ -60,6 +60,12 @@ New Features
> >     Added the FEC API which provides functions for query FEC capabilities and
> >     current FEC mode from device. Also, API for configuring FEC mode is also provided.
> > +* **Introduced extended buffer description for receiving.**
> > +
> > +  Added the extended Rx queue setup routine providing the individual
> 
> This looks wrong with last version.
> 
> > +  descriptions for each Rx segment with maximal size, buffer offset and memory
> > +  pool to allocate data buffers from.
> > +
> >   * **Updated Broadcom bnxt driver.**
> >     Updated the Broadcom bnxt driver with new features and improvements, including:
> > @@ -253,6 +259,9 @@ API Changes
> >     As the data of ``uint8_t`` will be truncated when queue number under
> >     a TC is greater than 256.
> > +* ethdev: Added fields rx_seg and rx_nseg to rte_eth_rxconf structure
> > +  to provide extended description of the receiving buffer.
> > +
> >   * vhost: Moved vDPA APIs from experimental to stable.
> >   * rawdev: Added a structure size parameter to the functions
> > diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> > index 892c246..7b64a6e 100644
> > --- a/lib/librte_ethdev/rte_ethdev.c
> > +++ b/lib/librte_ethdev/rte_ethdev.c
> > @@ -105,6 +105,9 @@ struct rte_eth_xstats_name_off {
> >   #define RTE_RX_OFFLOAD_BIT2STR(_name)	\
> >   	{ DEV_RX_OFFLOAD_##_name, #_name }
> > +#define RTE_ETH_RX_OFFLOAD_BIT2STR(_name)	\
> > +	{ RTE_ETH_RX_OFFLOAD_##_name, #_name }
> > +
> >   static const struct {
> >   	uint64_t offload;
> >   	const char *name;
> > @@ -128,9 +131,11 @@ struct rte_eth_xstats_name_off {
> >   	RTE_RX_OFFLOAD_BIT2STR(SCTP_CKSUM),
> >   	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> >   	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > +	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> >   };
> >   #undef RTE_RX_OFFLOAD_BIT2STR
> > +#undef RTE_ETH_RX_OFFLOAD_BIT2STR
> >   #define RTE_TX_OFFLOAD_BIT2STR(_name)	\
> >   	{ DEV_TX_OFFLOAD_##_name, #_name }
> > @@ -1770,10 +1775,14 @@ struct rte_eth_dev *
> >   		       struct rte_mempool *mp)
> >   {
> >   	int ret;
> > +	uint16_t seg_idx;
> >   	uint32_t mbp_buf_size;
> >   	struct rte_eth_dev *dev;
> >   	struct rte_eth_dev_info dev_info;
> >   	struct rte_eth_rxconf local_conf;
> > +	const struct rte_eth_rxseg *rx_seg;
> > +	struct rte_eth_rxseg seg_single = { .mp = mp};

missing space

> > +	uint16_t n_seg;
> >   	void **rxq;
> >   	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> > @@ -1784,13 +1793,32 @@ struct rte_eth_dev *
> >   		return -EINVAL;
> >   	}
> > -	if (mp == NULL) {
> > -		RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> > -		return -EINVAL;
> > -	}
> > -
> >   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup, -ENOTSUP);
> > +	rx_seg = rx_conf->rx_seg;
> > +	n_seg = rx_conf->rx_nseg;
> > +	if (rx_seg == NULL) {
> > +		/* Exclude ambiguities about segment descrtiptions. */
> > +		if (n_seg) {
> > +			RTE_ETHDEV_LOG(ERR,
> > +				       "Non empty array with null pointer\n");
> > +			return -EINVAL;
> > +		}
> > +		rx_seg = &seg_single;
> > +		n_seg = 1;
> 
> Why setting 'rx_seg' & 'n_seg'? Why not leaving them NULL and 0 when not used?
> This was PMD can do NULL/0 check and can know they are not used.

I think they are just set locally to factorize the checks below, but I agree it
is questionnable: it seems the only check which is really factorized is about
private_data_size.

> I think better to do a "if (mp == NULL)" check here, both 'rx_seg' & 'mp'
> should not be NULL.

Agree, something like this looks more simple:

	if (mp == NULL) {
		if (n_seg == 0 || rx_seg == NULL)
			RTE_ETHDEV_LOG(ERR, "...");
	} else {
		if (n_seg != 0 || rx_seg != NULL)
			RTE_ETHDEV_LOG(ERR, "...");
		rx_seg = &seg_single;
		n_seg = 1;
	}

 
> > +	} else {
> > +		if (n_seg == 0) {
> > +			RTE_ETHDEV_LOG(ERR,
> > +				       "Invalid zero descriptions number\n");
> > +			return -EINVAL;
> > +		}
> > +		if (mp) {
> > +			RTE_ETHDEV_LOG(ERR,
> > +				       "Memory pool duplicated definition\n");
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> >   	/*
> >   	 * Check the size of the mbuf data buffer.
> >   	 * This value must be provided in the private data of the memory pool.
> > @@ -1800,23 +1828,48 @@ struct rte_eth_dev *
> >   	if (ret != 0)
> >   		return ret;
> > -	if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
> > -		RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> > -			mp->name, (int)mp->private_data_size,
> > -			(int)sizeof(struct rte_pktmbuf_pool_private));
> > -		return -ENOSPC;
> > -	}
> > -	mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> > +	for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
> > +		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
> > +		uint32_t length = rx_seg[seg_idx].length;
> > +		uint32_t offset = rx_seg[seg_idx].offset;
> > +		uint32_t head_room = seg_idx ? 0 : RTE_PKTMBUF_HEADROOM;
> > -	if (mbp_buf_size < dev_info.min_rx_bufsize + RTE_PKTMBUF_HEADROOM) {
> > -		RTE_ETHDEV_LOG(ERR,
> > -			"%s mbuf_data_room_size %d < %d (RTE_PKTMBUF_HEADROOM=%d + min_rx_bufsize(dev)=%d)\n",
> > -			mp->name, (int)mbp_buf_size,
> > -			(int)(RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize),
> > -			(int)RTE_PKTMBUF_HEADROOM,
> > -			(int)dev_info.min_rx_bufsize);
> > -		return -EINVAL;
> > +		if (mpl == NULL) {
> > +			RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> > +			return -EINVAL;
> > +		}
> > +
> > +		if (mpl->private_data_size <
> > +				sizeof(struct rte_pktmbuf_pool_private)) {
> > +			RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> > +				mpl->name, (int)mpl->private_data_size,
> > +				(int)sizeof(struct rte_pktmbuf_pool_private));
> > +			return -ENOSPC;
> > +		}
> > +
> > +		mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> > +		length = length ? length : (mbp_buf_size - head_room);
> > +		if (mbp_buf_size < length + offset + head_room) {


Is length == 0 allowed? Or is it to handle the case where mp != NULL?
Is the test needed in that case? It seems it is equivalent to do "if
(offset > 0)".

Wouldn't it be better to check mp like this?

	if (mp == NULL) {
		if (mbp_buf_size < length + offset + head_room)
			...error...
	}


> > +			RTE_ETHDEV_LOG(ERR,
> > +				"%s mbuf_data_room_size %u < %u"
> > +				" (segment length=%u + segment offset=%u)\n",
> > +				mpl->name, mbp_buf_size,
> > +				length + offset, length, offset);
> > +			return -EINVAL;
> > +		}
> >   	}
> > +	/* Check the minimal buffer size for the single segment only. */
> 
> This is the main branch, what do you think moving the comment to the
> beggining of above for loop and add a comment about testing the multiple
> segment.
> 
> Btw, I have a concern that this single/multi segment can cause a confusion
> with multi segment packets. Can something else, like "split package" can be
> used instead of segment?
> 
> > +	if (mp && (mbp_buf_size < dev_info.min_rx_bufsize +
> > +				  RTE_PKTMBUF_HEADROOM)) {
> > +		RTE_ETHDEV_LOG(ERR,
> > +			       "%s mbuf_data_room_size %u < %u "
> > +			       "(RTE_PKTMBUF_HEADROOM=%u + "
> > +			       "min_rx_bufsize(dev)=%u)\n",
> > +			       mp->name, mbp_buf_size,
> > +			       RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize,
> > +			       RTE_PKTMBUF_HEADROOM, dev_info.min_rx_bufsize);
> > +			return -EINVAL;
> > +		}

The "}" is not indented correctly


> >   	/* Use default specified by driver, if nb_rx_desc is zero */
> >   	if (nb_rx_desc == 0) {
> > @@ -1914,8 +1967,6 @@ struct rte_eth_dev *
> >   			dev->data->min_rx_buf_size = mbp_buf_size;
> >   	}
> > -	rte_ethdev_trace_rxq_setup(port_id, rx_queue_id, nb_rx_desc, mp,
> > -		rx_conf, ret);
> 
> Is this removed intentionally?
> 
> >   	return eth_err(port_id, ret);
> >   }
> > diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> > index 5bcfbb8..9cf0a03 100644
> > --- a/lib/librte_ethdev/rte_ethdev.h
> > +++ b/lib/librte_ethdev/rte_ethdev.h
> > @@ -970,6 +970,16 @@ struct rte_eth_txmode {
> >   };
> >   /**
> > + * A structure used to configure an RX packet segment to split.
> > + */
> > +struct rte_eth_rxseg {
> > +	struct rte_mempool *mp; /**< Memory pools to allocate segment from */

pools -> pool

> > +	uint16_t length; /**< Segment data length, configures split point. */
> > +	uint16_t offset; /**< Data offset from beginning of mbuf data buffer */
> > +	uint32_t reserved; /**< Reserved field */

To allow future additions to the structure, should the reserved field
always be set to 0? If yes, maybe it should be specified here and
checked in rte_eth_rx_queue_setup().

Some "." missing at the end of comments.

> > +};
> > +
> > +/**
> >    * A structure used to configure an RX ring of an Ethernet port.
> >    */
> >   struct rte_eth_rxconf {
> > @@ -977,13 +987,23 @@ struct rte_eth_rxconf {
> >   	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
> >   	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >   	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > +	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +	/**
> > +	 * The pointer to the array of segment descriptions, each element
> > +	 * describes the memory pool, maximal segment data length, initial
> > +	 * data offset from the beginning of data buffer in mbuf. This allow

allow -> allows
or maybe "This allows to specify" -> "This specifies"

> > +	 * to specify the dedicated properties for each segment in the receiving
> > +	 * buffer - pool, buffer offset, maximal segment size. The number of
> > +	 * segment descriptions in the array is specified by the rx_nseg
> > +	 * field.
> > +	 */
> 
> What do you think providing a short description here, and move above comment
> to abice "struct rte_eth_rxseg" struct?

+1

> 
> > +	struct rte_eth_rxseg *rx_seg;
> >   	/**
> >   	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >   	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> >   	 * fields on rte_eth_dev_info structure are allowed to be set.
> >   	 */
> >   	uint64_t offloads;
> > -
> 
> unrelated
> 
> >   	uint64_t reserved_64s[2]; /**< Reserved for future fields */
> >   	void *reserved_ptrs[2];   /**< Reserved for future fields */
> >   };
> > @@ -1260,6 +1280,7 @@ struct rte_eth_conf {
> >   #define DEV_RX_OFFLOAD_SCTP_CKSUM	0x00020000
> >   #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> >   #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
> > +#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> >   #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> >   				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > @@ -2027,6 +2048,41 @@ int rte_eth_dev_configure(uint16_t port_id, uint16_t nb_rx_queue,
> >    *   No need to repeat any bit in rx_conf->offloads which has already been
> >    *   enabled in rte_eth_dev_configure() at port level. An offloading enabled
> >    *   at port level can't be disabled at queue level.
> 
> Can it be possible to put a kind of marker here, like "@rx_seg & @rx_nseg",
> to clarify what are you talking about.
> 
> > + *   The configuration structure also contains the pointer to the array
> > + *   of the receiving buffer segment descriptions, each element describes
> > + *   the memory pool, maximal segment data length, initial data offset from
> > + *   the beginning of data buffer in mbuf. This allow to specify the dedicated
> > + *   properties for each segment in the receiving buffer - pool, buffer
> > + *   offset, maximal segment size. If RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT offload
> > + *   flag is configured the PMD will split the received packets into multiple
> > + *   segments according to the specification in the description array:
> > + *   - the first network buffer will be allocated from the memory pool,
> > + *     specified in the first segment description element, the second
> > + *     network buffer - from the pool in the second segment description
> > + *     element and so on. If there is no enough elements to describe
> > + *     the buffer for entire packet of maximal length the pool from the last
> > + *     valid element will be used to allocate the buffers from for the rest
> > + *     of segments.
> > + *   - the offsets from the segment description elements will provide the
> > + *     data offset from the buffer beginning except the first mbuf - for this
> > + *     one the offset is added to the RTE_PKTMBUF_HEADROOM to get actual
> > + *     offset from the buffer beginning. If there is no enough elements
> > + *     to describe the buffer for entire packet of maximal length the offsets
> > + *     for the rest of segment will be supposed to be zero.
> > + *   - the data length being received to each segment is limited by the
> > + *     length specified in the segment description element. The data receiving
> > + *     starts with filling up the first mbuf data buffer, if the specified
> > + *     maximal segment length is reached and there are data remaining
> > + *     (packet is longer than buffer in the first mbuf) the following data
> > + *     will be pushed to the next segment up to its own length. If the first
> > + *     two segments is not enough to store all the packet data the next
> > + *     (third) segment will be engaged and so on. If the length in the segment
> > + *     description element is zero the actual buffer size will be deduced
> > + *     from the appropriate memory pool properties. If there is no enough
> > + *     elements to describe the buffer for entire packet of maximal length
> > + *     the buffer size will be deduced from the pool of the last valid
> > + *     element for the all remaining segments.
> > + *
> 
> I think as a first thing the comment should clarify that if @rx_seg provided
> 'mb_pool' should be NULL, and if split Rx feature is not used "@rx_seg &
> @rx_nseg" should be NULL and 0.
> 
> Also above is too wordy, it is hard to follow. Like "@rx_seg & @rx_nseg" are
> only taken into account if application provides
> 'RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT' offload should be clearer to see, etc.
> Can you try to simplify it, perhpas moving some of above comments to the
> "struct rte_eth_rxseg" can work?
> 
> >    * @param mb_pool
> >    *   The pointer to the memory pool from which to allocate *rte_mbuf* network
> >    *   memory buffers to populate each descriptor of the receive ring.
> > diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> > index f8a0945..25f7cee 100644
> > --- a/lib/librte_ethdev/rte_ethdev_version.map
> > +++ b/lib/librte_ethdev/rte_ethdev_version.map
> > @@ -232,6 +232,7 @@ EXPERIMENTAL {
> >   	rte_eth_fec_get_capability;
> >   	rte_eth_fec_get;
> >   	rte_eth_fec_set;
> > +
> 
> unrelated
  
Slava Ovsiienko Oct. 14, 2020, 2:42 p.m. UTC | #3
Hi, Ferruh
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Wednesday, October 14, 2020 1:34
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
> Cc: thomasm@monjalon.net; stephen@networkplumber.org;
> olivier.matz@6wind.com; jerinjacobk@gmail.com;
> maxime.coquelin@redhat.com; david.marchand@redhat.com;
> arybchenko@solarflare.com
> Subject: Re: [PATCH v5 1/6] ethdev: introduce Rx buffer split
> 
[..snip..]
> 
> Can you please update deprecation notice too, to remove the notice?
> 
Yes, I missed the point, thank you for noticing.

> >   5 files changed, 155 insertions(+), 23 deletions(-)
> >
> > diff --git a/doc/guides/nics/features.rst
> > b/doc/guides/nics/features.rst index dd8c955..a45a9e8 100644
> > --- a/doc/guides/nics/features.rst
> > +* **[implements] rte_eth_dev_data**: ``buffer_split``.
> 
> What is implemented here?
> 
none, removed.

> > +* **[provides]   rte_eth_dev_info**:
> ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> > +* **[provides]   eth_dev_ops**: ``rxq_info_get:buffer_split``.
> 
> Is this correct?
Yes, the dedicated rx_burst routine supporting buffer split might
be engaged by PMD and might be reported via rxq_info_get().

> > +		rx_seg = &seg_single;
> > +		n_seg = 1;
> 
> Why setting 'rx_seg' & 'n_seg'? Why not leaving them NULL and 0 when not
> used?
> This was PMD can do NULL/0 check and can know they are not used.
Refactored, single pool (legacy) and new extended config check are
separated into dedicated branches.
 

> > -	rte_ethdev_trace_rxq_setup(port_id, rx_queue_id, nb_rx_desc, mp,
> > -		rx_conf, ret);
> 
> Is this removed intentionally?
> 
Missed statement, reverted back.

[..snip..]
Comments and descriptions rearranged and updated according to the comments, v6 is coming.

With best regards, Slava
  

Patch

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index dd8c955..a45a9e8 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -185,6 +185,21 @@  Supports receiving segmented mbufs.
 * **[related]    eth_dev_ops**: ``rx_pkt_burst``.
 
 
+.. _nic_features_buffer_split:
+
+Buffer Split on Rx
+------------------
+
+Scatters the packets being received on specified boundaries to segmented mbufs.
+
+* **[uses]       rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
+* **[implements] datapath**: ``Buffer Split functionality``.
+* **[implements] rte_eth_dev_data**: ``buffer_split``.
+* **[provides]   rte_eth_dev_info**: ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
+* **[provides]   eth_dev_ops**: ``rxq_info_get:buffer_split``.
+* **[related] API**: ``rte_eth_rx_queue_setup()``.
+
+
 .. _nic_features_lro:
 
 LRO
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index bcc0fc2..f67ec62 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -60,6 +60,12 @@  New Features
   Added the FEC API which provides functions for query FEC capabilities and
   current FEC mode from device. Also, API for configuring FEC mode is also provided.
 
+* **Introduced extended buffer description for receiving.**
+
+  Added the extended Rx queue setup routine providing the individual
+  descriptions for each Rx segment with maximal size, buffer offset and memory
+  pool to allocate data buffers from.
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
@@ -253,6 +259,9 @@  API Changes
   As the data of ``uint8_t`` will be truncated when queue number under
   a TC is greater than 256.
 
+* ethdev: Added fields rx_seg and rx_nseg to rte_eth_rxconf structure
+  to provide extended description of the receiving buffer.
+
 * vhost: Moved vDPA APIs from experimental to stable.
 
 * rawdev: Added a structure size parameter to the functions
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 892c246..7b64a6e 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -105,6 +105,9 @@  struct rte_eth_xstats_name_off {
 #define RTE_RX_OFFLOAD_BIT2STR(_name)	\
 	{ DEV_RX_OFFLOAD_##_name, #_name }
 
+#define RTE_ETH_RX_OFFLOAD_BIT2STR(_name)	\
+	{ RTE_ETH_RX_OFFLOAD_##_name, #_name }
+
 static const struct {
 	uint64_t offload;
 	const char *name;
@@ -128,9 +131,11 @@  struct rte_eth_xstats_name_off {
 	RTE_RX_OFFLOAD_BIT2STR(SCTP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
+#undef RTE_ETH_RX_OFFLOAD_BIT2STR
 
 #define RTE_TX_OFFLOAD_BIT2STR(_name)	\
 	{ DEV_TX_OFFLOAD_##_name, #_name }
@@ -1770,10 +1775,14 @@  struct rte_eth_dev *
 		       struct rte_mempool *mp)
 {
 	int ret;
+	uint16_t seg_idx;
 	uint32_t mbp_buf_size;
 	struct rte_eth_dev *dev;
 	struct rte_eth_dev_info dev_info;
 	struct rte_eth_rxconf local_conf;
+	const struct rte_eth_rxseg *rx_seg;
+	struct rte_eth_rxseg seg_single = { .mp = mp};
+	uint16_t n_seg;
 	void **rxq;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -1784,13 +1793,32 @@  struct rte_eth_dev *
 		return -EINVAL;
 	}
 
-	if (mp == NULL) {
-		RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
-		return -EINVAL;
-	}
-
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup, -ENOTSUP);
 
+	rx_seg = rx_conf->rx_seg;
+	n_seg = rx_conf->rx_nseg;
+	if (rx_seg == NULL) {
+		/* Exclude ambiguities about segment descrtiptions. */
+		if (n_seg) {
+			RTE_ETHDEV_LOG(ERR,
+				       "Non empty array with null pointer\n");
+			return -EINVAL;
+		}
+		rx_seg = &seg_single;
+		n_seg = 1;
+	} else {
+		if (n_seg == 0) {
+			RTE_ETHDEV_LOG(ERR,
+				       "Invalid zero descriptions number\n");
+			return -EINVAL;
+		}
+		if (mp) {
+			RTE_ETHDEV_LOG(ERR,
+				       "Memory pool duplicated definition\n");
+			return -EINVAL;
+		}
+	}
+
 	/*
 	 * Check the size of the mbuf data buffer.
 	 * This value must be provided in the private data of the memory pool.
@@ -1800,23 +1828,48 @@  struct rte_eth_dev *
 	if (ret != 0)
 		return ret;
 
-	if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
-		RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
-			mp->name, (int)mp->private_data_size,
-			(int)sizeof(struct rte_pktmbuf_pool_private));
-		return -ENOSPC;
-	}
-	mbp_buf_size = rte_pktmbuf_data_room_size(mp);
+	for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
+		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
+		uint32_t length = rx_seg[seg_idx].length;
+		uint32_t offset = rx_seg[seg_idx].offset;
+		uint32_t head_room = seg_idx ? 0 : RTE_PKTMBUF_HEADROOM;
 
-	if (mbp_buf_size < dev_info.min_rx_bufsize + RTE_PKTMBUF_HEADROOM) {
-		RTE_ETHDEV_LOG(ERR,
-			"%s mbuf_data_room_size %d < %d (RTE_PKTMBUF_HEADROOM=%d + min_rx_bufsize(dev)=%d)\n",
-			mp->name, (int)mbp_buf_size,
-			(int)(RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize),
-			(int)RTE_PKTMBUF_HEADROOM,
-			(int)dev_info.min_rx_bufsize);
-		return -EINVAL;
+		if (mpl == NULL) {
+			RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
+			return -EINVAL;
+		}
+
+		if (mpl->private_data_size <
+				sizeof(struct rte_pktmbuf_pool_private)) {
+			RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
+				mpl->name, (int)mpl->private_data_size,
+				(int)sizeof(struct rte_pktmbuf_pool_private));
+			return -ENOSPC;
+		}
+
+		mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
+		length = length ? length : (mbp_buf_size - head_room);
+		if (mbp_buf_size < length + offset + head_room) {
+			RTE_ETHDEV_LOG(ERR,
+				"%s mbuf_data_room_size %u < %u"
+				" (segment length=%u + segment offset=%u)\n",
+				mpl->name, mbp_buf_size,
+				length + offset, length, offset);
+			return -EINVAL;
+		}
 	}
+	/* Check the minimal buffer size for the single segment only. */
+	if (mp && (mbp_buf_size < dev_info.min_rx_bufsize +
+				  RTE_PKTMBUF_HEADROOM)) {
+		RTE_ETHDEV_LOG(ERR,
+			       "%s mbuf_data_room_size %u < %u "
+			       "(RTE_PKTMBUF_HEADROOM=%u + "
+			       "min_rx_bufsize(dev)=%u)\n",
+			       mp->name, mbp_buf_size,
+			       RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize,
+			       RTE_PKTMBUF_HEADROOM, dev_info.min_rx_bufsize);
+			return -EINVAL;
+		}
 
 	/* Use default specified by driver, if nb_rx_desc is zero */
 	if (nb_rx_desc == 0) {
@@ -1914,8 +1967,6 @@  struct rte_eth_dev *
 			dev->data->min_rx_buf_size = mbp_buf_size;
 	}
 
-	rte_ethdev_trace_rxq_setup(port_id, rx_queue_id, nb_rx_desc, mp,
-		rx_conf, ret);
 	return eth_err(port_id, ret);
 }
 
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 5bcfbb8..9cf0a03 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -970,6 +970,16 @@  struct rte_eth_txmode {
 };
 
 /**
+ * A structure used to configure an RX packet segment to split.
+ */
+struct rte_eth_rxseg {
+	struct rte_mempool *mp; /**< Memory pools to allocate segment from */
+	uint16_t length; /**< Segment data length, configures split point. */
+	uint16_t offset; /**< Data offset from beginning of mbuf data buffer */
+	uint32_t reserved; /**< Reserved field */
+};
+
+/**
  * A structure used to configure an RX ring of an Ethernet port.
  */
 struct rte_eth_rxconf {
@@ -977,13 +987,23 @@  struct rte_eth_rxconf {
 	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
+	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * The pointer to the array of segment descriptions, each element
+	 * describes the memory pool, maximal segment data length, initial
+	 * data offset from the beginning of data buffer in mbuf. This allow
+	 * to specify the dedicated properties for each segment in the receiving
+	 * buffer - pool, buffer offset, maximal segment size. The number of
+	 * segment descriptions in the array is specified by the rx_nseg
+	 * field.
+	 */
+	struct rte_eth_rxseg *rx_seg;
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
 	 * fields on rte_eth_dev_info structure are allowed to be set.
 	 */
 	uint64_t offloads;
-
 	uint64_t reserved_64s[2]; /**< Reserved for future fields */
 	void *reserved_ptrs[2];   /**< Reserved for future fields */
 };
@@ -1260,6 +1280,7 @@  struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_SCTP_CKSUM	0x00020000
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
+#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
@@ -2027,6 +2048,41 @@  int rte_eth_dev_configure(uint16_t port_id, uint16_t nb_rx_queue,
  *   No need to repeat any bit in rx_conf->offloads which has already been
  *   enabled in rte_eth_dev_configure() at port level. An offloading enabled
  *   at port level can't be disabled at queue level.
+ *   The configuration structure also contains the pointer to the array
+ *   of the receiving buffer segment descriptions, each element describes
+ *   the memory pool, maximal segment data length, initial data offset from
+ *   the beginning of data buffer in mbuf. This allow to specify the dedicated
+ *   properties for each segment in the receiving buffer - pool, buffer
+ *   offset, maximal segment size. If RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT offload
+ *   flag is configured the PMD will split the received packets into multiple
+ *   segments according to the specification in the description array:
+ *   - the first network buffer will be allocated from the memory pool,
+ *     specified in the first segment description element, the second
+ *     network buffer - from the pool in the second segment description
+ *     element and so on. If there is no enough elements to describe
+ *     the buffer for entire packet of maximal length the pool from the last
+ *     valid element will be used to allocate the buffers from for the rest
+ *     of segments.
+ *   - the offsets from the segment description elements will provide the
+ *     data offset from the buffer beginning except the first mbuf - for this
+ *     one the offset is added to the RTE_PKTMBUF_HEADROOM to get actual
+ *     offset from the buffer beginning. If there is no enough elements
+ *     to describe the buffer for entire packet of maximal length the offsets
+ *     for the rest of segment will be supposed to be zero.
+ *   - the data length being received to each segment is limited by the
+ *     length specified in the segment description element. The data receiving
+ *     starts with filling up the first mbuf data buffer, if the specified
+ *     maximal segment length is reached and there are data remaining
+ *     (packet is longer than buffer in the first mbuf) the following data
+ *     will be pushed to the next segment up to its own length. If the first
+ *     two segments is not enough to store all the packet data the next
+ *     (third) segment will be engaged and so on. If the length in the segment
+ *     description element is zero the actual buffer size will be deduced
+ *     from the appropriate memory pool properties. If there is no enough
+ *     elements to describe the buffer for entire packet of maximal length
+ *     the buffer size will be deduced from the pool of the last valid
+ *     element for the all remaining segments.
+ *
  * @param mb_pool
  *   The pointer to the memory pool from which to allocate *rte_mbuf* network
  *   memory buffers to populate each descriptor of the receive ring.
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index f8a0945..25f7cee 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -232,6 +232,7 @@  EXPERIMENTAL {
 	rte_eth_fec_get_capability;
 	rte_eth_fec_get;
 	rte_eth_fec_set;
+
 };
 
 INTERNAL {