[v8,1/6] ethdev: introduce Rx buffer split

Message ID 1602834519-8696-2-git-send-email-viacheslavo@nvidia.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Series: ethdev: introduce Rx buffer split

Checks

Context: ci/checkpatch | Check: success | Description: coding style OK

Commit Message

Slava Ovsiienko Oct. 16, 2020, 7:48 a.m. UTC
  The DPDK datapath in the transmit direction is very flexible.
An application can build multi-segment packets and manage
almost all data aspects - the memory pools the segments are
allocated from, the segment lengths, the memory attributes
like external buffers registered for DMA, etc.

In the receive direction the datapath is much less flexible:
an application can only specify the memory pool to configure
the receive queue, and nothing more. To extend the receive
datapath capabilities, it is proposed to add a way to provide
extended information on how to split the packets being received.

The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in the device
capabilities is introduced to let a PMD report to the application
that it supports splitting Rx packets into configurable segments.
Prior to invoking the rte_eth_rx_queue_setup() routine the
application should check the RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.
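
For illustration only (not part of the patch), such a check could
look like the following sketch, with error handling reduced to
returning the error code:

    static int
    check_buffer_split_support(uint16_t port_id)
    {
        struct rte_eth_dev_info dev_info;
        int ret = rte_eth_dev_info_get(port_id, &dev_info);

        if (ret != 0)
            return ret;
        if ((dev_info.rx_offload_capa &
             RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT) == 0)
            return -ENOTSUP; /* fall back to single-pool Rx */
        return 0;
    }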

The following structure is introduced to specify the Rx packet
segment for RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT offload:

struct rte_eth_rxseg_split {

    struct rte_mempool *mp; /* memory pool to allocate segment from */
    uint16_t length; /* segment maximal data length,
                        configures the "split point" */
    uint16_t offset; /* data offset from beginning
                        of mbuf data buffer */
    uint32_t reserved; /* reserved field */
};

The segment descriptions are added to the rte_eth_rxconf structure:
   rx_seg - pointer to the array of segment descriptions; each element
            describes the memory pool, maximal data length, and initial
            data offset from the beginning of the data buffer in the
            mbuf. This array allows specifying different settings for
            each segment individually.
   rx_nseg - number of elements in the array

If the extended segment descriptions are provided via these new
fields, the mp parameter of rte_eth_rx_queue_setup() must be
specified as NULL to avoid ambiguity.

There are two options to specify the Rx buffer configuration
(see the setup sketch after this list):
- mp is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is zero:
  this is the backward-compatible configuration, it follows the
  existing implementation and provides a single pool with no
  description of segment sizes and offsets.
- mp is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not
  zero: this provides the extended configuration, individually for
  each segment.
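
A minimal sketch of the extended setup (illustrative only, not part
of the patch; it assumes two mempools pool_hdr and pool_data, a
dev_info obtained via rte_eth_dev_info_get(), port_id, nb_rxd and
socket_id defined elsewhere, and the rte_eth_rxseg layout of this
patch version with the conf.split union member; DEV_RX_OFFLOAD_SCATTER
is the then-current name of the scatter offload mentioned below):

    struct rte_eth_rxseg rx_seg[2];
    struct rte_eth_rxconf rxconf = dev_info.default_rxconf;
    int ret;

    memset(rx_seg, 0, sizeof(rx_seg));
    /* seg0: up to 64 bytes of headers into mbufs from pool_hdr */
    rx_seg[0].conf.split.mp = pool_hdr;
    rx_seg[0].conf.split.length = 64;
    /* seg1: the rest of the packet into mbufs from pool_data;
     * zero length means "deduce the size from the pool"
     */
    rx_seg[1].conf.split.mp = pool_data;
    rx_seg[1].conf.split.length = 0;

    rxconf.rx_seg = rx_seg;
    rxconf.rx_nseg = RTE_DIM(rx_seg);
    /* both offloads are required when rx_nseg is greater than one */
    rxconf.offloads |= RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT |
                       DEV_RX_OFFLOAD_SCATTER;

    /* mp must be NULL when the extended description is used */
    ret = rte_eth_rx_queue_setup(port_id, 0, nb_rxd, socket_id,
                                 &rxconf, NULL);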

If the Rx queue is configured with the new settings, the PMD will
split each received packet into multiple segments according to the
specification in the description array, pushing the data to mbufs
with the specified attributes.

For example, let's suppose we configured the Rx queue with the
following segments:
    seg0 - pool0, len0=14B, off0=2B
    seg1 - pool1, len1=20B, off1=128B
    seg2 - pool2, len2=20B, off2=0B
    seg3 - pool3, len3=512B, off3=0B

A packet 46 bytes long will look like the following:
    seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
    seg1 - 20B long @ 128 in mbuf from pool1
    seg2 - 12B long @ 0 in mbuf from pool2

A packet 1500 bytes long will look like the following:
    seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
    seg1 - 20B @ 128 in mbuf from pool1
    seg2 - 20B @ 0 in mbuf from pool2
    seg3 - 512B @ 0 in mbuf from pool3
    seg4 - 512B @ 0 in mbuf from pool3
    seg5 - 422B @ 0 in mbuf from pool3
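
From the application's point of view the result is an ordinary
chained mbuf. A sketch of walking the segments (the process()
helper is a hypothetical application hook):

    struct rte_mbuf *seg;

    /* pkt is one of the mbufs returned by rte_eth_rx_burst() */
    for (seg = pkt; seg != NULL; seg = seg->next) {
        /* each segment's data starts at its configured offset
         * within the mbuf data buffer
         */
        void *data = rte_pktmbuf_mtod(seg, void *);

        process(data, seg->data_len);
    }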

The RTE_ETH_RX_OFFLOAD_SCATTER offload must also be present and
configured to support the new buffer split feature (if rx_nseg
is greater than one).

The split limitations imposed by the underlying PMD are reported
in the newly introduced rte_eth_dev_info->rx_seg_capa field.
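
A sketch of validating a request against these limits, mirroring
the checks the patch adds to rte_eth_rx_queue_setup() (field names
follow this version's rte_eth_rxseg_capa; the multi_pools flag
would be checked analogously):

    static int
    check_split_limits(const struct rte_eth_dev_info *di,
                       uint16_t n_seg, uint16_t offset)
    {
        const struct rte_eth_rxseg_capa *capa = &di->rx_seg_capa;
        uint32_t mask = (1u << capa->offset_align_log2) - 1;

        if (n_seg > capa->max_seg)
            return -EINVAL;  /* too many segments requested */
        if (offset != 0 && capa->offset_allowed == 0)
            return -ENOTSUP; /* per-segment offsets unsupported */
        if ((offset & mask) != 0)
            return -EINVAL;  /* offset violates required alignment */
        return 0;
    }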

The new approach allows splitting the ingress packets into
multiple parts placed in memory with different attributes.
For example, the packet headers can be placed in the embedded
data buffers within mbufs and the application data in external
buffers attached to mbufs allocated from different memory pools.
The memory attributes of the split parts may differ as well - for
example, the application data may be placed in external memory
located on a dedicated physical device, say a GPU or NVMe. This
improves the flexibility of the DPDK receive datapath while
preserving compatibility with the existing API.

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 doc/guides/nics/features.rst           |  15 ++++
 doc/guides/rel_notes/deprecation.rst   |   5 --
 doc/guides/rel_notes/release_20_11.rst |   9 ++
 lib/librte_ethdev/rte_ethdev.c         | 152 +++++++++++++++++++++++++++------
 lib/librte_ethdev/rte_ethdev.h         |  88 ++++++++++++++++++-
 5 files changed, 238 insertions(+), 31 deletions(-)
  

Comments

Andrew Rybchenko Oct. 16, 2020, 8:51 a.m. UTC | #1
On 10/16/20 10:48 AM, Viacheslav Ovsiienko wrote:
> The DPDK datapath in the transmit direction is very flexible.
[snip]
> ---
>  doc/guides/nics/features.rst           |  15 ++++
>  doc/guides/rel_notes/deprecation.rst   |   5 --
>  doc/guides/rel_notes/release_20_11.rst |   9 ++
>  lib/librte_ethdev/rte_ethdev.c         | 152 +++++++++++++++++++++++++++------
>  lib/librte_ethdev/rte_ethdev.h         |  88 ++++++++++++++++++-
>  5 files changed, 238 insertions(+), 31 deletions(-)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index dd8c955..832ea3b 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -185,6 +185,21 @@ Supports receiving segmented mbufs.
>  * **[related]    eth_dev_ops**: ``rx_pkt_burst``.
>  
>  
> +.. _nic_features_buffer_split:
> +
> +Buffer Split on Rx
> +------------------
> +
> +Scatters the packets being received on specified boundaries to segmented mbufs.
> +
> +* **[uses]       rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> +* **[uses]       rte_eth_rxconf**: ``rx_conf.rx_seg, rx_conf.rx_nseg``.
> +* **[implements] datapath**: ``Buffer Split functionality``.
> +* **[provides]   rte_eth_dev_info**: ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> +* **[provides]   eth_dev_ops**: ``rxq_info_get:buffer_split``.
> +* **[related] API**: ``rte_eth_rx_queue_setup()``.
> +
> +
>  .. _nic_features_lro:
>  
>  LRO
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index 584e720..232cd54 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -138,11 +138,6 @@ Deprecation Notices
>    In 19.11 PMDs will still update the field even when the offload is not
>    enabled.
>  
> -* ethdev: Add new fields to ``rte_eth_rxconf`` to configure the receiving
> -  queues to split ingress packets into multiple segments according to the
> -  specified lengths into the buffers allocated from the specified
> -  memory pools. The backward compatibility to existing API is preserved.
> -
>  * ethdev: ``rx_descriptor_done`` dev_ops and ``rte_eth_rx_descriptor_done``
>    will be removed in 21.11.
>    Existing ``rte_eth_rx_descriptor_status`` and ``rte_eth_tx_descriptor_status``
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index bcc0fc2..bcc2479 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -60,6 +60,12 @@ New Features
>    Added the FEC API which provides functions for query FEC capabilities and
>    current FEC mode from device. Also, API for configuring FEC mode is also provided.
>  
> +* **Introduced extended buffer description for receiving.**
> +
> +  Added the extended Rx buffer description for Rx queue setup routine
> +  providing the individual settings for each Rx segment with maximal size,
> +  buffer offset and memory pool to allocate data buffers from.
> +
>  * **Updated Broadcom bnxt driver.**
>  
>    Updated the Broadcom bnxt driver with new features and improvements, including:
> @@ -253,6 +259,9 @@ API Changes
>    As the data of ``uint8_t`` will be truncated when queue number under
>    a TC is greater than 256.
>  
> +* ethdev: Added fields rx_seg and rx_nseg to rte_eth_rxconf structure
> +  to provide extended description of the receiving buffer.
> +
>  * vhost: Moved vDPA APIs from experimental to stable.
>  
>  * rawdev: Added a structure size parameter to the functions
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 892c246..0eda3f4 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -105,6 +105,9 @@ struct rte_eth_xstats_name_off {
>  #define RTE_RX_OFFLOAD_BIT2STR(_name)	\
>  	{ DEV_RX_OFFLOAD_##_name, #_name }
>  
> +#define RTE_ETH_RX_OFFLOAD_BIT2STR(_name)	\
> +	{ RTE_ETH_RX_OFFLOAD_##_name, #_name }
> +
>  static const struct {
>  	uint64_t offload;
>  	const char *name;
> @@ -128,9 +131,11 @@ struct rte_eth_xstats_name_off {
>  	RTE_RX_OFFLOAD_BIT2STR(SCTP_CKSUM),
>  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> +	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
>  };
>  
>  #undef RTE_RX_OFFLOAD_BIT2STR
> +#undef RTE_ETH_RX_OFFLOAD_BIT2STR
>  
>  #define RTE_TX_OFFLOAD_BIT2STR(_name)	\
>  	{ DEV_TX_OFFLOAD_##_name, #_name }
> @@ -1763,6 +1768,77 @@ struct rte_eth_dev *
>  	return ret;
>  }
>  
> +static int
> +rte_eth_rx_queue_check_split(const struct rte_eth_rxseg *rx_seg,
> +			     uint16_t n_seg, uint32_t *mbp_buf_size,
> +			     const struct rte_eth_dev_info *dev_info)
> +{
> +	const struct rte_eth_rxseg_capa *seg_capa = &dev_info->rx_seg_capa;
> +	struct rte_mempool *mp_first;
> +	uint32_t offset_mask;
> +	uint16_t seg_idx;
> +
> +	if (n_seg > seg_capa->max_seg) {
> +		RTE_ETHDEV_LOG(ERR,
> +			       "Requested Rx segments %u exceed supported %u\n",
> +			       n_seg, seg_capa->max_seg);
> +		return -EINVAL;
> +	}
> +	/*
> +	 * Check the sizes and offsets against buffer sizes
> +	 * for each segment specified in extended configuration.
> +	 */
> +	mp_first = rx_seg[0].conf.split.mp;
> +	offset_mask = (1u << seg_capa->offset_align_log2) - 1;
> +	for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
> +		struct rte_mempool *mpl = rx_seg[seg_idx].conf.split.mp;
> +		uint32_t length = rx_seg[seg_idx].conf.split.length;
> +		uint32_t offset = rx_seg[seg_idx].conf.split.offset;
> +
> +		if (mpl == NULL) {
> +			RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
> +			return -EINVAL;
> +		}
> +		if (seg_idx != 0 && mp_first != mpl &&
> +		    seg_capa->multi_pools == 0) {
> +			RTE_ETHDEV_LOG(ERR, "Receiving to multiple pools is not supported\n");
> +			return -ENOTSUP;
> +		}
> +		if (offset != 0) {
> +			if (seg_capa->offset_allowed == 0) {
> +				RTE_ETHDEV_LOG(ERR, "Rx segmentation with offset is not supported\n");
> +				return -ENOTSUP;
> +			}
> +			if (offset & offset_mask) {
> +				RTE_ETHDEV_LOG(ERR, "Rx segmentat invalid offset alignment %u, %u\n",
> +					       offset,
> +					       seg_capa->offset_align_log2);
> +				return -EINVAL;
> +			}
> +		}
> +		if (mpl->private_data_size <
> +			sizeof(struct rte_pktmbuf_pool_private)) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "%s private_data_size %u < %u\n",
> +				       mpl->name, mpl->private_data_size,
> +				       (unsigned int)sizeof
> +					(struct rte_pktmbuf_pool_private));
> +			return -ENOSPC;
> +		}
> +		offset += seg_idx != 0 ? 0 : RTE_PKTMBUF_HEADROOM;
> +		*mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> +		length = length != 0 ? length : *mbp_buf_size;
> +		if (*mbp_buf_size < length + offset) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "%s mbuf_data_room_size %u < %u (segment length=%u + segment offset=%u)\n",
> +				       mpl->name, *mbp_buf_size,
> +				       length + offset, length, offset);
> +			return -EINVAL;
> +		}
> +	}
> +	return 0;
> +}
> +
>  int
>  rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
>  		       uint16_t nb_rx_desc, unsigned int socket_id,
> @@ -1784,38 +1860,64 @@ struct rte_eth_dev *
>  		return -EINVAL;
>  	}
>  
> -	if (mp == NULL) {
> -		RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> -		return -EINVAL;
> -	}
> -
>  	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup, -ENOTSUP);
>  
> -	/*
> -	 * Check the size of the mbuf data buffer.
> -	 * This value must be provided in the private data of the memory pool.
> -	 * First check that the memory pool has a valid private data.
> -	 */
>  	ret = rte_eth_dev_info_get(port_id, &dev_info);
>  	if (ret != 0)
>  		return ret;
>  
> -	if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
> -		RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> -			mp->name, (int)mp->private_data_size,
> -			(int)sizeof(struct rte_pktmbuf_pool_private));
> -		return -ENOSPC;
> -	}
> -	mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> +	if (mp != NULL) {
> +		/* Single pool configuration check. */
> +		if (rx_conf->rx_nseg != 0) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "Ambiguous segment configuration\n");
> +			return -EINVAL;
> +		}
> +		/*
> +		 * Check the size of the mbuf data buffer, this value
> +		 * must be provided in the private data of the memory pool.
> +		 * First check that the memory pool(s) has a valid private data.
> +		 */
> +		if (mp->private_data_size <
> +				sizeof(struct rte_pktmbuf_pool_private)) {
> +			RTE_ETHDEV_LOG(ERR, "%s private_data_size %u < %u\n",
> +				mp->name, mp->private_data_size,
> +				(unsigned int)
> +				sizeof(struct rte_pktmbuf_pool_private));
> +			return -ENOSPC;
> +		}
> +		mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> +		if (mbp_buf_size < dev_info.min_rx_bufsize +
> +				   RTE_PKTMBUF_HEADROOM) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "%s mbuf_data_room_size %u < %u (RTE_PKTMBUF_HEADROOM=%u + min_rx_bufsize(dev)=%u)\n",
> +				       mp->name, mbp_buf_size,
> +				       RTE_PKTMBUF_HEADROOM +
> +				       dev_info.min_rx_bufsize,
> +				       RTE_PKTMBUF_HEADROOM,
> +				       dev_info.min_rx_bufsize);
> +			return -EINVAL;
> +		}
> +	} else {
> +		const struct rte_eth_rxseg *rx_seg = rx_conf->rx_seg;
> +		uint16_t n_seg = rx_conf->rx_nseg;
>  
> -	if (mbp_buf_size < dev_info.min_rx_bufsize + RTE_PKTMBUF_HEADROOM) {
> -		RTE_ETHDEV_LOG(ERR,
> -			"%s mbuf_data_room_size %d < %d (RTE_PKTMBUF_HEADROOM=%d + min_rx_bufsize(dev)=%d)\n",
> -			mp->name, (int)mbp_buf_size,
> -			(int)(RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize),
> -			(int)RTE_PKTMBUF_HEADROOM,
> -			(int)dev_info.min_rx_bufsize);
> -		return -EINVAL;
> +		/* Extended multi-segment configuration check. */
> +		if (rx_conf->rx_seg == NULL || rx_conf->rx_nseg == 0) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "Memory pool is null and no extended configuration provided\n");
> +			return -EINVAL;
> +		}
> +		if (rx_conf->offloads & RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT) {
> +			ret = rte_eth_rx_queue_check_split(rx_seg, n_seg,
> +							   &mbp_buf_size,
> +							   &dev_info);
> +			if (ret != 0)
> +				return ret;
> +		} else {
> +			RTE_ETHDEV_LOG(ERR, "No Rx segmentation offload configured\n");
> +			return -EINVAL;
> +		}
>  	}
>  
>  	/* Use default specified by driver, if nb_rx_desc is zero */
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 5bcfbb8..6f18bec 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -970,6 +970,27 @@ struct rte_eth_txmode {
>  };
>  
>  /**
> + * A structure used to configure an Rx packet segment to split.
> + */
> +struct rte_eth_rxseg_split {
> +	struct rte_mempool *mp; /**< Memory pool to allocate segment from. */
> +	uint16_t length; /**< Segment data length, configures split point. */
> +	uint16_t offset; /**< Data offset from beginning of mbuf data buffer. */
> +	uint32_t reserved; /**< Reserved field. */
> +};
> +
> +/**
> + * A common structure used to describe Rx packet segment properties.
> + */
> +struct rte_eth_rxseg {
> +	union {

Why not just 'union rte_eth_rxseg' ?

> +		/* The settings for buffer split offload. */
> +		struct rte_eth_rxseg_split split;

Pointer to a split table must be here. I.e.
struct rte_eth_rxseg_split *split;
Also it must be specified how the array is terminated.
We need either a count or a defined last-item condition
(mp == NULL?)

> +		/* The other features settings should be added here. */
> +	} conf;
> +};



> +
> +/**
>   * A structure used to configure an RX ring of an Ethernet port.
>   */
>  struct rte_eth_rxconf {
> @@ -977,6 +998,46 @@ struct rte_eth_rxconf {
>  	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
>  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> +	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	/**
> +	 * Points to the array of segment descriptions. Each array element
> +	 * describes the properties for each segment in the receiving
> +	 * buffer according to feature descripting structure.
> +	 *
> +	 * The supported capabilities of receiving segmentation is reported
> +	 * in rte_eth_dev_info ->rx_seg_capa field.
> +	 *
> +	 * If RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag is set in offloads field,
> +	 * the PMD will split the received packets into multiple segments
> +	 * according to the specification in the description array:
> +	 *
> +	 * - the first network buffer will be allocated from the memory pool,
> +	 *   specified in the first array element, the second buffer, from the
> +	 *   pool in the second element, and so on.
> +	 *
> +	 * - the offsets from the segment description elements specify
> +	 *   the data offset from the buffer beginning except the first mbuf.
> +	 *   For this one the offset is added with RTE_PKTMBUF_HEADROOM.
> +	 *
> +	 * - the lengths in the elements define the maximal data amount
> +	 *   being received to each segment. The receiving starts with filling
> +	 *   up the first mbuf data buffer up to specified length. If the
> +	 *   there are data remaining (packet is longer than buffer in the first
> +	 *   mbuf) the following data will be pushed to the next segment
> +	 *   up to its own length, and so on.
> +	 *
> +	 * - If the length in the segment description element is zero
> +	 *   the actual buffer size will be deduced from the appropriate
> +	 *   memory pool properties.
> +	 *
> +	 * - if there is not enough elements to describe the buffer for entire
> +	 *   packet of maximal length the following parameters will be used
> +	 *   for the all remaining segments:
> +	 *     - pool from the last valid element
> +	 *     - the buffer size from this pool
> +	 *     - zero offset
> +	 */
> +	struct rte_eth_rxseg *rx_seg;

It must not be a pointer. It looks really strange this way,
taking into account that it is a union in fact.
Also, why is it put here in the middle of the existing structure?
IMHO it should be added after offloads.

>  	/**
>  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1260,6 +1321,7 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_SCTP_CKSUM	0x00020000
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
> +#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
>  
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> @@ -1376,6 +1438,17 @@ struct rte_eth_switch_info {
>  };
>  
>  /**
> + * Ethernet device Rx buffer segmentation capabilities.
> + */
> +__extension__
> +struct rte_eth_rxseg_capa {
> +	uint16_t max_seg; /**< Maximum amount of segments to split. */

Maybe 'max_segs' to avoid confusion vs. the maximum segment length.

> +	uint16_t multi_pools:1; /**< Supports receiving to multiple pools.*/
> +	uint16_t offset_allowed:1; /**< Supports buffer offsets. */
> +	uint16_t offset_align_log2:4; /**< Required offset alignment. */

4 bits are even insufficient to specify cache-line alignment.
IMHO at least 8 bits are required.

Consider putting the 32-bit-wide bit-fields at the start of the structure.
Then: max_segs (16), offset_align_log2 (8), plus reserved (8).

> +};
> +
> +/**
>   * Ethernet device information
>   */
>  
> @@ -1403,6 +1476,7 @@ struct rte_eth_dev_info {
>  	/** Maximum number of hash MAC addresses for MTA and UTA. */
>  	uint16_t max_vfs; /**< Maximum number of VFs. */
>  	uint16_t max_vmdq_pools; /**< Maximum number of VMDq pools. */
> +	struct rte_eth_rxseg_capa rx_seg_capa; /**< Segmentation capability.*/

Why is it put in the middle of the existing structure?

>  	uint64_t rx_offload_capa;
>  	/**< All RX offload capabilities including all per-queue ones */
>  	uint64_t tx_offload_capa;
> @@ -2027,9 +2101,21 @@ int rte_eth_dev_configure(uint16_t port_id, uint16_t nb_rx_queue,
>   *   No need to repeat any bit in rx_conf->offloads which has already been
>   *   enabled in rte_eth_dev_configure() at port level. An offloading enabled
>   *   at port level can't be disabled at queue level.
> + *   The configuration structure also contains the pointer to the array
> + *   of the receiving buffer segment descriptions, see rx_seg and rx_nseg
> + *   fields, this extended configuration might be used by split offloads like
> + *   RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT. If mp_pool is not NULL,
> + *   the extended configuration fields must be set to NULL and zero.
>   * @param mb_pool
>   *   The pointer to the memory pool from which to allocate *rte_mbuf* network
> - *   memory buffers to populate each descriptor of the receive ring.
> + *   memory buffers to populate each descriptor of the receive ring. There are
> + *   two options to provide Rx buffer configuration:
> + *   - single pool:
> + *     mb_pool is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is 0.
> + *   - multiple segments description:
> + *     mb_pool is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not 0.
> + *     Taken only if flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is set in offloads.
> + *
>   * @return
>   *   - 0: Success, receive queue correctly set up.
>   *   - -EIO: if device is removed.
>
  
Andrew Rybchenko Oct. 16, 2020, 8:58 a.m. UTC | #2
On 10/16/20 11:51 AM, Andrew Rybchenko wrote:
> On 10/16/20 10:48 AM, Viacheslav Ovsiienko wrote:
[snip]

With below review notes processed:
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>

[snip]

Slava Ovsiienko Oct. 16, 2020, 9:15 a.m. UTC | #3
Hi, Andrew

> -----Original Message-----
> From: Andrew Rybchenko <arybchenko@solarflare.com>
> Sent: Friday, October 16, 2020 11:52
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
> Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> stephen@networkplumber.org; ferruh.yigit@intel.com;
> olivier.matz@6wind.com; jerinjacobk@gmail.com;
> maxime.coquelin@redhat.com; david.marchand@redhat.com
> Subject: Re: [PATCH v8 1/6] ethdev: introduce Rx buffer split
> 
> On 10/16/20 10:48 AM, Viacheslav Ovsiienko wrote:
> > The DPDK datapath in the transmit direction is very flexible.
> > An application can build the multi-segment packet and manages almost
> > all data aspects - the memory pools where segments are allocated from,
> > the segment lengths, the memory attributes like external buffers,
> > registered for DMA, etc.
> >
[snip]
> > +struct rte_eth_rxseg {
> > +	union {
> 
> Why not just 'union rte_eth_rxseg' ?
> 
> > +		/* The settings for buffer split offload. */
> > +		struct rte_eth_rxseg_split split;
> 
> Pointer to a split table must be here. I.e.
> struct rte_eth_rxseg_split *split;
OK, will try to simplify with that, thanks.

> Also it must be specified how the array is terminated.
> We need either a count or a defined last-item condition (mp == NULL?)

We have one, please see: "rte_eth_rxconf->rx_nseg"
> 
> > +		/* The other features settings should be added here. */
> > +	} conf;
> > +};
> 
> 
> 
> > +
> > +/**
> >   * A structure used to configure an RX ring of an Ethernet port.
> >   */
> >  struct rte_eth_rxconf {
> > @@ -977,6 +998,46 @@ struct rte_eth_rxconf {
> >  	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
> >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available.
> */
> > +	struct rte_eth_rxseg *rx_seg;
> 
> It must not be a pointer. It looks really strange this way taking into account
> that it is a union in fact.
> Also, why is it put here in the middle of the existing structure?
> IMHO it should be added after offloads.
OK, agree, will move.

> 
> >  	/**
> >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa @@
> > -1260,6 +1321,7 @@ struct rte_eth_conf {
> >  #define DEV_RX_OFFLOAD_SCTP_CKSUM	0x00020000
> >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> >  #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
> > +#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> >
> >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> >  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > @@ -1376,6 +1438,17 @@ struct rte_eth_switch_info {  };
> >
> >  /**
> > + * Ethernet device Rx buffer segmentation capabilities.
> > + */
> > +__extension__
> > +struct rte_eth_rxseg_capa {
> > +	uint16_t max_seg; /**< Maximum amount of segments to split. */
> 
> Maybe 'max_segs' to avoid confusion vs. the maximum segment length.
> 
OK, max_nseg would be more appropriate.

> > +	uint16_t multi_pools:1; /**< Supports receiving to multiple pools.*/
> > +	uint16_t offset_allowed:1; /**< Supports buffer offsets. */
> > +	uint16_t offset_align_log2:4; /**< Required offset alignment. */
> 
> 4 bits are even insufficient to specify cache-line alignment.
> IMHO at least 8 bits are required.

4 bits seem to be quite sufficient: the field is a log2 value telling how many
LSBs of the offset must be zero. For example, offset_align_log2 = 6 requires
64B (cache-line) aligned offsets, and the maximum encodable value, 2^15 = 32K,
covers all reasonable alignments for the uint16_t offset type.

> 
> Consider putting the 32-bit-wide bit-fields at the start of the structure.
> Then: max_segs (16), offset_align_log2 (8), plus reserved (8).
OK, np.

With best regards, Slava
  
Ferruh Yigit Oct. 16, 2020, 9:19 a.m. UTC | #4
On 10/16/2020 8:48 AM, Viacheslav Ovsiienko wrote:
> The DPDK datapath in the transmit direction is very flexible.
> An application can build the multi-segment packet and manages
> almost all data aspects - the memory pools where segments
> are allocated from, the segment lengths, the memory attributes
> like external buffers, registered for DMA, etc.
> 
> In the receiving direction, the datapath is much less flexible,
> an application can only specify the memory pool to configure the
> receiving queue and nothing more. In order to extend receiving
> datapath capabilities it is proposed to add the way to provide
> extended information how to split the packets being received.
> 
> The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
> capabilities is introduced to present the way for PMD to report to
> application about supporting Rx packet split to configurable
> segments. Prior invoking the rte_eth_rx_queue_setup() routine
> application should check RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.
> 
> The following structure is introduced to specify the Rx packet
> segment for RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT offload:
> 
> struct rte_eth_rxseg_split {
> 
>      struct rte_mempool *mp; /* memory pools to allocate segment from */
>      uint16_t length; /* segment maximal data length,
> 		       	configures "split point" */
>      uint16_t offset; /* data offset from beginning
> 		       	of mbuf data buffer */
>      uint32_t reserved; /* reserved field */
> };
> 
> The segment descriptions are added to the rte_eth_rxconf structure:
>     rx_seg - pointer the array of segment descriptions, each element
>               describes the memory pool, maximal data length, initial
>               data offset from the beginning of data buffer in mbuf.
> 	     This array allows to specify the different settings for
> 	     each segment in individual fashion.
>     rx_nseg - number of elements in the array
> 
> If the extended segment descriptions is provided with these new
> fields the mp parameter of the rte_eth_rx_queue_setup must be
> specified as NULL to avoid ambiguity.
> 
> There are two options to specify Rx buffer configuration:
> - mp is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is zero,
>    it is compatible configuration, follows existing implementation,
>    provides single pool and no description for segment sizes
>    and offsets.
> - mp is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not
>    zero, it provides the extended configuration, individually for
>    each segment.
> 
> f the Rx queue is configured with new settings the packets being
> received will be split into multiple segments pushed to the mbufs
> with specified attributes. The PMD will split the received packets
> into multiple segments according to the specification in the
> description array.
> 
> For example, let's suppose we configured the Rx queue with the
> following segments:
>      seg0 - pool0, len0=14B, off0=2
>      seg1 - pool1, len1=20B, off1=128B
>      seg2 - pool2, len2=20B, off2=0B
>      seg3 - pool3, len3=512B, off3=0B
> 
> The packet 46 bytes long will look like the following:
>      seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>      seg1 - 20B long @ 128 in mbuf from pool1
>      seg2 - 12B long @ 0 in mbuf from pool2
> 
> The packet 1500 bytes long will look like the following:
>      seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>      seg1 - 20B @ 128 in mbuf from pool1
>      seg2 - 20B @ 0 in mbuf from pool2
>      seg3 - 512B @ 0 in mbuf from pool3
>      seg4 - 512B @ 0 in mbuf from pool3
>      seg5 - 422B @ 0 in mbuf from pool3
> 
> The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
> configured to support the new buffer split feature (if rx_nseg
> is greater than one).
> 
> The split limitations imposed by the underlying PMD are reported
> in the newly introduced rte_eth_dev_info->rx_seg_capa field.
> 
> The new approach would allow splitting the ingress packets into
> multiple parts pushed to the memory with different attributes.
> For example, the packet headers can be pushed to the embedded
> data buffers within mbufs and the application data into
> the external buffers attached to mbufs allocated from the
> different memory pools. The memory attributes for the split
> parts may differ as well - for example, the application data
> may be pushed into external memory located on a dedicated
> physical device, say a GPU or NVMe. This would improve the DPDK
> receiving datapath flexibility while preserving compatibility
> with the existing API.
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>

<...>

> +/**
>    * A structure used to configure an RX ring of an Ethernet port.
>    */
>   struct rte_eth_rxconf {
> @@ -977,6 +998,46 @@ struct rte_eth_rxconf {
>   	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
>   	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>   	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> +	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	/**
> +	 * Points to the array of segment descriptions. Each array element
> +	 * describes the properties of each segment in the receiving
> +	 * buffer according to the feature-describing structure.
> +	 *
> +	 * The supported Rx segmentation capabilities are reported
> +	 * in the rte_eth_dev_info->rx_seg_capa field.
> +	 *
> +	 * If the RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag is set in the offloads
> +	 * field, the PMD will split the received packets into multiple
> +	 * segments according to the specification in the description array:
> +	 *
> +	 * - the first network buffer will be allocated from the memory pool
> +	 *   specified in the first array element, the second buffer from the
> +	 *   pool in the second element, and so on.
> +	 *
> +	 * - the offsets from the segment description elements specify
> +	 *   the data offset from the buffer beginning, except for the first
> +	 *   mbuf. For this one, RTE_PKTMBUF_HEADROOM is added to the offset.
> +	 *
> +	 * - the lengths in the elements define the maximal amount of data
> +	 *   received into each segment. Receiving starts with filling up the
> +	 *   first mbuf data buffer up to the specified length. If there is
> +	 *   data remaining (the packet is longer than the buffer in the first
> +	 *   mbuf), the following data will be pushed to the next segment
> +	 *   up to its own length, and so on.
> +	 *
> +	 * - if the length in a segment description element is zero,
> +	 *   the actual buffer size will be deduced from the corresponding
> +	 *   memory pool properties.
> +	 *
> +	 * - if there are not enough elements to describe the buffer for an
> +	 *   entire packet of maximal length, the following parameters will
> +	 *   be used for all remaining segments:
> +	 *     - pool from the last valid element
> +	 *     - the buffer size from this pool
> +	 *     - zero offset
> +	 */
> +	struct rte_eth_rxseg *rx_seg;

"struct rte_eth_rxconf" is very commonly used, I think all applications does the 
'rte_eth_rx_queue_setup()', but "buffer split" is not a common usage,

I am against the "struct rte_eth_rxseg *rx_seg;" field creating this much noise 
in the "struct rte_eth_rxconf" documentation.
As mentioned before, can you please move the above detailed documentation to 
where "struct rte_eth_rxseg" defined, and in this struct put a single comment 
for "struct rte_eth_rxseg *rx_seg" ?
  
Andrew Rybchenko Oct. 16, 2020, 9:21 a.m. UTC | #5
On 10/16/20 12:19 PM, Ferruh Yigit wrote:
> On 10/16/2020 8:48 AM, Viacheslav Ovsiienko wrote:
[snip]
> 
> <...>
> 
>> +/**
>>    * A structure used to configure an RX ring of an Ethernet port.
>>    */
>>   struct rte_eth_rxconf {
>> @@ -977,6 +998,46 @@ struct rte_eth_rxconf {
>>       uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
>>       uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>>       uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>> +    uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
>> +    /**
>> +     * Points to the array of segment descriptions. Each array element
>> +     * describes the properties of each segment in the receiving
>> +     * buffer according to the feature-describing structure.
>> +     *
>> +     * The supported Rx segmentation capabilities are reported
>> +     * in the rte_eth_dev_info->rx_seg_capa field.
>> +     *
>> +     * If the RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag is set in the offloads
>> +     * field, the PMD will split the received packets into multiple
>> +     * segments according to the specification in the description array:
>> +     *
>> +     * - the first network buffer will be allocated from the memory pool
>> +     *   specified in the first array element, the second buffer from the
>> +     *   pool in the second element, and so on.
>> +     *
>> +     * - the offsets from the segment description elements specify
>> +     *   the data offset from the buffer beginning, except for the first
>> +     *   mbuf. For this one, RTE_PKTMBUF_HEADROOM is added to the offset.
>> +     *
>> +     * - the lengths in the elements define the maximal amount of data
>> +     *   received into each segment. Receiving starts with filling up the
>> +     *   first mbuf data buffer up to the specified length. If there is
>> +     *   data remaining (the packet is longer than the buffer in the first
>> +     *   mbuf), the following data will be pushed to the next segment
>> +     *   up to its own length, and so on.
>> +     *
>> +     * - if the length in a segment description element is zero,
>> +     *   the actual buffer size will be deduced from the corresponding
>> +     *   memory pool properties.
>> +     *
>> +     * - if there are not enough elements to describe the buffer for an
>> +     *   entire packet of maximal length, the following parameters will
>> +     *   be used for all remaining segments:
>> +     *     - pool from the last valid element
>> +     *     - the buffer size from this pool
>> +     *     - zero offset
>> +     */
>> +    struct rte_eth_rxseg *rx_seg;
> 
> "struct rte_eth_rxconf" is very commonly used, I think all applications
> does the 'rte_eth_rx_queue_setup()', but "buffer split" is not a common
> usage,
> 
> I am against the "struct rte_eth_rxseg *rx_seg;" field creating this
> much noise in the "struct rte_eth_rxconf" documentation.
> As mentioned before, can you please move the above detailed
> documentation to where "struct rte_eth_rxseg" is defined, and in this
> struct put a single comment for "struct rte_eth_rxseg *rx_seg"?

+1
  
Slava Ovsiienko Oct. 16, 2020, 9:22 a.m. UTC | #6
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Friday, October 16, 2020 12:19
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
> Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> stephen@networkplumber.org; olivier.matz@6wind.com;
> jerinjacobk@gmail.com; maxime.coquelin@redhat.com;
> david.marchand@redhat.com; arybchenko@solarflare.com
> Subject: Re: [PATCH v8 1/6] ethdev: introduce Rx buffer split
> 
> On 10/16/2020 8:48 AM, Viacheslav Ovsiienko wrote:
[snip]
> 
> <...>
> 
> > +/**
> >    * A structure used to configure an RX ring of an Ethernet port.
> >    */
> >   struct rte_eth_rxconf {
> > @@ -977,6 +998,46 @@ struct rte_eth_rxconf {
> >   	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
> >   	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >   	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > +	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +	/**
> > +	 * Points to the array of segment descriptions. Each array element
> > +	 * describes the properties of each segment in the receiving
> > +	 * buffer according to the feature-describing structure.
> > +	 *
> > +	 * The supported Rx segmentation capabilities are reported
> > +	 * in the rte_eth_dev_info->rx_seg_capa field.
> > +	 *
> > +	 * If the RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag is set in the offloads
> > +	 * field, the PMD will split the received packets into multiple
> > +	 * segments according to the specification in the description array:
> > +	 *
> > +	 * - the first network buffer will be allocated from the memory pool
> > +	 *   specified in the first array element, the second buffer from the
> > +	 *   pool in the second element, and so on.
> > +	 *
> > +	 * - the offsets from the segment description elements specify
> > +	 *   the data offset from the buffer beginning, except for the first
> > +	 *   mbuf. For this one, RTE_PKTMBUF_HEADROOM is added to the offset.
> > +	 *
> > +	 * - the lengths in the elements define the maximal amount of data
> > +	 *   received into each segment. Receiving starts with filling up the
> > +	 *   first mbuf data buffer up to the specified length. If there is
> > +	 *   data remaining (the packet is longer than the buffer in the first
> > +	 *   mbuf), the following data will be pushed to the next segment
> > +	 *   up to its own length, and so on.
> > +	 *
> > +	 * - if the length in a segment description element is zero,
> > +	 *   the actual buffer size will be deduced from the corresponding
> > +	 *   memory pool properties.
> > +	 *
> > +	 * - if there are not enough elements to describe the buffer for an
> > +	 *   entire packet of maximal length, the following parameters will
> > +	 *   be used for all remaining segments:
> > +	 *     - pool from the last valid element
> > +	 *     - the buffer size from this pool
> > +	 *     - zero offset
> > +	 */
> > +	struct rte_eth_rxseg *rx_seg;
> 
> "struct rte_eth_rxconf" is very commonly used, I think all applications does the
> 'rte_eth_rx_queue_setup()', but "buffer split" is not a common usage,
> 
> I am against the "struct rte_eth_rxseg *rx_seg;" field creating this much noise
> in the "struct rte_eth_rxconf" documentation.
> As mentioned before, can you please move the above detailed documentation
> to where "struct rte_eth_rxseg" is defined, and in this struct put a single
> comment for "struct rte_eth_rxseg *rx_seg"?

Sure, we had doubts ourselves about putting this wordy comment into the rxconf structure.
Now we can move the comment to the rte_eth_rxseg_split declaration.

With best regards, Slava
  
Andrew Rybchenko Oct. 16, 2020, 9:27 a.m. UTC | #7
On 10/16/20 12:15 PM, Slava Ovsiienko wrote:
> Hi, Andrew
> 
>> -----Original Message-----
>> From: Andrew Rybchenko <arybchenko@solarflare.com>
>> Sent: Friday, October 16, 2020 11:52
>> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
>> Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
>> stephen@networkplumber.org; ferruh.yigit@intel.com;
>> olivier.matz@6wind.com; jerinjacobk@gmail.com;
>> maxime.coquelin@redhat.com; david.marchand@redhat.com
>> Subject: Re: [PATCH v8 1/6] ethdev: introduce Rx buffer split
>>
>> On 10/16/20 10:48 AM, Viacheslav Ovsiienko wrote:
>>> The DPDK datapath in the transmit direction is very flexible.
>>> An application can build the multi-segment packet and manages almost
>>> all data aspects - the memory pools where segments are allocated from,
>>> the segment lengths, the memory attributes like external buffers,
>>> registered for DMA, etc.
>>>
> [snip]
>>> +struct rte_eth_rxseg {
>>> +	union {
>>
>> Why not just 'union rte_eth_rxseg' ?
>>
>>> +		/* The settings for buffer split offload. */
>>> +		struct rte_eth_rxseg_split split;
>>
>> Pointer to a split table must be here. I.e.
>> struct rte_eth_rxseg_split *split;
> OK, will try to simplify with that, thanks.
> 
>> Also it must be specified how the array is terminated.
>> We need either a number, or to define a last-item condition (mp == NULL?)
> 
> We have one, please see: "rte_eth_rxconf->rx_nseg"

A bit confusing to have it outside of the union, far from split,
but it may be acceptable for the experimental API...
If there are better ideas - welcome.

>>
>>> +		/* The other features settings should be added here. */
>>> +	} conf;
>>> +};
>>
>>
>>
>>> +
>>> +/**
>>>   * A structure used to configure an RX ring of an Ethernet port.
>>>   */
>>>  struct rte_eth_rxconf {
>>> @@ -977,6 +998,46 @@ struct rte_eth_rxconf {
>>>  	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
>>>  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>>> +	struct rte_eth_rxseg *rx_seg;
>>
>> It must not be a pointer. It looks really strange this way, taking into account
>> that it is a union in fact.
>> Also, why is it put here in the middle of the existing structure?
>> IMHO it should be added after offloads.
> OK, agree, will move.
> 
>>
>>>  	/**
>>>  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>>>  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
>>> @@ -1260,6 +1321,7 @@ struct rte_eth_conf {
>>>  #define DEV_RX_OFFLOAD_SCTP_CKSUM	0x00020000
>>>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>>>  #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
>>> +#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
>>>
>>>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>>>  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
>>> @@ -1376,6 +1438,17 @@ struct rte_eth_switch_info {
>>>  };
>>>
>>>  /**
>>> + * Ethernet device Rx buffer segmentation capabilities.
>>> + */
>>> +__extension__
>>> +struct rte_eth_rxseg_capa {
>>> +	uint16_t max_seg; /**< Maximum amount of segments to split. */
>>
>> Maybe 'max_segs', to avoid confusion vs the maximum segment length.
>>
> OK, max_nseg would be more appropriate.

Agreed.

> 
>>> +	uint16_t multi_pools:1; /**< Supports receiving to multiple pools.*/
>>> +	uint16_t offset_allowed:1; /**< Supports buffer offsets. */
>>> +	uint16_t offset_align_log2:4; /**< Required offset alignment. */
>>
>> 4 bits are even insufficient to specify cache-line alignment.
>> IMHO at least 8 bits are required.
> 
> 4 bits seem to be quite sufficient. It is a log2 value that tells how many LSBs in the offset should be zero.
> 2^15 is 32K, which covers all reasonable alignments for the uint16_t type.
> 
>>
>> Consider putting 32-bit-wide bit-fields at the start of the structure.
>> Then, max_segs (16), offset_align_log2 (8), plus reserved (8).
> OK, np.
> 
> With best regards, Slava
>
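To make the log2 encoding discussed above concrete, here is a minimal
sketch of the check it implies (it mirrors the validation added in the
patch at the end of this thread; seg_capa and offset are assumed to come
from the reported capabilities and the segment description):

	/* An offset is valid when its offset_align_log2
	 * least-significant bits are all zero. */
	uint32_t offset_mask = (1u << seg_capa->offset_align_log2) - 1;

	if (offset & offset_mask)
		return -EINVAL; /* misaligned segment offset */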
  
Slava Ovsiienko Oct. 16, 2020, 9:34 a.m. UTC | #8
> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Friday, October 16, 2020 12:28
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
> Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> stephen@networkplumber.org; ferruh.yigit@intel.com;
> olivier.matz@6wind.com; jerinjacobk@gmail.com;
> maxime.coquelin@redhat.com; david.marchand@redhat.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/6] ethdev: introduce Rx buffer split
> 
> On 10/16/20 12:15 PM, Slava Ovsiienko wrote:
> > Hi, Andrew
> >
> >> -----Original Message-----
> >> From: Andrew Rybchenko <arybchenko@solarflare.com>
> >> Sent: Friday, October 16, 2020 11:52
> >> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
> >> Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> >> stephen@networkplumber.org; ferruh.yigit@intel.com;
> >> olivier.matz@6wind.com; jerinjacobk@gmail.com;
> >> maxime.coquelin@redhat.com; david.marchand@redhat.com
> >> Subject: Re: [PATCH v8 1/6] ethdev: introduce Rx buffer split
> >>
> >> On 10/16/20 10:48 AM, Viacheslav Ovsiienko wrote:
> >>> The DPDK datapath in the transmit direction is very flexible.
> >>> An application can build the multi-segment packet and manages almost
> >>> all data aspects - the memory pools where segments are allocated
> >>> from, the segment lengths, the memory attributes like external
> >>> buffers, registered for DMA, etc.
> >>>
> > [snip]
> >>> +struct rte_eth_rxseg {
> >>> +	union {
> >>
> >> Why not just 'union rte_eth_rxseg' ?
> >>
> >>> +		/* The settings for buffer split offload. */
> >>> +		struct rte_eth_rxseg_split split;
> >>
> >> Pointer to a split table must be here. I.e.
> >> struct rte_eth_rxseg_split *split;
> > OK, will try to simplify with that, thanks.
> >
> >> Also it must be specified how the array is terminated.
> >> We need either a number, or to define a last-item condition (mp == NULL?)
> >
> > We have one, please see: "rte_eth_rxconf->rx_nseg"
> 
> A bit confusing to have it outside of union far from split but may be acceptable
> for the experimental API...
> If there are better ideas - welcome.

It can't be inside the union - it does not share the memory layout with the descriptors themselves.
It can't be inside the descriptor structure either - rx_nseg describes the size of the array, not a segment property.
It is a uint16_t and is placed before offloads to save some space, due to the alignment of the rte_eth_rxconf members.
Will mark rte_eth_rxseg_split as experimental, nice clue, thanks.

With best regards, Slava
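For reference, the placement argument can be illustrated with a condensed
sketch of the resulting structure (only the relevant members are shown,
the rest is elided; the padding note assumes a 64-bit build):

	struct rte_eth_rxconf {
		/* ... */
		uint16_t rx_free_thresh;
		uint8_t rx_drop_en;
		uint8_t rx_deferred_start;
		uint16_t rx_nseg; /* lands in what would otherwise be padding */
		struct rte_eth_rxseg *rx_seg; /* 8-byte aligned pointer */
		uint64_t offloads;
		/* ... */
	};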
  
Thomas Monjalon Oct. 16, 2020, 9:37 a.m. UTC | #9
16/10/2020 11:15, Slava Ovsiienko:
> From: Andrew Rybchenko <arybchenko@solarflare.com>
> > On 10/16/20 10:48 AM, Viacheslav Ovsiienko wrote:
> > >  /**
> > > + * Ethernet device Rx buffer segmentation capabilities.
> > > + */
> > > +__extension__
> > > +struct rte_eth_rxseg_capa {
> > > +	uint16_t max_seg; /**< Maximum amount of segments to split. */
> > 
> > Maybe 'max_segs', to avoid confusion vs the maximum segment length.
> > 
> OK, max_nseg would be more appropriate.
> 
> > > +	uint16_t multi_pools:1; /**< Supports receiving to multiple pools.*/
> > > +	uint16_t offset_allowed:1; /**< Supports buffer offsets. */
> > > +	uint16_t offset_align_log2:4; /**< Required offset alignment. */
> > 
> > 4 bits are even insufficient to specify cache-line alignment.
> > IMHO at least 8 bits are required.
> 
> 4 bits seem to be quite sufficient. It is a log2 value that tells how many LSBs in the offset should be zero.
> 2^15 is 32K, which covers all reasonable alignments for the uint16_t type.

Bit-fields with uint16_t are not standard, I think;
we could experience build issues.
If I remember well, uint32_t is safer.
  
Slava Ovsiienko Oct. 16, 2020, 9:38 a.m. UTC | #10
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, October 16, 2020 12:37
> To: Andrew Rybchenko <arybchenko@solarflare.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; stephen@networkplumber.org; ferruh.yigit@intel.com;
> olivier.matz@6wind.com; jerinjacobk@gmail.com;
> maxime.coquelin@redhat.com; david.marchand@redhat.com
> Subject: Re: [PATCH v8 1/6] ethdev: introduce Rx buffer split
> 
> 16/10/2020 11:15, Slava Ovsiienko:
> > From: Andrew Rybchenko <arybchenko@solarflare.com>
> > > On 10/16/20 10:48 AM, Viacheslav Ovsiienko wrote:
> > > >  /**
> > > > + * Ethernet device Rx buffer segmentation capabilities.
> > > > + */
> > > > +__extension__
> > > > +struct rte_eth_rxseg_capa {
> > > > +	uint16_t max_seg; /**< Maximum amount of segments to split. */
> > >
> > > Maybe 'max_segs', to avoid confusion vs the maximum segment length.
> > >
> > OK, max_nseg would be more appropriate.
> >
> > > > +	uint16_t multi_pools:1; /**< Supports receiving to multiple pools.*/
> > > > +	uint16_t offset_allowed:1; /**< Supports buffer offsets. */
> > > > +	uint16_t offset_align_log2:4; /**< Required offset alignment. */
> > >
> > > 4 bits are even insufficient to specify cache-line alignment.
> > > IMHO at least 8 bits are required.
> >
> > 4 bits seem to be quite sufficient. It is a log2 value that tells how many
> > LSBs in the offset should be zero.
> > 2^15 is 32K, which covers all reasonable alignments for the uint16_t type.
> 
> Bit-fields with uint16_t are not standard, I think; we could experience build issues.
> If I remember well, uint32_t is safer.
> 
Agreed - it is being replaced with uint32_t, and it embraces max_nseg as well.
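For illustration, a sketch of what the revised capability structure could
look like under this agreement (the exact field widths here are an
assumption, not the committed layout):

	__extension__
	struct rte_eth_rxseg_capa {
		uint32_t multi_pools:1; /**< Supports receiving to multiple pools. */
		uint32_t offset_allowed:1; /**< Supports buffer offsets. */
		uint32_t offset_align_log2:4; /**< Required offset alignment. */
		uint32_t max_nseg:16; /**< Maximum number of segments to split into. */
		uint32_t reserved:10; /**< Reserved field, must be zero. */
	};

Packing all the fields into a single uint32_t keeps the structure at four
bytes and avoids the non-standard uint16_t bit-fields noted above.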
  

Patch

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index dd8c955..832ea3b 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -185,6 +185,21 @@  Supports receiving segmented mbufs.
 * **[related]    eth_dev_ops**: ``rx_pkt_burst``.
 
 
+.. _nic_features_buffer_split:
+
+Buffer Split on Rx
+------------------
+
+Scatters the packets being received on specified boundaries to segmented mbufs.
+
+* **[uses]       rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
+* **[uses]       rte_eth_rxconf**: ``rx_conf.rx_seg, rx_conf.rx_nseg``.
+* **[implements] datapath**: ``Buffer Split functionality``.
+* **[provides]   rte_eth_dev_info**: ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
+* **[provides]   eth_dev_ops**: ``rxq_info_get:buffer_split``.
+* **[related] API**: ``rte_eth_rx_queue_setup()``.
+
+
 .. _nic_features_lro:
 
 LRO
diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 584e720..232cd54 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -138,11 +138,6 @@  Deprecation Notices
   In 19.11 PMDs will still update the field even when the offload is not
   enabled.
 
-* ethdev: Add new fields to ``rte_eth_rxconf`` to configure the receiving
-  queues to split ingress packets into multiple segments according to the
-  specified lengths into the buffers allocated from the specified
-  memory pools. The backward compatibility to existing API is preserved.
-
 * ethdev: ``rx_descriptor_done`` dev_ops and ``rte_eth_rx_descriptor_done``
   will be removed in 21.11.
   Existing ``rte_eth_rx_descriptor_status`` and ``rte_eth_tx_descriptor_status``
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index bcc0fc2..bcc2479 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -60,6 +60,12 @@  New Features
   Added the FEC API which provides functions for query FEC capabilities and
   current FEC mode from device. Also, API for configuring FEC mode is also provided.
 
+* **Introduced extended buffer description for receiving.**
+
+  Added the extended Rx buffer description for the Rx queue setup routine,
+  providing the individual settings for each Rx segment: the maximal size,
+  the buffer offset and the memory pool to allocate data buffers from.
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
@@ -253,6 +259,9 @@  API Changes
   As the data of ``uint8_t`` will be truncated when queue number under
   a TC is greater than 256.
 
+* ethdev: Added fields ``rx_seg`` and ``rx_nseg`` to the ``rte_eth_rxconf`` structure
+  to provide an extended description of the receiving buffer.
+
 * vhost: Moved vDPA APIs from experimental to stable.
 
 * rawdev: Added a structure size parameter to the functions
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 892c246..0eda3f4 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -105,6 +105,9 @@  struct rte_eth_xstats_name_off {
 #define RTE_RX_OFFLOAD_BIT2STR(_name)	\
 	{ DEV_RX_OFFLOAD_##_name, #_name }
 
+#define RTE_ETH_RX_OFFLOAD_BIT2STR(_name)	\
+	{ RTE_ETH_RX_OFFLOAD_##_name, #_name }
+
 static const struct {
 	uint64_t offload;
 	const char *name;
@@ -128,9 +131,11 @@  struct rte_eth_xstats_name_off {
 	RTE_RX_OFFLOAD_BIT2STR(SCTP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
+#undef RTE_ETH_RX_OFFLOAD_BIT2STR
 
 #define RTE_TX_OFFLOAD_BIT2STR(_name)	\
 	{ DEV_TX_OFFLOAD_##_name, #_name }
@@ -1763,6 +1768,77 @@  struct rte_eth_dev *
 	return ret;
 }
 
+static int
+rte_eth_rx_queue_check_split(const struct rte_eth_rxseg *rx_seg,
+			     uint16_t n_seg, uint32_t *mbp_buf_size,
+			     const struct rte_eth_dev_info *dev_info)
+{
+	const struct rte_eth_rxseg_capa *seg_capa = &dev_info->rx_seg_capa;
+	struct rte_mempool *mp_first;
+	uint32_t offset_mask;
+	uint16_t seg_idx;
+
+	if (n_seg > seg_capa->max_seg) {
+		RTE_ETHDEV_LOG(ERR,
+			       "Requested Rx segments %u exceed supported %u\n",
+			       n_seg, seg_capa->max_seg);
+		return -EINVAL;
+	}
+	/*
+	 * Check the sizes and offsets against buffer sizes
+	 * for each segment specified in extended configuration.
+	 */
+	mp_first = rx_seg[0].conf.split.mp;
+	offset_mask = (1u << seg_capa->offset_align_log2) - 1;
+	for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
+		struct rte_mempool *mpl = rx_seg[seg_idx].conf.split.mp;
+		uint32_t length = rx_seg[seg_idx].conf.split.length;
+		uint32_t offset = rx_seg[seg_idx].conf.split.offset;
+
+		if (mpl == NULL) {
+			RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
+			return -EINVAL;
+		}
+		if (seg_idx != 0 && mp_first != mpl &&
+		    seg_capa->multi_pools == 0) {
+			RTE_ETHDEV_LOG(ERR, "Receiving to multiple pools is not supported\n");
+			return -ENOTSUP;
+		}
+		if (offset != 0) {
+			if (seg_capa->offset_allowed == 0) {
+				RTE_ETHDEV_LOG(ERR, "Rx segmentation with offset is not supported\n");
+				return -ENOTSUP;
+			}
+			if (offset & offset_mask) {
+				RTE_ETHDEV_LOG(ERR, "Rx segmentat invalid offset alignment %u, %u\n",
+					       offset,
+					       seg_capa->offset_align_log2);
+				return -EINVAL;
+			}
+		}
+		if (mpl->private_data_size <
+			sizeof(struct rte_pktmbuf_pool_private)) {
+			RTE_ETHDEV_LOG(ERR,
+				       "%s private_data_size %u < %u\n",
+				       mpl->name, mpl->private_data_size,
+				       (unsigned int)sizeof
+					(struct rte_pktmbuf_pool_private));
+			return -ENOSPC;
+		}
+		offset += seg_idx != 0 ? 0 : RTE_PKTMBUF_HEADROOM;
+		*mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
+		length = length != 0 ? length : *mbp_buf_size;
+		if (*mbp_buf_size < length + offset) {
+			RTE_ETHDEV_LOG(ERR,
+				       "%s mbuf_data_room_size %u < %u (segment length=%u + segment offset=%u)\n",
+				       mpl->name, *mbp_buf_size,
+				       length + offset, length, offset);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
 int
 rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		       uint16_t nb_rx_desc, unsigned int socket_id,
@@ -1784,38 +1860,64 @@  struct rte_eth_dev *
 		return -EINVAL;
 	}
 
-	if (mp == NULL) {
-		RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
-		return -EINVAL;
-	}
-
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup, -ENOTSUP);
 
-	/*
-	 * Check the size of the mbuf data buffer.
-	 * This value must be provided in the private data of the memory pool.
-	 * First check that the memory pool has a valid private data.
-	 */
 	ret = rte_eth_dev_info_get(port_id, &dev_info);
 	if (ret != 0)
 		return ret;
 
-	if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
-		RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
-			mp->name, (int)mp->private_data_size,
-			(int)sizeof(struct rte_pktmbuf_pool_private));
-		return -ENOSPC;
-	}
-	mbp_buf_size = rte_pktmbuf_data_room_size(mp);
+	if (mp != NULL) {
+		/* Single pool configuration check. */
+		if (rx_conf->rx_nseg != 0) {
+			RTE_ETHDEV_LOG(ERR,
+				       "Ambiguous segment configuration\n");
+			return -EINVAL;
+		}
+		/*
+		 * Check the size of the mbuf data buffer: this value
+		 * must be provided in the private data of the memory pool.
+		 * First check that the memory pool has valid private data.
+		 */
+		if (mp->private_data_size <
+				sizeof(struct rte_pktmbuf_pool_private)) {
+			RTE_ETHDEV_LOG(ERR, "%s private_data_size %u < %u\n",
+				mp->name, mp->private_data_size,
+				(unsigned int)
+				sizeof(struct rte_pktmbuf_pool_private));
+			return -ENOSPC;
+		}
+		mbp_buf_size = rte_pktmbuf_data_room_size(mp);
+		if (mbp_buf_size < dev_info.min_rx_bufsize +
+				   RTE_PKTMBUF_HEADROOM) {
+			RTE_ETHDEV_LOG(ERR,
+				       "%s mbuf_data_room_size %u < %u (RTE_PKTMBUF_HEADROOM=%u + min_rx_bufsize(dev)=%u)\n",
+				       mp->name, mbp_buf_size,
+				       RTE_PKTMBUF_HEADROOM +
+				       dev_info.min_rx_bufsize,
+				       RTE_PKTMBUF_HEADROOM,
+				       dev_info.min_rx_bufsize);
+			return -EINVAL;
+		}
+	} else {
+		const struct rte_eth_rxseg *rx_seg = rx_conf->rx_seg;
+		uint16_t n_seg = rx_conf->rx_nseg;
 
-	if (mbp_buf_size < dev_info.min_rx_bufsize + RTE_PKTMBUF_HEADROOM) {
-		RTE_ETHDEV_LOG(ERR,
-			"%s mbuf_data_room_size %d < %d (RTE_PKTMBUF_HEADROOM=%d + min_rx_bufsize(dev)=%d)\n",
-			mp->name, (int)mbp_buf_size,
-			(int)(RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize),
-			(int)RTE_PKTMBUF_HEADROOM,
-			(int)dev_info.min_rx_bufsize);
-		return -EINVAL;
+		/* Extended multi-segment configuration check. */
+		if (rx_conf->rx_seg == NULL || rx_conf->rx_nseg == 0) {
+			RTE_ETHDEV_LOG(ERR,
+				       "Memory pool is null and no extended configuration provided\n");
+			return -EINVAL;
+		}
+		if (rx_conf->offloads & RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT) {
+			ret = rte_eth_rx_queue_check_split(rx_seg, n_seg,
+							   &mbp_buf_size,
+							   &dev_info);
+			if (ret != 0)
+				return ret;
+		} else {
+			RTE_ETHDEV_LOG(ERR, "No Rx segmentation offload configured\n");
+			return -EINVAL;
+		}
 	}
 
 	/* Use default specified by driver, if nb_rx_desc is zero */
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 5bcfbb8..6f18bec 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -970,6 +970,27 @@  struct rte_eth_txmode {
 };
 
 /**
+ * A structure used to configure an Rx packet segment to split.
+ */
+struct rte_eth_rxseg_split {
+	struct rte_mempool *mp; /**< Memory pool to allocate segment from. */
+	uint16_t length; /**< Segment data length, configures split point. */
+	uint16_t offset; /**< Data offset from beginning of mbuf data buffer. */
+	uint32_t reserved; /**< Reserved field. */
+};
+
+/**
+ * A common structure used to describe Rx packet segment properties.
+ */
+struct rte_eth_rxseg {
+	union {
+		/* The settings for buffer split offload. */
+		struct rte_eth_rxseg_split split;
+		/* The other features settings should be added here. */
+	} conf;
+};
+
+/**
  * A structure used to configure an RX ring of an Ethernet port.
  */
 struct rte_eth_rxconf {
@@ -977,6 +998,46 @@  struct rte_eth_rxconf {
 	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
+	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Points to the array of segment descriptions. Each array element
+	 * describes the properties of each segment in the receiving
+	 * buffer according to the feature-describing structure.
+	 *
+	 * The supported Rx segmentation capabilities are reported
+	 * in the rte_eth_dev_info->rx_seg_capa field.
+	 *
+	 * If the RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag is set in the offloads
+	 * field, the PMD will split the received packets into multiple
+	 * segments according to the specification in the description array:
+	 *
+	 * - the first network buffer will be allocated from the memory pool
+	 *   specified in the first array element, the second buffer from the
+	 *   pool in the second element, and so on.
+	 *
+	 * - the offsets from the segment description elements specify
+	 *   the data offset from the buffer beginning, except for the first
+	 *   mbuf. For this one, RTE_PKTMBUF_HEADROOM is added to the offset.
+	 *
+	 * - the lengths in the elements define the maximal amount of data
+	 *   received into each segment. Receiving starts with filling up the
+	 *   first mbuf data buffer up to the specified length. If there is
+	 *   data remaining (the packet is longer than the buffer in the first
+	 *   mbuf), the following data will be pushed to the next segment
+	 *   up to its own length, and so on.
+	 *
+	 * - if the length in a segment description element is zero,
+	 *   the actual buffer size will be deduced from the corresponding
+	 *   memory pool properties.
+	 *
+	 * - if there are not enough elements to describe the buffer for an
+	 *   entire packet of maximal length, the following parameters will
+	 *   be used for all remaining segments:
+	 *     - pool from the last valid element
+	 *     - the buffer size from this pool
+	 *     - zero offset
+	 */
+	struct rte_eth_rxseg *rx_seg;
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1260,6 +1321,7 @@  struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_SCTP_CKSUM	0x00020000
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
+#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
@@ -1376,6 +1438,17 @@  struct rte_eth_switch_info {
 };
 
 /**
+ * Ethernet device Rx buffer segmentation capabilities.
+ */
+__extension__
+struct rte_eth_rxseg_capa {
+	uint16_t max_seg; /**< Maximum number of segments to split into. */
+	uint16_t multi_pools:1; /**< Supports receiving to multiple pools.*/
+	uint16_t offset_allowed:1; /**< Supports buffer offsets. */
+	uint16_t offset_align_log2:4; /**< Required offset alignment. */
+};
+
+/**
  * Ethernet device information
  */
 
@@ -1403,6 +1476,7 @@  struct rte_eth_dev_info {
 	/** Maximum number of hash MAC addresses for MTA and UTA. */
 	uint16_t max_vfs; /**< Maximum number of VFs. */
 	uint16_t max_vmdq_pools; /**< Maximum number of VMDq pools. */
+	struct rte_eth_rxseg_capa rx_seg_capa; /**< Segmentation capability.*/
 	uint64_t rx_offload_capa;
 	/**< All RX offload capabilities including all per-queue ones */
 	uint64_t tx_offload_capa;
@@ -2027,9 +2101,21 @@  int rte_eth_dev_configure(uint16_t port_id, uint16_t nb_rx_queue,
  *   No need to repeat any bit in rx_conf->offloads which has already been
  *   enabled in rte_eth_dev_configure() at port level. An offloading enabled
  *   at port level can't be disabled at queue level.
+ *   The configuration structure also contains the pointer to the array
+ *   of the receiving buffer segment descriptions, see the rx_seg and
+ *   rx_nseg fields. This extended configuration might be used by split
+ *   offloads like RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT. If mb_pool is not
+ *   NULL, the extended configuration fields must be set to NULL and zero.
  * @param mb_pool
  *   The pointer to the memory pool from which to allocate *rte_mbuf* network
- *   memory buffers to populate each descriptor of the receive ring.
+ *   memory buffers to populate each descriptor of the receive ring. There are
+ *   two options to provide Rx buffer configuration:
+ *   - single pool:
+ *     mb_pool is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is 0.
+ *   - multiple segments description:
+ *     mb_pool is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not 0.
+ *     Taken only if flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is set in offloads.
+ *
  * @return
  *   - 0: Success, receive queue correctly set up.
  *   - -EIO: if device is removed.
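To tie the pieces together, here is a minimal usage sketch against the
API as defined in this patch; hdr_pool, data_pool, port_id, queue_id,
nb_desc and socket_id are assumptions set up elsewhere:

	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;
	struct rte_eth_rxseg rx_seg[2];
	int ret;

	/* Split point: the first 14 bytes (Ethernet header) go to mbufs
	 * from hdr_pool, the remaining data to buffers from data_pool. */
	memset(rx_seg, 0, sizeof(rx_seg));
	rx_seg[0].conf.split.mp = hdr_pool;
	rx_seg[0].conf.split.length = 14;
	rx_seg[1].conf.split.mp = data_pool;
	rx_seg[1].conf.split.length = 0; /* 0: deduce from data_pool buffer size */

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;
	rxconf = dev_info.default_rxconf;
	rxconf.rx_seg = rx_seg;
	rxconf.rx_nseg = RTE_DIM(rx_seg);
	rxconf.offloads |= RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT;
	/* DEV_RX_OFFLOAD_SCATTER must also be enabled, e.g. at port level. */

	/* mb_pool must be NULL when the extended description is used. */
	ret = rte_eth_rx_queue_setup(port_id, queue_id, nb_desc, socket_id,
				     &rxconf, NULL);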