[RFC] ethdev: introduce Rx buffer split

Message ID MWHPR12MB136076E652230CEBD6EE6562DF5F0@MWHPR12MB1360.namprd12.prod.outlook.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Series: [RFC] ethdev: introduce Rx buffer split

Checks

Context               Check   Description
ci/Intel-compilation  fail    apply issues

Commit Message

Slava Ovsiienko Aug. 17, 2020, 5:49 p.m. UTC
  From 7f7052d8b85ff3ff7011bd844b6d3169c6e51923 Mon Sep 17 00:00:00 2001
From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Date: Mon, 17 Aug 2020 16:57:43 +0000
Subject: [RFC] ethdev: introduce Rx buffer split

The DPDK datapath in the transmit direction is very flexible.
An application can build multi-segment packets and manage
almost all data aspects - the memory pools the segments are
allocated from, the segment lengths, the memory attributes
like external buffers registered for DMA, etc.

In the receiving direction, the datapath is much less flexible:
an application can only specify the memory pool to configure the
receiving queue and nothing more. In order to extend the receiving
datapath capabilities it is proposed to add a way to provide
extended information on how to split the packets being received.

The following structure is introduced to specify the Rx packet
segment:

struct rte_eth_rxseg {
    struct rte_mempool *mp; /* memory pool to allocate segment from */
    uint16_t length; /* maximal segment data length */
    uint16_t offset; /* data offset from beginning of mbuf data buffer */
    uint32_t reserved; /* reserved field */
};

The new routine rte_eth_rx_queue_setup_ex() is introduced to
set up the given Rx queue using the new extended Rx packet segment
description:

int
rte_eth_rx_queue_setup_ex(uint16_t port_id, uint16_t rx_queue_id,
                          uint16_t nb_rx_desc, unsigned int socket_id,
                          const struct rte_eth_rxconf *rx_conf,
                          const struct rte_eth_rxseg *rx_seg,
                          uint16_t n_seg)

This routine adds two new parameters:
    rx_seg - pointer to the array of segment descriptions; each element
             describes the memory pool, maximal data length, and initial
             data offset from the beginning of the data buffer in the mbuf
    n_seg - number of elements in the array

The new offload flag DEV_RX_OFFLOAD_BUFFER_SPLIT in the device
capabilities is introduced to provide a way for the PMD to report
to the application that it supports splitting received packets into
configurable segments. Before invoking the rte_eth_rx_queue_setup_ex()
routine the application should check the DEV_RX_OFFLOAD_BUFFER_SPLIT flag.
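
For illustration, a minimal sketch of such a check (the helper name is
arbitrary; DEV_RX_OFFLOAD_BUFFER_SPLIT is the flag proposed by this RFC
and is assumed to be reported via dev_info.rx_offload_capa):

#include <errno.h>
#include <rte_ethdev.h>

static int
check_buffer_split_support(uint16_t port_id)
{
    struct rte_eth_dev_info dev_info;
    int ret;

    ret = rte_eth_dev_info_get(port_id, &dev_info);
    if (ret != 0)
        return ret;
    /* The PMD must advertise the proposed buffer split capability. */
    if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_BUFFER_SPLIT) == 0)
        return -ENOTSUP;
    return 0;
}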

If the Rx queue is configured with the new routine, the packets being
received will be split into multiple segments pushed to mbufs with
the specified attributes. The PMD will allocate the first mbuf
from the pool specified in the first segment descriptor and put
the data starting at the specified offset in the allocated mbuf data
buffer. If the packet length exceeds the specified segment length,
the next mbuf will be allocated according to the next segment
descriptor (if any) and data will be put in its data buffer at the
specified offset, not exceeding the specified length. If there is
no next descriptor, the next mbuf will be allocated and filled in the
same way (from the same pool and with the same buffer offset/length)
as the current one.

For example, let's suppose we configured the Rx queue with the
following segments:
    seg0 - pool0, len0=14B, off0=RTE_PKTMBUF_HEADROOM
    seg1 - pool1, len1=20B, off1=0B
    seg2 - pool2, len2=20B, off2=0B
    seg3 - pool3, len3=512B, off3=0B

A packet 46 bytes long will look like the following:
    seg0 - 14B long @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
    seg1 - 20B long @ 0 in mbuf from pool1
    seg2 - 12B long @ 0 in mbuf from pool2

A packet 1500 bytes long will look like the following:
    seg0 - 14B @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
    seg1 - 20B @ 0 in mbuf from pool1
    seg2 - 20B @ 0 in mbuf from pool2
    seg3 - 512B @ 0 in mbuf from pool3
    seg4 - 512B @ 0 in mbuf from pool3
    seg5 - 422B @ 0 in mbuf from pool3

The DEV_RX_OFFLOAD_SCATTER offload must be supported and
configured to enable the new buffer split feature (if n_seg
is greater than one).
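
For illustration, a minimal sketch of how an application might express the
example configuration above with the proposed API (pool0..pool3, port_id,
nb_rx_desc and dev_info are assumed to be obtained elsewhere; error handling
is omitted):

struct rte_eth_rxseg rx_seg[] = {
    { .mp = pool0, .length = 14,  .offset = RTE_PKTMBUF_HEADROOM },
    { .mp = pool1, .length = 20,  .offset = 0 },
    { .mp = pool2, .length = 20,  .offset = 0 },
    { .mp = pool3, .length = 512, .offset = 0 },
};
struct rte_eth_rxconf rxconf = dev_info.default_rxconf;
int ret;

/* More than one segment also requires scattered Rx. */
rxconf.offloads |= DEV_RX_OFFLOAD_SCATTER | DEV_RX_OFFLOAD_BUFFER_SPLIT;

ret = rte_eth_rx_queue_setup_ex(port_id, 0 /* queue id */, nb_rx_desc,
                                rte_socket_id(), &rxconf,
                                rx_seg, RTE_DIM(rx_seg));

Depending on the PMD, the offloads might instead be enabled port-wide
in rte_eth_dev_configure(); the per-queue form above is just one option.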

The new approach would allow splitting ingress packets into
multiple parts pushed to memory with different attributes.
For example, the packet headers can be pushed to the embedded
data buffers within mbufs and the application data into
external buffers attached to mbufs allocated from different
memory pools. The memory attributes for the split parts may
differ as well - for example, the application data may be
pushed into external memory located on a dedicated physical
device, say a GPU or NVMe. This would improve the flexibility
of the DPDK receiving datapath while preserving compatibility
with the existing API.

Also, the proposed segment description might be used to specify
Rx packet split for some other features. For example, it could
provide a way to specify the extra memory pool for the Header
Split feature of some Intel PMDs.

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 lib/librte_ethdev/rte_ethdev.c      | 166 ++++++++++++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h      |  15 ++++
 lib/librte_ethdev/rte_ethdev_core.h |  10 +++
 3 files changed, 191 insertions(+)
  

Comments

Andrew Rybchenko Sept. 17, 2020, 4:55 p.m. UTC | #1
On 8/17/20 8:49 PM, Slava Ovsiienko wrote:
> From 7f7052d8b85ff3ff7011bd844b6d3169c6e51923 Mon Sep 17 00:00:00 2001
> From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> Date: Mon, 17 Aug 2020 16:57:43 +0000
> Subject: [RFC] ethdev: introduce Rx buffer split
> 
> The DPDK datapath in the transmit direction is very flexible.
> An application can build the multisegment packet and manages
> almost all data aspects - the memory pools where segments
> are allocated from, the segment lengths, the memory attributes
> like external buffers, registered for DMA, etc.
> 
> In the receiving direction, the datapath is much less flexible,
> an application can only specify the memory pool to configure the
> receiving queue and nothing more. In order to extend receiving
> datapath capabilities it is proposed to add the way to provide
> extended infoirmation how to split the packets being received.

infoirmation -> information

> 
> The following structure is introduced to specify the Rx packet
> segment:
> 
> struct rte_eth_rxseg {
>      struct rte_mempool *mp; /* memory pools to allocate segment from */
>      uint16_t length; /* segment maximal data length */
>      uint16_t offset; /* data offset from beginning of mbuf data buffer */
>      uint32_t reserved; /* reserved field */
> };
> 
> The new routine rte_eth_rx_queue_setup_ex() is introduced to
> setup the given Rx queue using the new extended Rx packet segment
> description:
> 
> int
> rte_eth_rx_queue_setup_ex(uint16_t port_id, uint16_t rx_queue_id,
>                            uint16_t nb_rx_desc, unsigned int socket_id,
>                            const struct rte_eth_rxconf *rx_conf,
> 		          const struct rte_eth_rxseg *rx_seg,
>                            uint16_t n_seg)
> 
> This routine presents the two new parameters:
>      rx_seg - pointer the array of segment descriptions, each element
>               describes the memory pool, maximal data length, initial
>               data offset from the beginning of data buffer in mbuf
>      n_seg - number of elements in the array
> 
> The new offload flag DEV_RX_OFFLOAD_BUFFER_SPLIT in device
> capabilities is introduced to present the way for PMD to report to
> application about supporting Rx packet split to configurable
> segments. Prior invoking the rte_eth_rx_queue_setup_ex() routine
> application should check DEV_RX_OFFLOAD_BUFFER_SPLIT flag.
> 
> If the Rx queue is configured with new routine the packets being
> received will be split into multiple segments pushed to the mbufs
> with specified attributes. The PMD will allocate the first mbuf
> from the pool specified in the first segment descriptor and puts
> the data staring at specified offeset in the allocated mbuf data

offeset -> offset

> buffer. If packet length exceeds the specified segment length
> the next mbuf will be allocated according to the next segment
> descriptor (if any) and data will be put in its data buffer at
> specified offset and not exceeding specified length. If there is
> no next descriptor the next mbuf will be allocated and filled in the
> same way (from the same pool and with the same buffer offset/length)
> as the current one.
> 
> For example, let's suppose we configured the Rx queue with the
> following segments:
>      seg0 - pool0, len0=14B, off0=RTE_PKTMBUF_HEADROOM
>      seg1 - pool1, len1=20B, off1=0B
>      seg2 - pool2, len2=20B, off2=0B
>      seg3 - pool3, len3=512B, off3=0B
> 
> The packet 46 bytes long will look like the following:
>      seg0 - 14B long @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
>      seg1 - 20B long @ 0 in mbuf from pool1
>      seg2 - 12B long @ 0 in mbuf from pool2
> 
> The packet 1500 bytes long will look like the following:
>      seg0 - 14B @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
>      seg1 - 20B @ 0 in mbuf from pool1
>      seg2 - 20B @ 0 in mbuf from pool2
>      seg3 - 512B @ 0 in mbuf from pool3
>      seg4 - 512B @ 0 in mbuf from pool3
>      seg5 - 422B @ 0 in mbuf from pool3

The behaviour is logical, but what to do if the HW can't do it,
i.e. use the last segment many times? Should it reject the
configuration if the provided segments are insufficient to fit an
MTU-sized packet? How to report the limitation?
(I'm still trying to argue that SCATTER and BUFFER_SPLIT
should be independent.)

> 
> The offload DEV_RX_OFFLOAD_SCATTER must be present and
> configured to support new buffer spllit feature (if n_seg

spllit -> split

> is greater than one).
> 
> The new approach would allow splitting the ingress packets into
> multiple parts pushed to the memory with different attributes.
> For example, the packet headers can be pushed to the embedded
> data buffers within mbufs and the application data into
> the external buffers attached to mbufs allocated from the
> different memory pools. The memory attributes for the split
> parts may differ either - for example the application data
> may be pushed into the external memory located on the dedicated
> physical device, say GPU or NVMe. This would improve the DPDK
> receiving datapath flexibility with preserving compatibility
> with existing API.
> 
> Also, the proposed segment description might be used to specify
> Rx packet split for some other features. For example, provide
> the way to specify the extra memory pool for the Header Split
> feature of some Intel PMD.
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> ---
>   lib/librte_ethdev/rte_ethdev.c      | 166 ++++++++++++++++++++++++++++++++++++
>   lib/librte_ethdev/rte_ethdev.h      |  15 ++++
>   lib/librte_ethdev/rte_ethdev_core.h |  10 +++
>   3 files changed, 191 insertions(+)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 7858ad5..638e42d 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -1933,6 +1933,172 @@ struct rte_eth_dev *
>   }
>   
>   int
> +rte_eth_rx_queue_setup_ex(uint16_t port_id, uint16_t rx_queue_id,
> +			  uint16_t nb_rx_desc, unsigned int socket_id,
> +			  const struct rte_eth_rxconf *rx_conf,
> +			  const struct rte_eth_rxseg *rx_seg, uint16_t n_seg)
> +{
> +	int ret;
> +	uint16_t seg_idx;
> +	uint32_t mbp_buf_size;
> +	struct rte_eth_dev *dev;
> +	struct rte_eth_dev_info dev_info;
> +	struct rte_eth_rxconf local_conf;
> +	void **rxq;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +
> +	dev = &rte_eth_devices[port_id];
> +	if (rx_queue_id >= dev->data->nb_rx_queues) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", rx_queue_id);
> +		return -EINVAL;
> +	}
> +
> +	if (rx_seg == NULL) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid null description pointer\n");
> +		return -EINVAL;
> +	}
> +
> +	if (n_seg == 0) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid zero description number\n");
> +		return -EINVAL;
> +	}
> +
> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup_ex, -ENOTSUP);
> +
> +	/*
> +	 * Check the size of the mbuf data buffer.
> +	 * This value must be provided in the private data of the memory pool.
> +	 * First check that the memory pool has a valid private data.
> +	 */
> +	ret = rte_eth_dev_info_get(port_id, &dev_info);
> +	if (ret != 0)
> +		return ret;
> +
> +	for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
> +		struct rte_mempool *mp = rx_seg[seg_idx].mp;
> +
> +		if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
> +			RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> +				mp->name, (int)mp->private_data_size,
> +				(int)sizeof(struct rte_pktmbuf_pool_private));
> +			return -ENOSPC;
> +		}
> +
> +		mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> +		if (mbp_buf_size < rx_seg[seg_idx].length + rx_seg[seg_idx].offset) {
> +			RTE_ETHDEV_LOG(ERR,
> +				"%s mbuf_data_room_size %d < %d"
> +				" (segment length=%d + segment offset=%d)\n",
> +				mp->name, (int)mbp_buf_size,
> +				(int)(rx_seg[seg_idx].length + rx_seg[seg_idx].offset),
> +				(int)rx_seg[seg_idx].length,
> +				(int)rx_seg[seg_idx].offset);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	/* Use default specified by driver, if nb_rx_desc is zero */
> +	if (nb_rx_desc == 0) {
> +		nb_rx_desc = dev_info.default_rxportconf.ring_size;
> +		/* If driver default is also zero, fall back on EAL default */
> +		if (nb_rx_desc == 0)
> +			nb_rx_desc = RTE_ETH_DEV_FALLBACK_RX_RINGSIZE;
> +	}
> +
> +	if (nb_rx_desc > dev_info.rx_desc_lim.nb_max ||
> +			nb_rx_desc < dev_info.rx_desc_lim.nb_min ||
> +			nb_rx_desc % dev_info.rx_desc_lim.nb_align != 0) {
> +
> +		RTE_ETHDEV_LOG(ERR,
> +			"Invalid value for nb_rx_desc(=%hu), should be: <= %hu, >= %hu, and a product of %hu\n",
> +			nb_rx_desc, dev_info.rx_desc_lim.nb_max,
> +			dev_info.rx_desc_lim.nb_min,
> +			dev_info.rx_desc_lim.nb_align);
> +		return -EINVAL;
> +	}
> +
> +	if (dev->data->dev_started &&
> +		!(dev_info.dev_capa &
> +			RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP))
> +		return -EBUSY;
> +
> +	if (dev->data->dev_started &&
> +		(dev->data->rx_queue_state[rx_queue_id] !=
> +			RTE_ETH_QUEUE_STATE_STOPPED))
> +		return -EBUSY;
> +
> +	rxq = dev->data->rx_queues;
> +	if (rxq[rx_queue_id]) {
> +		RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_release,
> +					-ENOTSUP);
> +		(*dev->dev_ops->rx_queue_release)(rxq[rx_queue_id]);
> +		rxq[rx_queue_id] = NULL;
> +	}
> +
> +	if (rx_conf == NULL)
> +		rx_conf = &dev_info.default_rxconf;
> +
> +	local_conf = *rx_conf;
> +
> +	/*
> +	 * If an offloading has already been enabled in
> +	 * rte_eth_dev_configure(), it has been enabled on all queues,
> +	 * so there is no need to enable it in this queue again.
> +	 * The local_conf.offloads input to underlying PMD only carries
> +	 * those offloadings which are only enabled on this queue and
> +	 * not enabled on all queues.
> +	 */
> +	local_conf.offloads &= ~dev->data->dev_conf.rxmode.offloads;
> +
> +	/*
> +	 * New added offloadings for this queue are those not enabled in
> +	 * rte_eth_dev_configure() and they must be per-queue type.
> +	 * A pure per-port offloading can't be enabled on a queue while
> +	 * disabled on another queue. A pure per-port offloading can't
> +	 * be enabled for any queue as new added one if it hasn't been
> +	 * enabled in rte_eth_dev_configure().
> +	 */
> +	if ((local_conf.offloads & dev_info.rx_queue_offload_capa) !=
> +	     local_conf.offloads) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Ethdev port_id=%d rx_queue_id=%d, new added offloads 0x%"PRIx64" must be "
> +			"within per-queue offload capabilities 0x%"PRIx64" in %s()\n",
> +			port_id, rx_queue_id, local_conf.offloads,
> +			dev_info.rx_queue_offload_capa,
> +			__func__);
> +		return -EINVAL;
> +	}
> +
> +	/*
> +	 * If LRO is enabled, check that the maximum aggregated packet
> +	 * size is supported by the configured device.
> +	 */
> +	if (local_conf.offloads & DEV_RX_OFFLOAD_TCP_LRO) {
> +		if (dev->data->dev_conf.rxmode.max_lro_pkt_size == 0)
> +			dev->data->dev_conf.rxmode.max_lro_pkt_size =
> +				dev->data->dev_conf.rxmode.max_rx_pkt_len;
> +		int ret = check_lro_pkt_size(port_id,
> +				dev->data->dev_conf.rxmode.max_lro_pkt_size,
> +				dev->data->dev_conf.rxmode.max_rx_pkt_len,
> +				dev_info.max_lro_pkt_size);
> +		if (ret != 0)
> +			return ret;
> +	}
> +
> +	ret = (*dev->dev_ops->rx_queue_setup_ex)(dev, rx_queue_id, nb_rx_desc,
> +						 socket_id, &local_conf,
> +						 rx_seg, n_seg);
> +	if (!ret) {
> +		if (!dev->data->min_rx_buf_size ||
> +		    dev->data->min_rx_buf_size > mbp_buf_size)
> +			dev->data->min_rx_buf_size = mbp_buf_size;
> +	}
> +
> +	return eth_err(port_id, ret);
> +}
> +
> +int
>   rte_eth_rx_hairpin_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
>   			       uint16_t nb_rx_desc,
>   			       const struct rte_eth_hairpin_conf *conf)

I dislike the idea of introducing a new device operation.
rte_eth_rxconf has reserved space, and the BUFFER_SPLIT offload would
mean that the PMD looks at the split configuration located there.

The above duplication is simply not acceptable, but even if we
factor the shared code out into a helper function, I still see no
point in introducing a new device operation.

[snip]
  
Slava Ovsiienko Oct. 1, 2020, 8:54 a.m. UTC | #2
Hi, Andrew

Thank you for the comments, please see my replies below.

> -----Original Message-----
> From: Andrew Rybchenko <arybchenko@solarflare.com>
> Sent: Thursday, September 17, 2020 19:55
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
> Cc: Thomas Monjalon <thomasm@mellanox.com>;
> stephen@networkplumber.org; ferruh.yigit@intel.com; Shahaf Shuler
> <shahafs@nvidia.com>; olivier.matz@6wind.com; jerinjacobk@gmail.com;
> maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf Penso
> <asafp@nvidia.com>
> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
> 
[snip]
> >
> > For example, let's suppose we configured the Rx queue with the
> > following segments:
> >      seg0 - pool0, len0=14B, off0=RTE_PKTMBUF_HEADROOM
> >      seg1 - pool1, len1=20B, off1=0B
> >      seg2 - pool2, len2=20B, off2=0B
> >      seg3 - pool3, len3=512B, off3=0B
> >
> > The packet 46 bytes long will look like the following:
> >      seg0 - 14B long @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
> >      seg1 - 20B long @ 0 in mbuf from pool1
> >      seg2 - 12B long @ 0 in mbuf from pool2
> >
> > The packet 1500 bytes long will look like the following:
> >      seg0 - 14B @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
> >      seg1 - 20B @ 0 in mbuf from pool1
> >      seg2 - 20B @ 0 in mbuf from pool2
> >      seg3 - 512B @ 0 in mbuf from pool3
> >      seg4 - 512B @ 0 in mbuf from pool3
> >      seg5 - 422B @ 0 in mbuf from pool3
> 
> The behaviour is logical, but what to do if HW can't do it, i.e. use the last
> segment many times. Should it reject configuration if provided segments are
> insufficient to fit MTU packet? How to report the limitation?
> (I'm still trying to convince that SCATTER and BUFFER_SPLIT should be
> independent).

BUFFER_SPLIT is rather a way to tune SCATTER. Currently scattering
happens on unconditional mbuf data buffer boundaries (we have reserved
HEADROOM space in the first mbuf and fill this one to the buffer end,
the next mbuf buffers might be filled completely). BUFFER_SPLIT provides
a way to specify the desired points to split the packet, not just blindly
follow buffer boundaries. There is a check implemented in the common part
that each split segment fits the mbuf allocated from the appropriate pool.
The PMD should do an extra check internally whether it supports the requested
split settings; if not, the call will be rejected.

[snip]
> 
> I dislike the idea to introduce new device operation.
> rte_eth_rxconf has reserved space and BUFFER_SPLIT offload will mean that
> PMD looks at the split configuration location there.
> 
We considered the approach of pushing the split settings to the rxconf structure.
[http://patches.dpdk.org/patch/75205/]
But it seems there are some issues:

- the split configuration description requires a variable length array (due
  to variations in the number of segments), so the rte_eth_rxconf structure
  would have a variable length (not nice, IMO).

  We could push a pointer to the array of rte_eth_rxseg, but we would lose
  the single structure (and contiguous memory) simplicity; this approach has
  no advantages over specifying the split configuration as parameters
  of setup_ex().

- it would introduce ambiguity: rte_eth_rx_queue_setup() specifies the single
  mbuf pool as a parameter. What should we do with it? Set it to NULL?
  Treat it as the first pool? I would prefer to specify all split segments in
  a uniform fashion, i.e. as an array of rte_eth_rxseg structures (and it can
  be easily updated with some extra segment attributes if needed). So, in my
  opinion, we should remove/replace the pool parameter in rx_queue_setup
  (by introducing a new func).

- specifying the new extended setup routine has the advantage that we should
  not update any PMD code in the existing implementations of
  rte_eth_rx_queue_setup().

  If a PMD supports BUFFER_SPLIT (or another related feature) it just should
  provide rte_eth_rx_queue_setup_ex() and check the DEV_RX_OFFLOAD_BUFFER_SPLIT
  (or HEADER_SPLIT, or whatever feature) it supports. The common code does
  not check the feature flags - that is on the PMDs' own. In order to configure
  a PMD to perform the arbitrary desired Rx splitting the application should
  check DEV_RX_OFFLOAD_BUFFER_SPLIT in the port capabilities; if found, set
  DEV_RX_OFFLOAD_BUFFER_SPLIT in the configuration and call
  rte_eth_rx_queue_setup_ex(). And this approach can be followed for any other
  split-related feature.

With best regards, Slava
  
Andrew Rybchenko Oct. 12, 2020, 8:45 a.m. UTC | #3
Hi Slava,

I'm sorry for late reply. See my notes below.

On 10/1/20 11:54 AM, Slava Ovsiienko wrote:
> Hi, Andrew
>
> Thank you for the comments, please see my replies below.
>
>> -----Original Message-----
>> From: Andrew Rybchenko <arybchenko@solarflare.com>
>> Sent: Thursday, September 17, 2020 19:55
>> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
>> Cc: Thomas Monjalon <thomasm@mellanox.com>;
>> stephen@networkplumber.org; ferruh.yigit@intel.com; Shahaf Shuler
>> <shahafs@nvidia.com>; olivier.matz@6wind.com; jerinjacobk@gmail.com;
>> maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf Penso
>> <asafp@nvidia.com>
>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
>>
> [snip]
>>>
>>> For example, let's suppose we configured the Rx queue with the
>>> following segments:
>>> seg0 - pool0, len0=14B, off0=RTE_PKTMBUF_HEADROOM
>>> seg1 - pool1, len1=20B, off1=0B
>>> seg2 - pool2, len2=20B, off2=0B
>>> seg3 - pool3, len3=512B, off3=0B
>>>
>>> The packet 46 bytes long will look like the following:
>>> seg0 - 14B long @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
>>> seg1 - 20B long @ 0 in mbuf from pool1
>>> seg2 - 12B long @ 0 in mbuf from pool2
>>>
>>> The packet 1500 bytes long will look like the following:
>>> seg0 - 14B @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
>>> seg1 - 20B @ 0 in mbuf from pool1
>>> seg2 - 20B @ 0 in mbuf from pool2
>>> seg3 - 512B @ 0 in mbuf from pool3
>>> seg4 - 512B @ 0 in mbuf from pool3
>>> seg5 - 422B @ 0 in mbuf from pool3
>>
>> The behaviour is logical, but what to do if HW can't do it, i.e. use
>> the last
>> segment many times. Should it reject configuration if provided
>> segments are
>> insufficient to fit MTU packet? How to report the limitation?
>> (I'm still trying to convince that SCATTER and BUFFER_SPLIT should be
>> independent).
>
> BUFFER_SPLIT is rather the way to tune SCATTER. Currently scattering
> happens on unconditional mbuf data buffer boundaries (we have reserved
> HEAD space in the first mbuf and fill this one to the buffer end,
> the next mbuf buffers might be filled completely). BUFFER_SPLIT provides
> the way to specify the desired points to split packet, not just blindly
> follow buffer boundaries. There is the check inplemented in common part
> if each split segment fits the mbuf allocated from appropriate pool.
> PMD should do extra check internally whether it supports the requested
> split settings, if not - call will be rejected.
>

@Thomas, @Ferruh: I'd like to hear what other ethdev
maintainers think about it.

> [snip]
>>
>> I dislike the idea to introduce new device operation.
>> rte_eth_rxconf has reserved space and BUFFER_SPLIT offload will mean that
>> PMD looks at the split configuration location there.
>>
> We considered the approach of pushing split setting to the rxconf
> structure.
> [http://patches.dpdk.org/patch/75205/]
> But it seems there are some issues:
>
> - the split configuration description requires the variable length
> array (due
> to variations in number of segments), so rte_eth_rxconf structure would
> have the variable length (not nice, IMO).
>
> We could push pointers to the array of rte_eth_rxseg, but we would lost
> the single structure (and contiguous memory) simplicity, this approach has
> no advantages over the specifying the split configuration as parameters
> of setup_ex().
>

I think avoiding an extra device operation is a huge advantage.

> - it would introduces the ambiguity, rte_eth_rx_queue_setup()
> specifies the single
> mbuf pool as parameter. What should we do with it? Set to NULL? Treat
> as the first pool? I would prefer to specify all split segments in
> uniform fashion, i.e. as array or rte_eth_rxseg structures (and it can be
> easily updated with some extra segment attributes if needed). So, in my
> opinion, we should remove/replace the pool parameter in rx_queue_setup
> (by introducing new func).
>

I'm trying to resolve the ambiguity as described above
(see BUFFER_SPLIT vs SCATTER): use the mempool pointer for
tail segments, with respect to the SCATTER capability.

> - specifying the new extended setup roiutine has an advantage that we
> should
> not update any PMDs code in part of existing implementations of
> rte_eth_rx_queue_setup().

It is not required, since it is controlled by the new offload
flags. If the offload is not supported, the new field is
invisible to the PMD (it simply ignores it).

>
> If PMD supports BUFFER_SPLIT (or other related feature) it just should
> provide
> rte_eth_rx_queue_setup_ex() and check the DEV_RX_OFFLOAD_BUFFER_SPLIT
> (or HEADER_SPLIT, or ever feature) it supports. The common code does
> not check the feature flags - it is on PMDs' own. In order to
> configure PMD
> to perfrom arbitrary desired Rx spliting the application should check
> DEV_RX_OFFLOAD_BUFFER_SPLIT in port capabilites, if found - set
> DEV_RX_OFFLOAD_BUFFER_SPLIT in configuration and call
> rte_eth_rx_queue_setup_ex().
> And this approach can be followed for any other split related feature.
>
> With best regards, Slava
>
  
Slava Ovsiienko Oct. 12, 2020, 9:56 a.m. UTC | #4
Hi, Andrew

Thank you for the comments.

We have two approaches for specifying multiple segments to split Rx packets:
1. update the queue configuration structure
2. introduce a new rx_queue_setup_ex() routine with extra parameters.

For [1] my only actual dislike is that we would have multiple places to specify
the pool - in rx_queue_setup() and in the config structure. So, we should
implement some checking: if the offload flag is set, we should check that the
mp parameter is NULL and the segment description array pointer/size is
provided; if the offload flag is not set, we must check that the description
array is empty.
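
Roughly, such a check might look like this (a sketch only; the rx_seg/rx_nseg
fields in rte_eth_rxconf are hypothetical here, since that variant is not what
this RFC implements):

if (rx_conf->offloads & DEV_RX_OFFLOAD_BUFFER_SPLIT) {
    /* Split requested: pools must come from the description array only. */
    if (mp != NULL || rx_conf->rx_seg == NULL || rx_conf->rx_nseg == 0)
        return -EINVAL;
} else {
    /* No split: legacy single-pool configuration, array must be empty. */
    if (mp == NULL || rx_conf->rx_seg != NULL || rx_conf->rx_nseg != 0)
        return -EINVAL;
}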

> @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers think
> about it.

Yes, it would be very nice to hear extra opinions. Do we think that providing
an extra API function is worse than extending the existing structure,
introducing some conditional ambiguity and complicating the parameter
compliance check?

Now I'm updating the existing version of the patch based on rx_queue_setup_ex()
and then could prepare a version for the configuration structure;
it is not a problem - the approaches are very similar, we just should choose
the most relevant one.

With best regards, Slava

> -----Original Message-----
> From: Andrew Rybchenko <Andrew.Rybchenko@oktetlabs.ru>
> Sent: Monday, October 12, 2020 11:45
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
> Cc: Thomas Monjalon <thomasm@mellanox.com>;
> stephen@networkplumber.org; ferruh.yigit@intel.com; Shahaf Shuler
> <shahafs@nvidia.com>; olivier.matz@6wind.com; jerinjacobk@gmail.com;
> maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf Penso
> <asafp@nvidia.com>
> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
> 
> Hi Slava,
> 
> I'm sorry for late reply. See my notes below.
> 
> On 10/1/20 11:54 AM, Slava Ovsiienko wrote:
> > Hi, Andrew
> >
> > Thank you for the comments, please see my replies below.
> >
> >> -----Original Message-----
> >> From: Andrew Rybchenko <arybchenko@solarflare.com>
> >> Sent: Thursday, September 17, 2020 19:55
> >> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
> >> Cc: Thomas Monjalon <thomasm@mellanox.com>;
> >> stephen@networkplumber.org; ferruh.yigit@intel.com; Shahaf Shuler
> >> <shahafs@nvidia.com>; olivier.matz@6wind.com;
> jerinjacobk@gmail.com;
> >> maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf
> Penso
> >> <asafp@nvidia.com>
> >> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
> >>
> > [snip]
> >>>
> >>> For example, let's suppose we configured the Rx queue with the
> >>> following segments:
> >>> seg0 - pool0, len0=14B, off0=RTE_PKTMBUF_HEADROOM
> >>> seg1 - pool1, len1=20B, off1=0B
> >>> seg2 - pool2, len2=20B, off2=0B
> >>> seg3 - pool3, len3=512B, off3=0B
> >>>
> >>> The packet 46 bytes long will look like the following:
> >>> seg0 - 14B long @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
> >>> seg1 - 20B long @ 0 in mbuf from pool1
> >>> seg2 - 12B long @ 0 in mbuf from pool2
> >>>
> >>> The packet 1500 bytes long will look like the following:
> >>> seg0 - 14B @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
> >>> seg1 - 20B @ 0 in mbuf from pool1
> >>> seg2 - 20B @ 0 in mbuf from pool2
> >>> seg3 - 512B @ 0 in mbuf from pool3
> >>> seg4 - 512B @ 0 in mbuf from pool3
> >>> seg5 - 422B @ 0 in mbuf from pool3
> >>
> >> The behaviour is logical, but what to do if HW can't do it, i.e. use
> >> the last segment many times. Should it reject configuration if
> >> provided segments are insufficient to fit MTU packet? How to report
> >> the limitation?
> >> (I'm still trying to convince that SCATTER and BUFFER_SPLIT should be
> >> independent).
> >
> > BUFFER_SPLIT is rather the way to tune SCATTER. Currently scattering
> > happens on unconditional mbuf data buffer boundaries (we have reserved
> > HEAD space in the first mbuf and fill this one to the buffer end, the
> > next mbuf buffers might be filled completely). BUFFER_SPLIT provides
> > the way to specify the desired points to split packet, not just
> > blindly follow buffer boundaries. There is the check inplemented in
> > common part if each split segment fits the mbuf allocated from
> appropriate pool.
> > PMD should do extra check internally whether it supports the requested
> > split settings, if not - call will be rejected.
> >
> 
> @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers think
> about it.
> 
> > [snip]
> >>
> >> I dislike the idea to introduce new device operation.
> >> rte_eth_rxconf has reserved space and BUFFER_SPLIT offload will mean
> >> that PMD looks at the split configuration location there.
> >>
> > We considered the approach of pushing split setting to the rxconf
> > structure.
> >
> > [http://patches.dpdk.org/patch/75205/]
> > But it seems there are some issues:
> >
> > - the split configuration description requires the variable length
> > array (due to variations in number of segments), so rte_eth_rxconf
> > structure would have the variable length (not nice, IMO).
> >
> > We could push pointers to the array of rte_eth_rxseg, but we would
> > lost the single structure (and contiguous memory) simplicity, this
> > approach has no advantages over the specifying the split configuration
> > as parameters of setup_ex().
> >
> 
> I think it has a huge advantage to avoid extra device operation.
> 
> > - it would introduces the ambiguity, rte_eth_rx_queue_setup()
> > specifies the single mbuf pool as parameter. What should we do with
> > it? Set to NULL? Treat as the first pool? I would prefer to specify
> > all split segments in uniform fashion, i.e. as array or rte_eth_rxseg
> > structures (and it can be easily updated with some extra segment
> > attributes if needed). So, in my opinion, we should remove/replace the
> > pool parameter in rx_queue_setup (by introducing new func).
> >
> 
> I'm trying to resolve the ambiguity as described above (see BUFFER_SPLIT vs
> SCATTER). Use the pointer for tail segments with respect to SCATTER
> capability.
> 
> > - specifying the new extended setup roiutine has an advantage that we
> > should not update any PMDs code in part of existing implementations of
> > rte_eth_rx_queue_setup().
> 
> It is not required since it is controlled by the new offload flags. If the offload
> is not supported, the new field is invisible for PMD (it simply ignores).
> 
> >
> > If PMD supports BUFFER_SPLIT (or other related feature) it just should
> > provide
> > rte_eth_rx_queue_setup_ex() and check the
> DEV_RX_OFFLOAD_BUFFER_SPLIT
> > (or HEADER_SPLIT, or ever feature) it supports. The common code does
> > not check the feature flags - it is on PMDs' own. In order to
> > configure PMD to perfrom arbitrary desired Rx spliting the application
> > should check DEV_RX_OFFLOAD_BUFFER_SPLIT in port capabilites, if found
> > - set DEV_RX_OFFLOAD_BUFFER_SPLIT in configuration and call
> > rte_eth_rx_queue_setup_ex().
> > And this approach can be followed for any other split related feature.
> >
> > With best regards, Slava
> >
>
  
Thomas Monjalon Oct. 12, 2020, 3:14 p.m. UTC | #5
12/10/2020 11:56, Slava Ovsiienko:
> Hi, Andrew
> 
> Thank you for the comments.
> 
> We have two approaches how to specify multiple segments to split Rx packets:
> 1. update queue configuration structure
> 2. introduce new rx_queue_setup_ex() routine with extra parameters.
> 
> For [1] my only actual dislike is that we would have multiple places to specify
> the pool - in rx_queue_setup() and in the config structure. So, we should
> implement some checking (if we have offload flag set we should check
> whether mp parameter is NULL and segment descriptions array pointer/size
> is provided, if no offload flag set - we must check the description array is empty). 
> 
> > @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers think
> > about it.
> 
> Yes, it would be very nice to hear extra opinions. Do we think the providing
> of extra API function is worse than extending existing structure, introducing
> some conditional ambiguity and complicating the parameter compliance
> check?

Let's try listing pros and cons of each approach, so we can conclude.

1/ update queue config struct

	1.1 pro: keep same queue setup function
	1.2 con: two mempool pointers (struct or function)
	1.3 con: variable size of segment description array

2/ new queue setup function

	2.1 con: two functions for queue setup
	2.2 pro: mempool pointer is not redundant
	2.3 pro: segment description array size defined by the caller

What else am I missing?
  
Ananyev, Konstantin Oct. 12, 2020, 3:28 p.m. UTC | #6
> 
> 12/10/2020 11:56, Slava Ovsiienko:
> > Hi, Andrew
> >
> > Thank you for the comments.
> >
> > We have two approaches how to specify multiple segments to split Rx packets:
> > 1. update queue configuration structure
> > 2. introduce new rx_queue_setup_ex() routine with extra parameters.
> >
> > For [1] my only actual dislike is that we would have multiple places to specify
> > the pool - in rx_queue_setup() and in the config structure. So, we should
> > implement some checking (if we have offload flag set we should check
> > whether mp parameter is NULL and segment descriptions array pointer/size
> > is provided, if no offload flag set - we must check the description array is empty).
> >
> > > @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers think
> > > about it.
> >
> > Yes, it would be very nice to hear extra opinions. Do we think the providing
> > of extra API function is worse than extending existing structure, introducing
> > some conditional ambiguity and complicating the parameter compliance
> > check?
> 
> Let's try listing pros and cons of each approach, so we can conclude.
> 
> 1/ update queue config struct
> 
> 	1.1 pro: keep same queue setup function
> 	1.2 con: two mempool pointers (struct or function)
> 	1.3 con: variable size of segment description array
> 
> 2/ new queue setup function
> 
> 	2.1 con: two functions for queue setup
> 	2.2 pro: mempool pointer is not redundant
> 	2.3 pro: segment description array size defined by the caller
> 
> What else I'm missing?
> 

My 2 cents: can we make the new (_ex) function work for both the
original config (1 mp for all sizes, no split) and the new config
(multiple mp, split allowed)?
Then in the future (21.11?) we can either get rid of the original one,
or even make it a wrapper around the new one?
Konstantin
  
Slava Ovsiienko Oct. 12, 2020, 3:34 p.m. UTC | #7
> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Monday, October 12, 2020 18:28
> To: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew
> Rybchenko <Andrew.Rybchenko@oktetlabs.ru>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; Slava Ovsiienko <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; stephen@networkplumber.org; Shahaf Shuler
> <shahafs@nvidia.com>; olivier.matz@6wind.com; jerinjacobk@gmail.com;
> maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf Penso
> <asafp@nvidia.com>
> Subject: RE: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
> 
> 
> 
> >
> > 12/10/2020 11:56, Slava Ovsiienko:
> > > Hi, Andrew
> > >
> > > Thank you for the comments.
> > >
> > > We have two approaches how to specify multiple segments to split Rx
> packets:
> > > 1. update queue configuration structure 2. introduce new
> > > rx_queue_setup_ex() routine with extra parameters.
> > >
> > > For [1] my only actual dislike is that we would have multiple places
> > > to specify the pool - in rx_queue_setup() and in the config
> > > structure. So, we should implement some checking (if we have offload
> > > flag set we should check whether mp parameter is NULL and segment
> > > descriptions array pointer/size is provided, if no offload flag set - we must
> check the description array is empty).
> > >
> > > > @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers
> > > > think about it.
> > >
> > > Yes, it would be very nice to hear extra opinions. Do we think the
> > > providing of extra API function is worse than extending existing
> > > structure, introducing some conditional ambiguity and complicating
> > > the parameter compliance check?
> >
> > Let's try listing pros and cons of each approach, so we can conclude.
> >
> > 1/ update queue config struct
> >
> > 	1.1 pro: keep same queue setup function
> > 	1.2 con: two mempool pointers (struct or function)
> > 	1.3 con: variable size of segment description array
> >
> > 2/ new queue setup function
> >
> > 	2.1 con: two functions for queue setup
> > 	2.2 pro: mempool pointer is not redundant
> > 	2.3 pro: segment description array size defined by the caller
> >
> > What else I'm missing?
> >
> 
> My 2 cents: can we make new (_ex) function to work for both original config
> (1 mp for all sizes, no split) and for new config (multiple mp, split allowed)?
> Then in future (21.11?) we can either get rid of original one, or even make it
> a wrapper around all one?
> Konstantin

Yes, actually the mlx5 PMD implementation follows this approach -
specifying a segment description array with a single element and
zero size/offset provides exactly the same configuration as the
existing rte_eth_rx_queue_setup().
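
For illustration, a sketch of that equivalence (hypothetical wrapper,
assuming the rte_eth_rx_queue_setup_ex() prototype from this RFC):

static int
legacy_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
                      uint16_t nb_rx_desc, unsigned int socket_id,
                      const struct rte_eth_rxconf *rx_conf,
                      struct rte_mempool *mp)
{
    /* Single segment with zero length/offset: the data length is deduced
     * from the pool, i.e. the legacy (non-split) behaviour. */
    struct rte_eth_rxseg seg = { .mp = mp, .length = 0, .offset = 0 };

    return rte_eth_rx_queue_setup_ex(port_id, rx_queue_id, nb_rx_desc,
                                     socket_id, rx_conf, &seg, 1);
}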

Currently I'm detailing the description (how the HEADROOM is handled,
what happens if the array is shorter than the buffer chain for a segment
of maximal size, that a zero segment size means following the value deduced
from the pool, and so on).

So, may we consider this point as one more "pro" for the setup_ex approach? 😊

With best regards, Slava
  
Ananyev, Konstantin Oct. 12, 2020, 3:56 p.m. UTC | #8
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Sent: Monday, October 12, 2020 18:28
> > To: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew
> > Rybchenko <Andrew.Rybchenko@oktetlabs.ru>; Yigit, Ferruh
> > <ferruh.yigit@intel.com>; Slava Ovsiienko <viacheslavo@nvidia.com>
> > Cc: dev@dpdk.org; stephen@networkplumber.org; Shahaf Shuler
> > <shahafs@nvidia.com>; olivier.matz@6wind.com; jerinjacobk@gmail.com;
> > maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf Penso
> > <asafp@nvidia.com>
> > Subject: RE: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
> >
> >
> >
> > >
> > > 12/10/2020 11:56, Slava Ovsiienko:
> > > > Hi, Andrew
> > > >
> > > > Thank you for the comments.
> > > >
> > > > We have two approaches how to specify multiple segments to split Rx
> > packets:
> > > > 1. update queue configuration structure 2. introduce new
> > > > rx_queue_setup_ex() routine with extra parameters.
> > > >
> > > > For [1] my only actual dislike is that we would have multiple places
> > > > to specify the pool - in rx_queue_setup() and in the config
> > > > structure. So, we should implement some checking (if we have offload
> > > > flag set we should check whether mp parameter is NULL and segment
> > > > descriptions array pointer/size is provided, if no offload flag set - we must
> > check the description array is empty).
> > > >
> > > > > @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers
> > > > > think about it.
> > > >
> > > > Yes, it would be very nice to hear extra opinions. Do we think the
> > > > providing of extra API function is worse than extending existing
> > > > structure, introducing some conditional ambiguity and complicating
> > > > the parameter compliance check?
> > >
> > > Let's try listing pros and cons of each approach, so we can conclude.
> > >
> > > 1/ update queue config struct
> > >
> > > 	1.1 pro: keep same queue setup function
> > > 	1.2 con: two mempool pointers (struct or function)
> > > 	1.3 con: variable size of segment description array
> > >
> > > 2/ new queue setup function
> > >
> > > 	2.1 con: two functions for queue setup
> > > 	2.2 pro: mempool pointer is not redundant
> > > 	2.3 pro: segment description array size defined by the caller
> > >
> > > What else I'm missing?
> > >
> >
> > My 2 cents: can we make new (_ex) function to work for both original config
> > (1 mp for all sizes, no split) and for new config (multiple mp, split allowed)?
> > Then in future (21.11?) we can either get rid of original one, or even make it
> > a wrapper around all one?
> > Konstantin
> 
> Yes, actually the mlx5 PMD implementation follows this approach -
> specifying the segment description array with the only element
> and zero size/offset provides exactly the same configuration as existing
> rte_eth_rx_queue_setup().
> 
> Currently I'm detailing the description  (how HEAD_ROOM is handled, what happens
> if array is shorter the the buffer chain for segment of maximal size, the zero segment
> size means follow the value deduced from the pool and so on).
> 
> So, may we consider this point as one more "pro" to setup_ex approach ? 😊
> 

From my perspective, yes.
It is sort of a more gradual approach.
I expect it would be an experimental function for some time,
so we'll have time to try it, adjust, fix, etc. without breaking the original one.
  
Slava Ovsiienko Oct. 12, 2020, 3:59 p.m. UTC | #9
> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Monday, October 12, 2020 18:56
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; NBU-Contact-Thomas
> Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> <Andrew.Rybchenko@oktetlabs.ru>; Yigit, Ferruh <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org; stephen@networkplumber.org; Shahaf Shuler
> <shahafs@nvidia.com>; olivier.matz@6wind.com; jerinjacobk@gmail.com;
> maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf Penso
> <asafp@nvidia.com>
> Subject: RE: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
> 
> 
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > Sent: Monday, October 12, 2020 18:28
> > > To: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew
> > > Rybchenko <Andrew.Rybchenko@oktetlabs.ru>; Yigit, Ferruh
> > > <ferruh.yigit@intel.com>; Slava Ovsiienko <viacheslavo@nvidia.com>
> > > Cc: dev@dpdk.org; stephen@networkplumber.org; Shahaf Shuler
> > > <shahafs@nvidia.com>; olivier.matz@6wind.com;
> jerinjacobk@gmail.com;
> > > maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf
> Penso
> > > <asafp@nvidia.com>
> > > Subject: RE: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
> > >
> > >
> > >
> > > >
> > > > 12/10/2020 11:56, Slava Ovsiienko:
> > > > > Hi, Andrew
> > > > >
> > > > > Thank you for the comments.
> > > > >
> > > > > We have two approaches how to specify multiple segments to split
> > > > > Rx
> > > packets:
> > > > > 1. update queue configuration structure 2. introduce new
> > > > > rx_queue_setup_ex() routine with extra parameters.
> > > > >
> > > > > For [1] my only actual dislike is that we would have multiple
> > > > > places to specify the pool - in rx_queue_setup() and in the
> > > > > config structure. So, we should implement some checking (if we
> > > > > have offload flag set we should check whether mp parameter is
> > > > > NULL and segment descriptions array pointer/size is provided, if
> > > > > no offload flag set - we must
> > > check the description array is empty).
> > > > >
> > > > > > @Thomas, @Ferruh: I'd like to hear what other ethdev
> > > > > > maintainers think about it.
> > > > >
> > > > > Yes, it would be very nice to hear extra opinions. Do we think
> > > > > the providing of extra API function is worse than extending
> > > > > existing structure, introducing some conditional ambiguity and
> > > > > complicating the parameter compliance check?
> > > >
> > > > Let's try listing pros and cons of each approach, so we can conclude.
> > > >
> > > > 1/ update queue config struct
> > > >
> > > > 	1.1 pro: keep same queue setup function
> > > > 	1.2 con: two mempool pointers (struct or function)
> > > > 	1.3 con: variable size of segment description array
> > > >
> > > > 2/ new queue setup function
> > > >
> > > > 	2.1 con: two functions for queue setup
> > > > 	2.2 pro: mempool pointer is not redundant
> > > > 	2.3 pro: segment description array size defined by the caller
> > > >
> > > > What else I'm missing?
> > > >
> > >
> > > My 2 cents: can we make new (_ex) function to work for both original
> > > config
> > > (1 mp for all sizes, no split) and for new config (multiple mp, split
> allowed)?
> > > Then in future (21.11?) we can either get rid of original one, or
> > > even make it a wrapper around all one?
> > > Konstantin
> >
> > Yes, actually the mlx5 PMD implementation follows this approach -
> > specifying the segment description array with the only element and
> > zero size/offset provides exactly the same configuration as existing
> > rte_eth_rx_queue_setup().
> >
> > Currently I'm detailing the description  (how HEAD_ROOM is handled,
> > what happens if array is shorter the the buffer chain for segment of
> > maximal size, the zero segment size means follow the value deduced from
> the pool and so on).
> >
> > So, may we consider this point as one more "pro" to setup_ex approach
> > ? 😊
> >
> 
> From my perspective, yes.
> It is sort of more gradual approach.
> I expect it would be experimental function for some time, so we'll have time
> to try it, adjust, fix, etc without breaking original one.
> 
Thank you for providing your opinion (whatever).
Yes, the function will be marked as experimental.

With best regards, Slava
  
Andrew Rybchenko Oct. 12, 2020, 4:03 p.m. UTC | #10
On 10/12/20 6:14 PM, Thomas Monjalon wrote:
> 12/10/2020 11:56, Slava Ovsiienko:
>> Hi, Andrew
>>
>> Thank you for the comments.
>>
>> We have two approaches how to specify multiple segments to split Rx packets:
>> 1. update queue configuration structure
>> 2. introduce new rx_queue_setup_ex() routine with extra parameters.
>>
>> For [1] my only actual dislike is that we would have multiple places to specify
>> the pool - in rx_queue_setup() and in the config structure. So, we should
>> implement some checking (if we have offload flag set we should check
>> whether mp parameter is NULL and segment descriptions array pointer/size
>> is provided, if no offload flag set - we must check the description array is empty). 
>>
>>> @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers think
>>> about it.
>>
>> Yes, it would be very nice to hear extra opinions. Do we think the providing
>> of extra API function is worse than extending existing structure, introducing
>> some conditional ambiguity and complicating the parameter compliance
>> check?
> 
> Let's try listing pros and cons of each approach, so we can conclude.
> 
> 1/ update queue config struct
> 
> 	1.1 pro: keep same queue setup function

pro: no code duplication

> 	1.2 con: two mempool pointers (struct or function)
> 	1.3 con: variable size of segment description array
> 
> 2/ new queue setup function
> 
> 	2.1 con: two functions for queue setup

con: code duplication or refactoring of existing stable code

> 	2.2 pro: mempool pointer is not redundant
> 	2.3 pro: segment description array size defined by the caller
> 
> What else I'm missing?
>
  
Slava Ovsiienko Oct. 12, 2020, 4:10 p.m. UTC | #11
> -----Original Message-----
> From: Andrew Rybchenko <Andrew.Rybchenko@oktetlabs.ru>
> Sent: Monday, October 12, 2020 19:04
> To: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> ferruh.yigit@intel.com; Slava Ovsiienko <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; stephen@networkplumber.org; Shahaf Shuler
> <shahafs@nvidia.com>; olivier.matz@6wind.com; jerinjacobk@gmail.com;
> maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf Penso
> <asafp@nvidia.com>
> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
> 
> On 10/12/20 6:14 PM, Thomas Monjalon wrote:
> > 12/10/2020 11:56, Slava Ovsiienko:
> >> Hi, Andrew
> >>
> >> Thank you for the comments.
> >>
> >> We have two approaches how to specify multiple segments to split Rx
> packets:
> >> 1. update queue configuration structure 2. introduce new
> >> rx_queue_setup_ex() routine with extra parameters.
> >>
> >> For [1] my only actual dislike is that we would have multiple places
> >> to specify the pool - in rx_queue_setup() and in the config
> >> structure. So, we should implement some checking (if we have offload
> >> flag set we should check whether mp parameter is NULL and segment
> >> descriptions array pointer/size is provided, if no offload flag set - we must
> check the description array is empty).
> >>
> >>> @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers
> >>> think about it.
> >>
> >> Yes, it would be very nice to hear extra opinions. Do we think the
> >> providing of extra API function is worse than extending existing
> >> structure, introducing some conditional ambiguity and complicating
> >> the parameter compliance check?
> >
> > Let's try listing pros and cons of each approach, so we can conclude.
> >
> > 1/ update queue config struct
> >
> > 	1.1 pro: keep same queue setup function
> 
> pro: no code duplication
> 
> > 	1.2 con: two mempool pointers (struct or function)
> > 	1.3 con: variable size of segment description array
> >
> > 2/ new queue setup function
> >
> > 	2.1 con: two functions for queue setup
> 
> con: code duplication or refactoring of existing stable code

- no refactoring of existing rte_eth_rx_queue_setup() - it is kept intact
- yes, there is some duplication in rte_eth_rxseg_queue_setup, but
  a large part of the code is new - there is a very specific check for the
  split buffer parameters.
- no code duplication at the PMD level - both ways go to the same
  internal routine
- PMD code must be refactored anyway, no con/pro for this point -
  the updated PMD Rx queue setup must handle the new split format,
  no way to drop it.

> 
> > 	2.2 pro: mempool pointer is not redundant
> > 	2.3 pro: segment description array size defined by the caller
> >
> > What else I'm missing?
> >

With best regards, Slava
  
Thomas Monjalon Oct. 12, 2020, 4:52 p.m. UTC | #12
12/10/2020 17:56, Ananyev, Konstantin:
> > From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > > 12/10/2020 11:56, Slava Ovsiienko:
> > > > > We have two approaches how to specify multiple segments to split Rx
> > > packets:
> > > > > 1. update queue configuration structure 2. introduce new
> > > > > rx_queue_setup_ex() routine with extra parameters.
> > > > >
> > > > > For [1] my only actual dislike is that we would have multiple places
> > > > > to specify the pool - in rx_queue_setup() and in the config
> > > > > structure. So, we should implement some checking (if we have offload
> > > > > flag set we should check whether mp parameter is NULL and segment
> > > > > descriptions array pointer/size is provided, if no offload flag set - we must
> > > check the description array is empty).
> > > > >
> > > > > > @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers
> > > > > > think about it.
> > > > >
> > > > > Yes, it would be very nice to hear extra opinions. Do we think the
> > > > > providing of extra API function is worse than extending existing
> > > > > structure, introducing some conditional ambiguity and complicating
> > > > > the parameter compliance check?
> > > >
> > > > Let's try listing pros and cons of each approach, so we can conclude.
> > > >
> > > > 1/ update queue config struct
> > > >
> > > > 	1.1 pro: keep same queue setup function
> > > > 	1.2 con: two mempool pointers (struct or function)
> > > > 	1.3 con: variable size of segment description array
> > > >
> > > > 2/ new queue setup function
> > > >
> > > > 	2.1 con: two functions for queue setup
> > > > 	2.2 pro: mempool pointer is not redundant
> > > > 	2.3 pro: segment description array size defined by the caller
> > > >
> > > > What else I'm missing?
> > > >
> > >
> > > My 2 cents: can we make new (_ex) function to work for both original config
> > > (1 mp for all sizes, no split) and for new config (multiple mp, split allowed)?
> > > Then in future (21.11?) we can either get rid of original one, or even make it
> > > a wrapper around all one?
> > > Konstantin
> > 
> > Yes, actually the mlx5 PMD implementation follows this approach -
> > specifying the segment description array with the only element
> > and zero size/offset provides exactly the same configuration as existing
> > rte_eth_rx_queue_setup().
> > 
> > Currently I'm detailing the description  (how HEAD_ROOM is handled, what happens
> > if array is shorter the the buffer chain for segment of maximal size, the zero segment
> > size means follow the value deduced from the pool and so on).
> > 
> > So, may we consider this point as one more "pro" to setup_ex approach ? 😊
> 
> From my perspective, yes.
> It is sort of more gradual approach.
> I expect it would be experimental function for some time,
> so we'll have time to try it, adjust, fix, etc without breaking original one.

I like the wrapper idea.
Is it possible to call rte_eth_rx_queue_setup_ex()
from rte_eth_rx_queue_setup() using a rte_eth_rxseg object on the stack?
  
Ferruh Yigit Oct. 13, 2020, 9:59 p.m. UTC | #13
On 10/12/2020 10:56 AM, Slava Ovsiienko wrote:
> Hi, Andrew
> 
> Thank you for the comments.
> 
> We have two approaches for specifying multiple segments to split Rx packets:
> 1. update the queue configuration structure
> 2. introduce a new rx_queue_setup_ex() routine with extra parameters.
> 
> For [1] my only actual dislike is that we would have multiple places to specify
> the pool - in rx_queue_setup() and in the config structure. So, we should
> implement some checking (if the offload flag is set, we should check that the
> mp parameter is NULL and the segment description array pointer/size is
> provided; if the offload flag is not set, we must check that the description array is empty).
> 
>> @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers think
>> about it.
> 
> Yes, it would be very nice to hear extra opinions. Do we think that providing
> an extra API function is worse than extending the existing structure, introducing
> some conditional ambiguity and complicating the parameter compliance
> check?
> 

I think the decision was already given with the deprecation notice, which says 
``rte_eth_rxconf`` will be updated for this.

With the new API we need to create a new dev_ops too, and I am not sure about 
creating a new dev_ops for a single PMD.

A PMD that supports this feature will need two dev_ops that are fairly 
close to each other; as Andrew mentioned, this is duplication.

And from the user perspective, two setup functions with overlapping behaviour can be confusing.

+1 to having a single setup function but updating the config; I can see v5 was sent 
this way, I will check it.


> Now I'm updating the existing version of the patch based on rx_queue_setup_ex()
> and then could prepare the version based on the configuration structure;
> it is not a problem - the approaches are very similar, we just should choose
> the most relevant one.
> 
> With best regards, Slava
> 
>> -----Original Message-----
>> From: Andrew Rybchenko <Andrew.Rybchenko@oktetlabs.ru>
>> Sent: Monday, October 12, 2020 11:45
>> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
>> Cc: Thomas Monjalon <thomasm@mellanox.com>;
>> stephen@networkplumber.org; ferruh.yigit@intel.com; Shahaf Shuler
>> <shahafs@nvidia.com>; olivier.matz@6wind.com; jerinjacobk@gmail.com;
>> maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf Penso
>> <asafp@nvidia.com>
>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
>>
>> Hi Slava,
>>
>> I'm sorry for late reply. See my notes below.
>>
>> On 10/1/20 11:54 AM, Slava Ovsiienko wrote:
>>> Hi, Andrew
>>>
>>> Thank you for the comments, please see my replies below.
>>>
>>>> -----Original Message-----
>>>> From: Andrew Rybchenko <arybchenko@solarflare.com>
>>>> Sent: Thursday, September 17, 2020 19:55
>>>> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org
>>>> Cc: Thomas Monjalon <thomasm@mellanox.com>;
>>>> stephen@networkplumber.org; ferruh.yigit@intel.com; Shahaf Shuler
>>>> <shahafs@nvidia.com>; olivier.matz@6wind.com;
>> jerinjacobk@gmail.com;
>>>> maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf
>> Penso
>>>> <asafp@nvidia.com>
>>>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
>>>>
>>> [snip]
>>>>>
>>>>> For example, let's suppose we configured the Rx queue with the
>>>>> following segments:
>>>>> seg0 - pool0, len0=14B, off0=RTE_PKTMBUF_HEADROOM
>>>>> seg1 - pool1, len1=20B, off1=0B
>>>>> seg2 - pool2, len2=20B, off2=0B
>>>>> seg3 - pool3, len3=512B, off3=0B
>>>>>
>>>>> The packet 46 bytes long will look like the following:
>>>>> seg0 - 14B long @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
>>>>> seg1 - 20B long @ 0 in mbuf from pool1
>>>>> seg2 - 12B long @ 0 in mbuf from pool2
>>>>>
>>>>> The packet 1500 bytes long will look like the following:
>>>>> seg0 - 14B @ RTE_PKTMBUF_HEADROOM in mbuf from pool0
>>>>> seg1 - 20B @ 0 in mbuf from pool1
>>>>> seg2 - 20B @ 0 in mbuf from pool2
>>>>> seg3 - 512B @ 0 in mbuf from pool3
>>>>> seg4 - 512B @ 0 in mbuf from pool3
>>>>> seg5 - 422B @ 0 in mbuf from pool3
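For reference, the quoted configuration can be expressed with the rte_eth_rxseg structure from this RFC roughly as follows; the helper name and the pool pointers are assumptions of the sketch, created elsewhere by the application:

#include <rte_common.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Sketch: build the quoted four-segment split configuration and pass it
 * to the proposed setup_ex(); the pools are created by the application.
 */
static int
setup_example_split_queue(uint16_t port_id, uint16_t queue_id,
			  uint16_t nb_desc, unsigned int socket_id,
			  const struct rte_eth_rxconf *conf,
			  struct rte_mempool *pool0, struct rte_mempool *pool1,
			  struct rte_mempool *pool2, struct rte_mempool *pool3)
{
	const struct rte_eth_rxseg segs[] = {
		{ .mp = pool0, .length = 14,  .offset = RTE_PKTMBUF_HEADROOM },
		{ .mp = pool1, .length = 20,  .offset = 0 },
		{ .mp = pool2, .length = 20,  .offset = 0 },
		{ .mp = pool3, .length = 512, .offset = 0 },
	};

	return rte_eth_rx_queue_setup_ex(port_id, queue_id, nb_desc, socket_id,
					 conf, segs, RTE_DIM(segs));
}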
>>>>
>>>> The behaviour is logical, but what to do if the HW can't do it, i.e. cannot use
>>>> the last segment many times. Should it reject the configuration if the
>>>> provided segments are insufficient to fit an MTU-sized packet? How to report
>>>> the limitation?
>>>> (I'm still trying to argue that SCATTER and BUFFER_SPLIT should be
>>>> independent).
>>>
>>> BUFFER_SPLIT is rather a way to tune SCATTER. Currently scattering
>>> happens on unconditional mbuf data buffer boundaries (we have reserved
>>> headroom space in the first mbuf and fill it up to the buffer end; the
>>> next mbuf buffers might be filled completely). BUFFER_SPLIT provides
>>> a way to specify the desired points to split the packet, not just to
>>> blindly follow buffer boundaries. There is a check implemented in the
>>> common part that each split segment fits the mbuf allocated from the
>>> appropriate pool.
>>> The PMD should do an extra check internally whether it supports the requested
>>> split settings; if not, the call will be rejected.
>>>
>>
>> @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers think
>> about it.
>>
>>> [snip]
>>>>
>>>> I dislike the idea of introducing a new device operation.
>>>> rte_eth_rxconf has reserved space, and the BUFFER_SPLIT offload will mean
>>>> that the PMD looks at the split configuration located there.
>>>>
>>> We considered the approach of pushing the split settings into the rxconf
>>> structure.
>>>
>>> [http://patches.dpdk.org/patch/75205/]
>>> But it seems there are some issues:
>>>
>>> - the split configuration description requires a variable length
>>> array (due to variations in the number of segments), so the rte_eth_rxconf
>>> structure would have a variable length (not nice, IMO).
>>>
>>> We could push a pointer to the array of rte_eth_rxseg, but we would
>>> lose the single-structure (and contiguous memory) simplicity; this
>>> approach has no advantages over specifying the split configuration
>>> as parameters of setup_ex().
>>>
>>
>> I think avoiding an extra device operation is a huge advantage.
>>
>>> - it would introduce ambiguity: rte_eth_rx_queue_setup()
>>> specifies a single mbuf pool as a parameter. What should we do with
>>> it? Set it to NULL? Treat it as the first pool? I would prefer to specify
>>> all split segments in a uniform fashion, i.e. as an array of rte_eth_rxseg
>>> structures (and it can be easily extended with some extra segment
>>> attributes if needed). So, in my opinion, we should remove/replace the
>>> pool parameter in rx_queue_setup (by introducing a new function).
>>>
>>
>> I'm trying to resolve the ambiguity as described above (see BUFFER_SPLIT vs
>> SCATTER): use the pool pointer for the tail segments, with respect to the SCATTER
>> capability.
>>
>>> - specifying the new extended setup routine has the advantage that we
>>> do not need to update any PMD code in the existing implementations of
>>> rte_eth_rx_queue_setup().
>>
>> It is not required, since it is controlled by the new offload flags. If the offload
>> is not supported, the new field is invisible to the PMD (it simply ignores it).
>>
>>>
>>> If a PMD supports BUFFER_SPLIT (or another related feature) it just should
>>> provide rte_eth_rx_queue_setup_ex() and check the DEV_RX_OFFLOAD_BUFFER_SPLIT
>>> (or HEADER_SPLIT, or whatever feature) flag it supports. The common code does
>>> not check the feature flags - that is the PMD's own responsibility. In order to
>>> configure a PMD to perform the arbitrary desired Rx splitting, the application
>>> should check DEV_RX_OFFLOAD_BUFFER_SPLIT in the port capabilities; if found,
>>> it should set DEV_RX_OFFLOAD_BUFFER_SPLIT in the configuration and call
>>> rte_eth_rx_queue_setup_ex().
>>> And this approach can be followed for any other split-related feature.
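A hedged sketch of that application-side flow, using the names proposed in this RFC; the helper, its parameters and the fallback pool are illustrative assumptions:

#include <rte_ethdev.h>

/* Sketch: enable Rx buffer split on one queue if the port advertises
 * DEV_RX_OFFLOAD_BUFFER_SPLIT, otherwise fall back to the legacy setup.
 * rx_segs/rx_nseg and def_pool are assumed to be prepared by the caller.
 */
static int
setup_rxq_with_split(uint16_t port_id, uint16_t queue_id, uint16_t nb_desc,
		     unsigned int socket_id, struct rte_eth_rxconf rxconf,
		     const struct rte_eth_rxseg *rx_segs, uint16_t rx_nseg,
		     struct rte_mempool *def_pool)
{
	struct rte_eth_dev_info dev_info;
	int ret = rte_eth_dev_info_get(port_id, &dev_info);

	if (ret != 0)
		return ret;
	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_BUFFER_SPLIT) {
		rxconf.offloads |= DEV_RX_OFFLOAD_BUFFER_SPLIT;
		return rte_eth_rx_queue_setup_ex(port_id, queue_id, nb_desc,
						 socket_id, &rxconf,
						 rx_segs, rx_nseg);
	}
	return rte_eth_rx_queue_setup(port_id, queue_id, nb_desc,
				      socket_id, &rxconf, def_pool);
}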
>>>
>>> With best regards, Slava
>>>
>>
>
  
Thomas Monjalon Oct. 14, 2020, 7:17 a.m. UTC | #14
13/10/2020 23:59, Ferruh Yigit:
> On 10/12/2020 10:56 AM, Slava Ovsiienko wrote:
> > We have two approaches for specifying multiple segments to split Rx packets:
> > 1. update the queue configuration structure
> > 2. introduce a new rx_queue_setup_ex() routine with extra parameters.
> > 
> > For [1] my only actual dislike is that we would have multiple places to specify
> > the pool - in rx_queue_setup() and in the config structure. So, we should
> > implement some checking (if the offload flag is set, we should check that the
> > mp parameter is NULL and the segment description array pointer/size is
> > provided; if the offload flag is not set, we must check that the description array is empty).
> > 
> >> @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers think
> >> about it.
> > 
> > Yes, it would be very nice to hear extra opinions. Do we think that providing
> > an extra API function is worse than extending the existing structure, introducing
> > some conditional ambiguity and complicating the parameter compliance
> > check?
> 
> I think the decision was already given with the deprecation notice, which says 
> ``rte_eth_rxconf`` will be updated for this.
> 
> With the new API we need to create a new dev_ops too, and I am not sure about 
> creating a new dev_ops for a single PMD.

You should not view it as a feature for a single PMD.
Yes, as always, it starts with only one PMD implementing the API,
but I really think this feature is generic and multiple NICs
will be able to support this offload.


> A PMD that supports this feature will need two dev_ops that are fairly 
> close to each other; as Andrew mentioned, this is duplication.
> 
> And from the user perspective, two setup functions with overlapping behaviour can be confusing.
> 
> +1 to having a single setup function but updating the config; I can see v5 was sent 
> this way, I will check it.
> 
> 
> > Now I'm updating the existing version of the patch based on rx_queue_setup_ex()
> > and then could prepare the version based on the configuration structure;
> > it is not a problem - the approaches are very similar, we just should choose
> > the most relevant one.
> > 
> > With best regards, Slava
  
Slava Ovsiienko Oct. 14, 2020, 7:37 a.m. UTC | #15
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Wednesday, October 14, 2020 0:59
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; Andrew Rybchenko
> <Andrew.Rybchenko@oktetlabs.ru>; dev@dpdk.org
> Cc: Thomas Monjalon <thomasm@mellanox.com>;
> stephen@networkplumber.org; Shahaf Shuler <shahafs@nvidia.com>;
> olivier.matz@6wind.com; jerinjacobk@gmail.com;
> maxime.coquelin@redhat.com; david.marchand@redhat.com; Asaf Penso
> <asafp@nvidia.com>; Konstantin Ananyev
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce Rx buffer split
> 
> On 10/12/2020 10:56 AM, Slava Ovsiienko wrote:
> > Hi, Andrew
> >
> > Thank you for the comments.
> >
> > We have two approaches for specifying multiple segments to split Rx
> > packets:
> > 1. update the queue configuration structure 2. introduce a new
> > rx_queue_setup_ex() routine with extra parameters.
> >
> > For [1] my only actual dislike is that we would have multiple places
> > to specify the pool - in rx_queue_setup() and in the config structure.
> > So, we should implement some checking (if the offload flag is set, we
> > should check that the mp parameter is NULL and the segment description
> > array pointer/size is provided; if the offload flag is not set, we must check that the
> > description array is empty).
> >
> >> @Thomas, @Ferruh: I'd like to hear what other ethdev maintainers
> >> think about it.
> >
> > Yes, it would be very nice to hear extra opinions. Do we think that
> > providing an extra API function is worse than extending the existing
> > structure, introducing some conditional ambiguity and complicating the
> > parameter compliance check?
> >
> 
> I think the decision was already given with the deprecation notice, which says
> ``rte_eth_rxconf`` will be updated for this.
> 
> With the new API we need to create a new dev_ops too, and I am not sure about
> creating a new dev_ops for a single PMD.

I would rather consider the feature as a generic one, not as one for a "single PMD".
Currently DPDK does not provide any flexibility regarding the Rx buffer, and applications 
simply have no way to control Rx buffer attributes. I suppose the split
is not very specific to mlx5; it is very likely that the hardware of many other
vendors is capable of performing the same thing.

> 
> A PMD that supports this feature will need two dev_ops that are fairly
> close to each other; as Andrew mentioned, this is duplication.
> 
> And from the user perspective, two setup functions with overlapping behaviour can be
> confusing.
> 
> +1 to having a single setup function but updating the config; I can see v5 was sent
> this way, I will check it.

OK, got it, thank you very much for the opinion, let's move in this direction.
v5 with the reverted approach (back to the deprecation notice) has been sent; I will address the comments.

With best regards, Slava
  

Patch

diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 7858ad5..638e42d 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -1933,6 +1933,172 @@  struct rte_eth_dev *
 }
 
 int
+rte_eth_rx_queue_setup_ex(uint16_t port_id, uint16_t rx_queue_id,
+			  uint16_t nb_rx_desc, unsigned int socket_id,
+			  const struct rte_eth_rxconf *rx_conf,
+			  const struct rte_eth_rxseg *rx_seg, uint16_t n_seg)
+{
+	int ret;
+	uint16_t seg_idx;
+	uint32_t mbp_buf_size;
+	struct rte_eth_dev *dev;
+	struct rte_eth_dev_info dev_info;
+	struct rte_eth_rxconf local_conf;
+	void **rxq;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	dev = &rte_eth_devices[port_id];
+	if (rx_queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", rx_queue_id);
+		return -EINVAL;
+	}
+
+	if (rx_seg == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid null description pointer\n");
+		return -EINVAL;
+	}
+
+	if (n_seg == 0) {
+		RTE_ETHDEV_LOG(ERR, "Invalid zero description number\n");
+		return -EINVAL;
+	}
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup_ex, -ENOTSUP);
+
+	/*
+	 * Check the size of the mbuf data buffer.
+	 * This value must be provided in the private data of the memory pool.
+	 * First check that the memory pool has a valid private data.
+	 */
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
+		struct rte_mempool *mp = rx_seg[seg_idx].mp;
+
+		if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
+			RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
+				mp->name, (int)mp->private_data_size,
+				(int)sizeof(struct rte_pktmbuf_pool_private));
+			return -ENOSPC;
+		}
+
+		mbp_buf_size = rte_pktmbuf_data_room_size(mp);
+		if (mbp_buf_size < rx_seg[seg_idx].length + rx_seg[seg_idx].offset) {
+			RTE_ETHDEV_LOG(ERR,
+				"%s mbuf_data_room_size %d < %d"
+				" (segment length=%d + segment offset=%d)\n",
+				mp->name, (int)mbp_buf_size,
+				(int)(rx_seg[seg_idx].length + rx_seg[seg_idx].offset),
+				(int)rx_seg[seg_idx].length,
+				(int)rx_seg[seg_idx].offset);
+			return -EINVAL;
+		}
+	}
+
+	/* Use default specified by driver, if nb_rx_desc is zero */
+	if (nb_rx_desc == 0) {
+		nb_rx_desc = dev_info.default_rxportconf.ring_size;
+		/* If driver default is also zero, fall back on EAL default */
+		if (nb_rx_desc == 0)
+			nb_rx_desc = RTE_ETH_DEV_FALLBACK_RX_RINGSIZE;
+	}
+
+	if (nb_rx_desc > dev_info.rx_desc_lim.nb_max ||
+			nb_rx_desc < dev_info.rx_desc_lim.nb_min ||
+			nb_rx_desc % dev_info.rx_desc_lim.nb_align != 0) {
+
+		RTE_ETHDEV_LOG(ERR,
+			"Invalid value for nb_rx_desc(=%hu), should be: <= %hu, >= %hu, and a product of %hu\n",
+			nb_rx_desc, dev_info.rx_desc_lim.nb_max,
+			dev_info.rx_desc_lim.nb_min,
+			dev_info.rx_desc_lim.nb_align);
+		return -EINVAL;
+	}
+
+	if (dev->data->dev_started &&
+		!(dev_info.dev_capa &
+			RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP))
+		return -EBUSY;
+
+	if (dev->data->dev_started &&
+		(dev->data->rx_queue_state[rx_queue_id] !=
+			RTE_ETH_QUEUE_STATE_STOPPED))
+		return -EBUSY;
+
+	rxq = dev->data->rx_queues;
+	if (rxq[rx_queue_id]) {
+		RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_release,
+					-ENOTSUP);
+		(*dev->dev_ops->rx_queue_release)(rxq[rx_queue_id]);
+		rxq[rx_queue_id] = NULL;
+	}
+
+	if (rx_conf == NULL)
+		rx_conf = &dev_info.default_rxconf;
+
+	local_conf = *rx_conf;
+
+	/*
+	 * If an offloading has already been enabled in
+	 * rte_eth_dev_configure(), it has been enabled on all queues,
+	 * so there is no need to enable it in this queue again.
+	 * The local_conf.offloads input to underlying PMD only carries
+	 * those offloadings which are only enabled on this queue and
+	 * not enabled on all queues.
+	 */
+	local_conf.offloads &= ~dev->data->dev_conf.rxmode.offloads;
+
+	/*
+	 * New added offloadings for this queue are those not enabled in
+	 * rte_eth_dev_configure() and they must be per-queue type.
+	 * A pure per-port offloading can't be enabled on a queue while
+	 * disabled on another queue. A pure per-port offloading can't
+	 * be enabled for any queue as new added one if it hasn't been
+	 * enabled in rte_eth_dev_configure().
+	 */
+	if ((local_conf.offloads & dev_info.rx_queue_offload_capa) !=
+	     local_conf.offloads) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port_id=%d rx_queue_id=%d, new added offloads 0x%"PRIx64" must be "
+			"within per-queue offload capabilities 0x%"PRIx64" in %s()\n",
+			port_id, rx_queue_id, local_conf.offloads,
+			dev_info.rx_queue_offload_capa,
+			__func__);
+		return -EINVAL;
+	}
+
+	/*
+	 * If LRO is enabled, check that the maximum aggregated packet
+	 * size is supported by the configured device.
+	 */
+	if (local_conf.offloads & DEV_RX_OFFLOAD_TCP_LRO) {
+		if (dev->data->dev_conf.rxmode.max_lro_pkt_size == 0)
+			dev->data->dev_conf.rxmode.max_lro_pkt_size =
+				dev->data->dev_conf.rxmode.max_rx_pkt_len;
+		int ret = check_lro_pkt_size(port_id,
+				dev->data->dev_conf.rxmode.max_lro_pkt_size,
+				dev->data->dev_conf.rxmode.max_rx_pkt_len,
+				dev_info.max_lro_pkt_size);
+		if (ret != 0)
+			return ret;
+	}
+
+	ret = (*dev->dev_ops->rx_queue_setup_ex)(dev, rx_queue_id, nb_rx_desc,
+						 socket_id, &local_conf,
+						 rx_seg, n_seg);
+	if (!ret) {
+		if (!dev->data->min_rx_buf_size ||
+		    dev->data->min_rx_buf_size > mbp_buf_size)
+			dev->data->min_rx_buf_size = mbp_buf_size;
+	}
+
+	return eth_err(port_id, ret);
+}
+
+int
 rte_eth_rx_hairpin_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 			       uint16_t nb_rx_desc,
 			       const struct rte_eth_hairpin_conf *conf)
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 70295d7..701264a 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -938,6 +938,16 @@  struct rte_eth_txmode {
 };
 
 /**
+ * A structure used to configure an RX packet segment to split.
+ */
+struct rte_eth_rxseg {
+	struct rte_mempool *mp; /**< Memory pools to allocate segment from */
+	uint16_t length; /**< Segment maximal data length */
+	uint16_t offset; /**< Data offset from beginning of mbuf data buffer */
+	uint32_t reserved; /**< Reserved field */
+};
+
+/**
  * A structure used to configure an RX ring of an Ethernet port.
  */
 struct rte_eth_rxconf {
@@ -1988,6 +1998,11 @@  int rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		const struct rte_eth_rxconf *rx_conf,
 		struct rte_mempool *mb_pool);
 
+int rte_eth_rx_queue_setup_ex(uint16_t port_id, uint16_t rx_queue_id,
+		uint16_t nb_rx_desc, unsigned int socket_id,
+		const struct rte_eth_rxconf *rx_conf,
+		const struct rte_eth_rxseg *rx_seg, uint16_t n_seg);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd..27018de 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -265,6 +265,15 @@  typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
 				    struct rte_mempool *mb_pool);
 /**< @internal Set up a receive queue of an Ethernet device. */
 
+typedef int (*eth_rx_queue_setup_ex_t)(struct rte_eth_dev *dev,
+				       uint16_t rx_queue_id,
+				       uint16_t nb_rx_desc,
+				       unsigned int socket_id,
+				       const struct rte_eth_rxconf *rx_conf,
+				       const struct rte_eth_rxseg *rx_seg,
+				       uint16_t n_seg);
+/**< @internal Set up a receive queue of an Ethernet device. */
+
 typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
 				    uint16_t tx_queue_id,
 				    uint16_t nb_tx_desc,
@@ -659,6 +668,7 @@  struct eth_dev_ops {
 	eth_queue_start_t          tx_queue_start;/**< Start TX for a queue. */
 	eth_queue_stop_t           tx_queue_stop; /**< Stop TX for a queue. */
 	eth_rx_queue_setup_t       rx_queue_setup;/**< Set up device RX queue. */
+	eth_rx_queue_setup_ex_t    rx_queue_setup_ex;/**< Set up device RX queue. */
 	eth_queue_release_t        rx_queue_release; /**< Release RX queue. */
 	eth_rx_queue_count_t       rx_queue_count;
 	/**< Get the number of used RX descriptors. */