[v6,1/6] ethdev: introduce Rx buffer split

Message ID 1602699122-15737-2-git-send-email-viacheslavo@nvidia.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series ethdev: introduce Rx buffer split |

Checks

Context Check Description
ci/checkpatch warning coding style issues

Commit Message

Slava Ovsiienko Oct. 14, 2020, 6:11 p.m. UTC
  The DPDK datapath in the transmit direction is very flexible.
An application can build the multi-segment packet and manages
almost all data aspects - the memory pools where segments
are allocated from, the segment lengths, the memory attributes
like external buffers, registered for DMA, etc.

In the receiving direction, the datapath is much less flexible,
an application can only specify the memory pool to configure the
receiving queue and nothing more. In order to extend receiving
datapath capabilities it is proposed to add the way to provide
extended information how to split the packets being received.

The following structure is introduced to specify the Rx packet
segment:

struct rte_eth_rxseg {
    struct rte_mempool *mp; /* memory pools to allocate segment from */
    uint16_t length; /* segment maximal data length,
		       	configures "split point" */
    uint16_t offset; /* data offset from beginning
		       	of mbuf data buffer */
    uint32_t reserved; /* reserved field */
};

The segment descriptions are added to the rte_eth_rxconf structure:
   rx_seg - pointer the array of segment descriptions, each element
             describes the memory pool, maximal data length, initial
             data offset from the beginning of data buffer in mbuf.
	     This array allows to specify the different settings for
	     each segment in individual fashion.
   rx_nseg - number of elements in the array

If the extended segment descriptions is provided with these new
fields the mp parameter of the rte_eth_rx_queue_setup must be
specified as NULL to avoid ambiguity.

There are two options to specifiy Rx buffer configuration:
- mp is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is zero,
  it is compatible configuraion, follows existing implementation,
  provides single pool and no description for segment sizes
  and offsets.
- mp is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not
  zero, it provides the extended configuration, individually for
  each segment.

The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
capabilities is introduced to present the way for PMD to report to
application about supporting Rx packet split to configurable
segments. Prior invoking the rte_eth_rx_queue_setup() routine
application should check RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.

If the Rx queue is configured with new settings the packets being
received will be split into multiple segments pushed to the mbufs
with specified attributes. The PMD will split the received packets
into multiple segments according to the specification in the
description array:

- the first network buffer will be allocated from the memory pool,
  specified in the first segment description element, the second
  network buffer - from the pool in the second segment description
  element and so on. If there is no enough elements to describe
  the buffer for entire packet of maximal length the pool from the
  last valid element will be used to allocate the buffers from for the
  rest of segments

- the offsets from the segment description elements will provide
  the data offset from the buffer beginning except the first mbuf -
  for this one the offset is added to the RTE_PKTMBUF_HEADROOM to get
  actual offset from the buffer beginning. If there is no enough
  elements to describe the buffer for entire packet of maximal length
  the offsets for the rest of segment will be supposed to be zero.

- the data length being received to each segment is limited  by the
  length specified in the segment description element. The data
  receiving starts with filling up the first mbuf data buffer, if the
  specified maximal segment length is reached and there are data
  remaining (packet is longer than buffer in the first mbuf) the
  following data will be pushed to the next segment up to its own
  maximal length. If the first two segments is not enough to store
  all the packet remaining data  the next (third) segment will
  be engaged and so on. If the length in the segment description
  element is zero the actual buffer size will be deduced from
  the appropriate memory pool properties. If there is no enough
  elements to describe the buffer for entire packet of maximal
  length the buffer size will be deduced from the pool of the last
  valid element for the remaining segments.

For example, let's suppose we configured the Rx queue with the
following segments:
    seg0 - pool0, len0=14B, off0=2
    seg1 - pool1, len1=20B, off1=128B
    seg2 - pool2, len2=20B, off2=0B
    seg3 - pool3, len3=512B, off3=0B

The packet 46 bytes long will look like the following:
    seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
    seg1 - 20B long @ 128 in mbuf from pool1
    seg2 - 12B long @ 0 in mbuf from pool2

The packet 1500 bytes long will look like the following:
    seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
    seg1 - 20B @ 128 in mbuf from pool1
    seg2 - 20B @ 0 in mbuf from pool2
    seg3 - 512B @ 0 in mbuf from pool3
    seg4 - 512B @ 0 in mbuf from pool3
    seg5 - 422B @ 0 in mbuf from pool3

The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
configured to support new buffer split feature (if rx_nseg
is greater than one).

The new approach would allow splitting the ingress packets into
multiple parts pushed to the memory with different attributes.
For example, the packet headers can be pushed to the embedded
data buffers within mbufs and the application data into
the external buffers attached to mbufs allocated from the
different memory pools. The memory attributes for the split
parts may differ either - for example the application data
may be pushed into the external memory located on the dedicated
physical device, say GPU or NVMe. This would improve the DPDK
receiving datapath flexibility with preserving compatibility
with existing API.

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/nics/features.rst           |  15 +++++
 doc/guides/rel_notes/deprecation.rst   |   5 --
 doc/guides/rel_notes/release_20_11.rst |   9 +++
 lib/librte_ethdev/rte_ethdev.c         | 111 +++++++++++++++++++++++++--------
 lib/librte_ethdev/rte_ethdev.h         |  62 +++++++++++++++++-
 5 files changed, 171 insertions(+), 31 deletions(-)
  

Comments

Jerin Jacob Oct. 14, 2020, 6:57 p.m. UTC | #1
On Wed, Oct 14, 2020 at 11:42 PM Viacheslav Ovsiienko
<viacheslavo@nvidia.com> wrote:
>
> The DPDK datapath in the transmit direction is very flexible.
> An application can build the multi-segment packet and manages
> almost all data aspects - the memory pools where segments
> are allocated from, the segment lengths, the memory attributes
> like external buffers, registered for DMA, etc.
>
> In the receiving direction, the datapath is much less flexible,
> an application can only specify the memory pool to configure the
> receiving queue and nothing more. In order to extend receiving
> datapath capabilities it is proposed to add the way to provide
> extended information how to split the packets being received.
>
> The following structure is introduced to specify the Rx packet
> segment:
>
> struct rte_eth_rxseg {
>     struct rte_mempool *mp; /* memory pools to allocate segment from */
>     uint16_t length; /* segment maximal data length,
>                         configures "split point" */
>     uint16_t offset; /* data offset from beginning
>                         of mbuf data buffer */
>     uint32_t reserved; /* reserved field */
> };
>
> The segment descriptions are added to the rte_eth_rxconf structure:
>    rx_seg - pointer the array of segment descriptions, each element
>              describes the memory pool, maximal data length, initial
>              data offset from the beginning of data buffer in mbuf.
>              This array allows to specify the different settings for
>              each segment in individual fashion.
>    rx_nseg - number of elements in the array
>
> If the extended segment descriptions is provided with these new
> fields the mp parameter of the rte_eth_rx_queue_setup must be
> specified as NULL to avoid ambiguity.
>
> There are two options to specifiy Rx buffer configuration:
> - mp is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is zero,
>   it is compatible configuraion, follows existing implementation,
>   provides single pool and no description for segment sizes
>   and offsets.
> - mp is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not
>   zero, it provides the extended configuration, individually for
>   each segment.
>
> The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
> capabilities is introduced to present the way for PMD to report to
> application about supporting Rx packet split to configurable
> segments. Prior invoking the rte_eth_rx_queue_setup() routine
> application should check RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.
>
> If the Rx queue is configured with new settings the packets being
> received will be split into multiple segments pushed to the mbufs
> with specified attributes. The PMD will split the received packets
> into multiple segments according to the specification in the
> description array:
>
> - the first network buffer will be allocated from the memory pool,
>   specified in the first segment description element, the second
>   network buffer - from the pool in the second segment description
>   element and so on. If there is no enough elements to describe
>   the buffer for entire packet of maximal length the pool from the
>   last valid element will be used to allocate the buffers from for the
>   rest of segments
>
> - the offsets from the segment description elements will provide
>   the data offset from the buffer beginning except the first mbuf -
>   for this one the offset is added to the RTE_PKTMBUF_HEADROOM to get
>   actual offset from the buffer beginning. If there is no enough
>   elements to describe the buffer for entire packet of maximal length
>   the offsets for the rest of segment will be supposed to be zero.
>
> - the data length being received to each segment is limited  by the
>   length specified in the segment description element. The data
>   receiving starts with filling up the first mbuf data buffer, if the
>   specified maximal segment length is reached and there are data
>   remaining (packet is longer than buffer in the first mbuf) the
>   following data will be pushed to the next segment up to its own
>   maximal length. If the first two segments is not enough to store
>   all the packet remaining data  the next (third) segment will
>   be engaged and so on. If the length in the segment description
>   element is zero the actual buffer size will be deduced from
>   the appropriate memory pool properties. If there is no enough
>   elements to describe the buffer for entire packet of maximal
>   length the buffer size will be deduced from the pool of the last
>   valid element for the remaining segments.
>
> For example, let's suppose we configured the Rx queue with the
> following segments:
>     seg0 - pool0, len0=14B, off0=2
>     seg1 - pool1, len1=20B, off1=128B
>     seg2 - pool2, len2=20B, off2=0B
>     seg3 - pool3, len3=512B, off3=0B


Sorry for chime in late. This API lookout looks good to me.
But, I am wondering how the application can know the capability or "limits" of
struct rte_eth_rxseg structure for the specific PMD. The other
descriptor limit, it's being exposed with struct
rte_eth_dev_info::rx_desc_lim;
If PMD can support a specific pattern rather than returning the
blanket error, the application should know the limit.
IMO, it is better to add
struct rte_eth_rxseg *rxsegs;
unint16_t nb_max_rxsegs
in rte_eth_dev_info structure to express the capablity.
Where the en and offset can define the max offset.

Thoughts?
  
Thomas Monjalon Oct. 14, 2020, 10:13 p.m. UTC | #2
14/10/2020 20:11, Viacheslav Ovsiienko:
> The DPDK datapath in the transmit direction is very flexible.
> An application can build the multi-segment packet and manages
> almost all data aspects - the memory pools where segments
> are allocated from, the segment lengths, the memory attributes
> like external buffers, registered for DMA, etc.
> 
> In the receiving direction, the datapath is much less flexible,
> an application can only specify the memory pool to configure the
> receiving queue and nothing more. In order to extend receiving
> datapath capabilities it is proposed to add the way to provide
> extended information how to split the packets being received.
> 
> The following structure is introduced to specify the Rx packet
> segment:
> 
> struct rte_eth_rxseg {
>     struct rte_mempool *mp; /* memory pools to allocate segment from */
>     uint16_t length; /* segment maximal data length,
> 		       	configures "split point" */
>     uint16_t offset; /* data offset from beginning
> 		       	of mbuf data buffer */
>     uint32_t reserved; /* reserved field */
> };
> 
> The segment descriptions are added to the rte_eth_rxconf structure:
>    rx_seg - pointer the array of segment descriptions, each element
>              describes the memory pool, maximal data length, initial
>              data offset from the beginning of data buffer in mbuf.
> 	     This array allows to specify the different settings for
> 	     each segment in individual fashion.
>    rx_nseg - number of elements in the array
> 
> If the extended segment descriptions is provided with these new
> fields the mp parameter of the rte_eth_rx_queue_setup must be
> specified as NULL to avoid ambiguity.
> 
> There are two options to specifiy Rx buffer configuration:
> - mp is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is zero,
>   it is compatible configuraion, follows existing implementation,
>   provides single pool and no description for segment sizes
>   and offsets.
> - mp is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not
>   zero, it provides the extended configuration, individually for
>   each segment.
> 
> The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
> capabilities is introduced to present the way for PMD to report to
> application about supporting Rx packet split to configurable
> segments. Prior invoking the rte_eth_rx_queue_setup() routine
> application should check RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.
> 
> If the Rx queue is configured with new settings the packets being
> received will be split into multiple segments pushed to the mbufs
> with specified attributes. The PMD will split the received packets
> into multiple segments according to the specification in the
> description array:
> 
> - the first network buffer will be allocated from the memory pool,
>   specified in the first segment description element, the second
>   network buffer - from the pool in the second segment description
>   element and so on. If there is no enough elements to describe
>   the buffer for entire packet of maximal length the pool from the
>   last valid element will be used to allocate the buffers from for the
>   rest of segments
> 
> - the offsets from the segment description elements will provide
>   the data offset from the buffer beginning except the first mbuf -
>   for this one the offset is added to the RTE_PKTMBUF_HEADROOM to get
>   actual offset from the buffer beginning. If there is no enough
>   elements to describe the buffer for entire packet of maximal length
>   the offsets for the rest of segment will be supposed to be zero.
> 
> - the data length being received to each segment is limited  by the
>   length specified in the segment description element. The data
>   receiving starts with filling up the first mbuf data buffer, if the
>   specified maximal segment length is reached and there are data
>   remaining (packet is longer than buffer in the first mbuf) the
>   following data will be pushed to the next segment up to its own
>   maximal length. If the first two segments is not enough to store
>   all the packet remaining data  the next (third) segment will
>   be engaged and so on. If the length in the segment description
>   element is zero the actual buffer size will be deduced from
>   the appropriate memory pool properties. If there is no enough
>   elements to describe the buffer for entire packet of maximal
>   length the buffer size will be deduced from the pool of the last
>   valid element for the remaining segments.
> 
> For example, let's suppose we configured the Rx queue with the
> following segments:
>     seg0 - pool0, len0=14B, off0=2
>     seg1 - pool1, len1=20B, off1=128B
>     seg2 - pool2, len2=20B, off2=0B
>     seg3 - pool3, len3=512B, off3=0B
> 
> The packet 46 bytes long will look like the following:
>     seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>     seg1 - 20B long @ 128 in mbuf from pool1
>     seg2 - 12B long @ 0 in mbuf from pool2
> 
> The packet 1500 bytes long will look like the following:
>     seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>     seg1 - 20B @ 128 in mbuf from pool1
>     seg2 - 20B @ 0 in mbuf from pool2
>     seg3 - 512B @ 0 in mbuf from pool3
>     seg4 - 512B @ 0 in mbuf from pool3
>     seg5 - 422B @ 0 in mbuf from pool3
> 
> The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
> configured to support new buffer split feature (if rx_nseg
> is greater than one).
> 
> The new approach would allow splitting the ingress packets into
> multiple parts pushed to the memory with different attributes.
> For example, the packet headers can be pushed to the embedded
> data buffers within mbufs and the application data into
> the external buffers attached to mbufs allocated from the
> different memory pools. The memory attributes for the split
> parts may differ either - for example the application data
> may be pushed into the external memory located on the dedicated
> physical device, say GPU or NVMe. This would improve the DPDK
> receiving datapath flexibility with preserving compatibility
> with existing API.
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

A large part of this commit log can be dropped because redundant
with the doxygen comments.

Acked-by: Thomas Monjalon <thomas@monjalon.net>
  
Ajit Khaparde Oct. 14, 2020, 10:50 p.m. UTC | #3
On Wed, Oct 14, 2020 at 11:13 AM Viacheslav Ovsiienko
<viacheslavo@nvidia.com> wrote:
>
> The DPDK datapath in the transmit direction is very flexible.
> An application can build the multi-segment packet and manages
> almost all data aspects - the memory pools where segments
> are allocated from, the segment lengths, the memory attributes
> like external buffers, registered for DMA, etc.
>
> In the receiving direction, the datapath is much less flexible,
> an application can only specify the memory pool to configure the
> receiving queue and nothing more. In order to extend receiving
> datapath capabilities it is proposed to add the way to provide
> extended information how to split the packets being received.
>
> The following structure is introduced to specify the Rx packet
> segment:
>
> struct rte_eth_rxseg {
>     struct rte_mempool *mp; /* memory pools to allocate segment from */
>     uint16_t length; /* segment maximal data length,
>                         configures "split point" */
>     uint16_t offset; /* data offset from beginning
>                         of mbuf data buffer */
>     uint32_t reserved; /* reserved field */
> };
>
> The segment descriptions are added to the rte_eth_rxconf structure:
>    rx_seg - pointer the array of segment descriptions, each element
>              describes the memory pool, maximal data length, initial
>              data offset from the beginning of data buffer in mbuf.
>              This array allows to specify the different settings for
>              each segment in individual fashion.
>    rx_nseg - number of elements in the array
>
> If the extended segment descriptions is provided with these new
> fields the mp parameter of the rte_eth_rx_queue_setup must be
> specified as NULL to avoid ambiguity.
>
> There are two options to specifiy Rx buffer configuration:
> - mp is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is zero,
>   it is compatible configuraion, follows existing implementation,
>   provides single pool and no description for segment sizes
>   and offsets.
> - mp is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not
>   zero, it provides the extended configuration, individually for
>   each segment.
>
> The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
> capabilities is introduced to present the way for PMD to report to
> application about supporting Rx packet split to configurable
> segments. Prior invoking the rte_eth_rx_queue_setup() routine
> application should check RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.
>
> If the Rx queue is configured with new settings the packets being
> received will be split into multiple segments pushed to the mbufs
> with specified attributes. The PMD will split the received packets
> into multiple segments according to the specification in the
> description array:
>
> - the first network buffer will be allocated from the memory pool,
>   specified in the first segment description element, the second
>   network buffer - from the pool in the second segment description
>   element and so on. If there is no enough elements to describe
>   the buffer for entire packet of maximal length the pool from the
>   last valid element will be used to allocate the buffers from for the
>   rest of segments
>
> - the offsets from the segment description elements will provide
>   the data offset from the buffer beginning except the first mbuf -
>   for this one the offset is added to the RTE_PKTMBUF_HEADROOM to get
>   actual offset from the buffer beginning. If there is no enough
>   elements to describe the buffer for entire packet of maximal length
>   the offsets for the rest of segment will be supposed to be zero.
>
> - the data length being received to each segment is limited  by the
>   length specified in the segment description element. The data
>   receiving starts with filling up the first mbuf data buffer, if the
>   specified maximal segment length is reached and there are data
>   remaining (packet is longer than buffer in the first mbuf) the
>   following data will be pushed to the next segment up to its own
>   maximal length. If the first two segments is not enough to store
>   all the packet remaining data  the next (third) segment will
>   be engaged and so on. If the length in the segment description
>   element is zero the actual buffer size will be deduced from
>   the appropriate memory pool properties. If there is no enough
>   elements to describe the buffer for entire packet of maximal
>   length the buffer size will be deduced from the pool of the last
>   valid element for the remaining segments.
>
> For example, let's suppose we configured the Rx queue with the
> following segments:
>     seg0 - pool0, len0=14B, off0=2
>     seg1 - pool1, len1=20B, off1=128B
>     seg2 - pool2, len2=20B, off2=0B
>     seg3 - pool3, len3=512B, off3=0B
>
> The packet 46 bytes long will look like the following:
>     seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>     seg1 - 20B long @ 128 in mbuf from pool1
>     seg2 - 12B long @ 0 in mbuf from pool2
>
> The packet 1500 bytes long will look like the following:
>     seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>     seg1 - 20B @ 128 in mbuf from pool1
>     seg2 - 20B @ 0 in mbuf from pool2
>     seg3 - 512B @ 0 in mbuf from pool3
>     seg4 - 512B @ 0 in mbuf from pool3
>     seg5 - 422B @ 0 in mbuf from pool3
>
> The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
> configured to support new buffer split feature (if rx_nseg
> is greater than one).
>
> The new approach would allow splitting the ingress packets into
> multiple parts pushed to the memory with different attributes.
> For example, the packet headers can be pushed to the embedded
> data buffers within mbufs and the application data into
> the external buffers attached to mbufs allocated from the
> different memory pools. The memory attributes for the split
> parts may differ either - for example the application data
> may be pushed into the external memory located on the dedicated
> physical device, say GPU or NVMe. This would improve the DPDK
> receiving datapath flexibility with preserving compatibility
> with existing API.
>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

> ---
>  doc/guides/nics/features.rst           |  15 +++++
>  doc/guides/rel_notes/deprecation.rst   |   5 --
>  doc/guides/rel_notes/release_20_11.rst |   9 +++
>  lib/librte_ethdev/rte_ethdev.c         | 111 +++++++++++++++++++++++++--------
>  lib/librte_ethdev/rte_ethdev.h         |  62 +++++++++++++++++-
>  5 files changed, 171 insertions(+), 31 deletions(-)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index dd8c955..832ea3b 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -185,6 +185,21 @@ Supports receiving segmented mbufs.
>  * **[related]    eth_dev_ops**: ``rx_pkt_burst``.
>
>
> +.. _nic_features_buffer_split:
> +
> +Buffer Split on Rx
> +------------------
> +
> +Scatters the packets being received on specified boundaries to segmented mbufs.
> +
> +* **[uses]       rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> +* **[uses]       rte_eth_rxconf**: ``rx_conf.rx_seg, rx_conf.rx_nseg``.
> +* **[implements] datapath**: ``Buffer Split functionality``.
> +* **[provides]   rte_eth_dev_info**: ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> +* **[provides]   eth_dev_ops**: ``rxq_info_get:buffer_split``.
> +* **[related] API**: ``rte_eth_rx_queue_setup()``.
> +
> +
>  .. _nic_features_lro:
>
>  LRO
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index 584e720..232cd54 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -138,11 +138,6 @@ Deprecation Notices
>    In 19.11 PMDs will still update the field even when the offload is not
>    enabled.
>
> -* ethdev: Add new fields to ``rte_eth_rxconf`` to configure the receiving
> -  queues to split ingress packets into multiple segments according to the
> -  specified lengths into the buffers allocated from the specified
> -  memory pools. The backward compatibility to existing API is preserved.
> -
>  * ethdev: ``rx_descriptor_done`` dev_ops and ``rte_eth_rx_descriptor_done``
>    will be removed in 21.11.
>    Existing ``rte_eth_rx_descriptor_status`` and ``rte_eth_tx_descriptor_status``
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index bcc0fc2..bcc2479 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -60,6 +60,12 @@ New Features
>    Added the FEC API which provides functions for query FEC capabilities and
>    current FEC mode from device. Also, API for configuring FEC mode is also provided.
>
> +* **Introduced extended buffer description for receiving.**
> +
> +  Added the extended Rx buffer description for Rx queue setup routine
> +  providing the individual settings for each Rx segment with maximal size,
> +  buffer offset and memory pool to allocate data buffers from.
> +
>  * **Updated Broadcom bnxt driver.**
>
>    Updated the Broadcom bnxt driver with new features and improvements, including:
> @@ -253,6 +259,9 @@ API Changes
>    As the data of ``uint8_t`` will be truncated when queue number under
>    a TC is greater than 256.
>
> +* ethdev: Added fields rx_seg and rx_nseg to rte_eth_rxconf structure
> +  to provide extended description of the receiving buffer.
> +
>  * vhost: Moved vDPA APIs from experimental to stable.
>
>  * rawdev: Added a structure size parameter to the functions
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 892c246..96ecb91 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -105,6 +105,9 @@ struct rte_eth_xstats_name_off {
>  #define RTE_RX_OFFLOAD_BIT2STR(_name)  \
>         { DEV_RX_OFFLOAD_##_name, #_name }
>
> +#define RTE_ETH_RX_OFFLOAD_BIT2STR(_name)      \
> +       { RTE_ETH_RX_OFFLOAD_##_name, #_name }
> +
>  static const struct {
>         uint64_t offload;
>         const char *name;
> @@ -128,9 +131,11 @@ struct rte_eth_xstats_name_off {
>         RTE_RX_OFFLOAD_BIT2STR(SCTP_CKSUM),
>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> +       RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
>  };
>
>  #undef RTE_RX_OFFLOAD_BIT2STR
> +#undef RTE_ETH_RX_OFFLOAD_BIT2STR
>
>  #define RTE_TX_OFFLOAD_BIT2STR(_name)  \
>         { DEV_TX_OFFLOAD_##_name, #_name }
> @@ -1784,38 +1789,94 @@ struct rte_eth_dev *
>                 return -EINVAL;
>         }
>
> -       if (mp == NULL) {
> -               RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> -               return -EINVAL;
> -       }
> -
>         RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup, -ENOTSUP);
>
> -       /*
> -        * Check the size of the mbuf data buffer.
> -        * This value must be provided in the private data of the memory pool.
> -        * First check that the memory pool has a valid private data.
> -        */
>         ret = rte_eth_dev_info_get(port_id, &dev_info);
>         if (ret != 0)
>                 return ret;
>
> -       if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
> -               RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> -                       mp->name, (int)mp->private_data_size,
> -                       (int)sizeof(struct rte_pktmbuf_pool_private));
> -               return -ENOSPC;
> -       }
> -       mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> +       if (mp) {
> +               /* Single pool configuration check. */
> +               if (rx_conf->rx_seg || rx_conf->rx_nseg) {
> +                       RTE_ETHDEV_LOG(ERR,
> +                                      "Ambiguous segment configuration\n");
> +                       return -EINVAL;
> +               }
> +               /*
> +                * Check the size of the mbuf data buffer, this value
> +                * must be provided in the private data of the memory pool.
> +                * First check that the memory pool(s) has a valid private data.
> +                */
> +               if (mp->private_data_size <
> +                               sizeof(struct rte_pktmbuf_pool_private)) {
> +                       RTE_ETHDEV_LOG(ERR, "%s private_data_size %u < %u\n",
> +                               mp->name, mp->private_data_size,
> +                               (unsigned int)
> +                               sizeof(struct rte_pktmbuf_pool_private));
> +                       return -ENOSPC;
> +               }
> +               mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> +               if (mbp_buf_size < dev_info.min_rx_bufsize +
> +                                  RTE_PKTMBUF_HEADROOM) {
> +                       RTE_ETHDEV_LOG(ERR,
> +                                      "%s mbuf_data_room_size %u < %u"
> +                                      " (RTE_PKTMBUF_HEADROOM=%u +"
> +                                      " min_rx_bufsize(dev)=%u)\n",
> +                                      mp->name, mbp_buf_size,
> +                                      RTE_PKTMBUF_HEADROOM +
> +                                      dev_info.min_rx_bufsize,
> +                                      RTE_PKTMBUF_HEADROOM,
> +                                      dev_info.min_rx_bufsize);
> +                       return -EINVAL;
> +               }
> +       } else {
> +               const struct rte_eth_rxseg *rx_seg = rx_conf->rx_seg;
> +               uint16_t n_seg = rx_conf->rx_nseg;
> +               uint16_t seg_idx;
>
> -       if (mbp_buf_size < dev_info.min_rx_bufsize + RTE_PKTMBUF_HEADROOM) {
> -               RTE_ETHDEV_LOG(ERR,
> -                       "%s mbuf_data_room_size %d < %d (RTE_PKTMBUF_HEADROOM=%d + min_rx_bufsize(dev)=%d)\n",
> -                       mp->name, (int)mbp_buf_size,
> -                       (int)(RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize),
> -                       (int)RTE_PKTMBUF_HEADROOM,
> -                       (int)dev_info.min_rx_bufsize);
> -               return -EINVAL;
> +               /* Extended multi-segment configuration check. */
> +               if (!rx_conf->rx_seg || !rx_conf->rx_nseg) {
> +                       RTE_ETHDEV_LOG(ERR,
> +                                      "Memory pool is null and no"
> +                                      " extended configuration provided\n");
> +                       return -EINVAL;
> +               }
> +               /*
> +                * Check the sizes and offsets against buffer sizes
> +                * for each segment specified in extended configuration.
> +                */
> +               for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
> +                       struct rte_mempool *mpl = rx_seg[seg_idx].mp;
> +                       uint32_t length = rx_seg[seg_idx].length;
> +                       uint32_t offset = rx_seg[seg_idx].offset;
> +                       uint32_t head_room = seg_idx ? 0 : RTE_PKTMBUF_HEADROOM;
> +
> +                       if (mpl == NULL) {
> +                               RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
> +                               return -EINVAL;
> +                       }
> +                       if (mpl->private_data_size <
> +                               sizeof(struct rte_pktmbuf_pool_private)) {
> +                               RTE_ETHDEV_LOG(ERR,
> +                                              "%s private_data_size %u < %u\n",
> +                                              mpl->name,
> +                                              mpl->private_data_size,
> +                                              (unsigned int)sizeof(struct
> +                                              rte_pktmbuf_pool_private));
> +                               return -ENOSPC;
> +                       }
> +                       mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> +                       length = length ? length : (mbp_buf_size - head_room);
> +                       if (mbp_buf_size < length + offset + head_room) {
> +                               RTE_ETHDEV_LOG(ERR,
> +                                       "%s mbuf_data_room_size %u < %u"
> +                                       " (segment length=%u +"
> +                                       " segment offset=%u)\n",
> +                                       mpl->name, mbp_buf_size,
> +                                       length + offset, length, offset);
> +                               return -EINVAL;
> +                       }
> +               }
>         }
>
>         /* Use default specified by driver, if nb_rx_desc is zero */
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 5bcfbb8..e019f4a 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -970,6 +970,16 @@ struct rte_eth_txmode {
>  };
>
>  /**
> + * A structure used to configure an RX packet segment to split.
> + */
> +struct rte_eth_rxseg {
> +       struct rte_mempool *mp; /**< Memory pool to allocate segment from. */
> +       uint16_t length; /**< Segment data length, configures split point. */
> +       uint16_t offset; /**< Data offset from beginning of mbuf data buffer. */
> +       uint32_t reserved; /**< Reserved field. */
> +};
> +
> +/**
>   * A structure used to configure an RX ring of an Ethernet port.
>   */
>  struct rte_eth_rxconf {
> @@ -977,6 +987,43 @@ struct rte_eth_rxconf {
>         uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> +       uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       /**
> +        * Points to the array of segment descriptions. Each array element
> +        * describes the properties for each segment in the receiving
> +        * buffer.
> +        *
> +        * If RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag is set in offloads field,
> +        * the PMD will split the received packets into multiple segments
> +        * according to the specification in the description array:
> +        *
> +        * - the first network buffer will be allocated from the memory pool,
> +        *   specified in the first array element, the second buffer, from the
> +        *   pool in the second element, and so on.
> +        *
> +        * - the offsets from the segment description elements specify
> +        *   the data offset from the buffer beginning except the first mbuf.
> +        *   For this one the offset is added with RTE_PKTMBUF_HEADROOM.
> +        *
> +        * - the lengthes in the elements define the maximal data amount
> +        *   being received to each segment. The receiving starts with filling
> +        *   up the first mbuf data buffer up to specified length. If the
> +        *   there are data remaining (packet is longer than buffer in the first
> +        *   mbuf) the following data will be pushed to the next segment
> +        *   up to its own length, and so on.
> +        *
> +        * - If the length in the segment description element is zero
> +        *   the actual buffer size will be deduced from the appropriate
> +        *   memory pool properties.
> +        *
> +        * - if there is not enough elements to describe the buffer for entire
> +        *   packet of maximal length the following parameters will be used
> +        *   for the all remaining segments:
> +        *     - pool from the last valid element
> +        *     - the buffer size from this pool
> +        *     - zero offset
> +        */
> +       struct rte_eth_rxseg *rx_seg;
>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1260,6 +1307,7 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_SCTP_CKSUM      0x00020000
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> +#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
>
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> @@ -2027,9 +2075,21 @@ int rte_eth_dev_configure(uint16_t port_id, uint16_t nb_rx_queue,
>   *   No need to repeat any bit in rx_conf->offloads which has already been
>   *   enabled in rte_eth_dev_configure() at port level. An offloading enabled
>   *   at port level can't be disabled at queue level.
> + *   The configuration structure also contains the pointer to the array
> + *   of the receiving buffer segment descriptions, see rx_seg and rx_nseg
> + *   fields, this extended configuration might be used by split offloads like
> + *   RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT. If mp_pool is not NULL,
> + *   the extended configuration fields must be set to NULL and zero.
>   * @param mb_pool
>   *   The pointer to the memory pool from which to allocate *rte_mbuf* network
> - *   memory buffers to populate each descriptor of the receive ring.
> + *   memory buffers to populate each descriptor of the receive ring. There are
> + *   two options to provide Rx buffer configuration:
> + *   - single pool:
> + *     mb_pool is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is 0.
> + *   - multiple segments description:
> + *     mb_pool is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not 0.
> + *     Taken only if flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is set in offloads.
> + *
>   * @return
>   *   - 0: Success, receive queue correctly set up.
>   *   - -EIO: if device is removed.
> --
> 1.8.3.1
>
  
Slava Ovsiienko Oct. 15, 2020, 7:43 a.m. UTC | #4
Hi, Jerin

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, October 14, 2020 21:57
> To: Slava Ovsiienko <viacheslavo@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
> <thomas@monjalon.net>; Stephen Hemminger
> <stephen@networkplumber.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> Olivier Matz <olivier.matz@6wind.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; David Marchand
> <david.marchand@redhat.com>; Andrew Rybchenko
> <arybchenko@solarflare.com>
> Subject: Re: [PATCH v6 1/6] ethdev: introduce Rx buffer split
> 
> On Wed, Oct 14, 2020 at 11:42 PM Viacheslav Ovsiienko
> <viacheslavo@nvidia.com> wrote:
> >
> > The DPDK datapath in the transmit direction is very flexible.
> > An application can build the multi-segment packet and manages almost
> > all data aspects - the memory pools where segments are allocated from,
> > the segment lengths, the memory attributes like external buffers,
> > registered for DMA, etc.
> >

[..snip..]

> > For example, let's suppose we configured the Rx queue with the
> > following segments:
> >     seg0 - pool0, len0=14B, off0=2
> >     seg1 - pool1, len1=20B, off1=128B
> >     seg2 - pool2, len2=20B, off2=0B
> >     seg3 - pool3, len3=512B, off3=0B
> 
> 
> Sorry for chime in late. This API lookout looks good to me.
> But, I am wondering how the application can know the capability or "limits" of
> struct rte_eth_rxseg structure for the specific PMD. The other descriptor limit,
> it's being exposed with struct rte_eth_dev_info::rx_desc_lim; If PMD can
> support a specific pattern rather than returning the blanket error, the
> application should know the limit.
> IMO, it is better to add
> struct rte_eth_rxseg *rxsegs;
> unint16_t nb_max_rxsegs
> in rte_eth_dev_info structure to express the capablity.
> Where the en and offset can define the max offset.
> 
> Thoughts?

Moreover, there might be implied a lot of various limitations - offsets might be not supported at all or
have some requirements for alignment, the similar requirements might be applied to segment size
(say, ask for some granularity). Currently it is not obvious how to report all nuances, and it is supposed
the limitations of this kind must be documented in PMD chapter. As for mlx5 - it has no special
limitations besides common requirements to the regular segments.

One more point - the split feature might be considered as just one of possible cases of using
these segment descriptions, other features might impose other (unknown for now) limitations.
If we see some of the features of such kind or other PMDs adopts the split feature - we'll try to find
the common root and consider the way how to report it.

With best regards, Slava
  
Jerin Jacob Oct. 15, 2020, 9:27 a.m. UTC | #5
On Thu, Oct 15, 2020 at 1:13 PM Slava Ovsiienko <viacheslavo@nvidia.com> wrote:
>
> Hi, Jerin

Hi Slava,

>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Wednesday, October 14, 2020 21:57
> > To: Slava Ovsiienko <viacheslavo@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
> > <thomas@monjalon.net>; Stephen Hemminger
> > <stephen@networkplumber.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > Olivier Matz <olivier.matz@6wind.com>; Maxime Coquelin
> > <maxime.coquelin@redhat.com>; David Marchand
> > <david.marchand@redhat.com>; Andrew Rybchenko
> > <arybchenko@solarflare.com>
> > Subject: Re: [PATCH v6 1/6] ethdev: introduce Rx buffer split
> >
> > On Wed, Oct 14, 2020 at 11:42 PM Viacheslav Ovsiienko
> > <viacheslavo@nvidia.com> wrote:
> > >
> > > The DPDK datapath in the transmit direction is very flexible.
> > > An application can build the multi-segment packet and manages almost
> > > all data aspects - the memory pools where segments are allocated from,
> > > the segment lengths, the memory attributes like external buffers,
> > > registered for DMA, etc.
> > >
>
> [..snip..]
>
> > > For example, let's suppose we configured the Rx queue with the
> > > following segments:
> > >     seg0 - pool0, len0=14B, off0=2
> > >     seg1 - pool1, len1=20B, off1=128B
> > >     seg2 - pool2, len2=20B, off2=0B
> > >     seg3 - pool3, len3=512B, off3=0B
> >
> >
> > Sorry for chime in late. This API lookout looks good to me.
> > But, I am wondering how the application can know the capability or "limits" of
> > struct rte_eth_rxseg structure for the specific PMD. The other descriptor limit,
> > it's being exposed with struct rte_eth_dev_info::rx_desc_lim; If PMD can
> > support a specific pattern rather than returning the blanket error, the
> > application should know the limit.
> > IMO, it is better to add
> > struct rte_eth_rxseg *rxsegs;
> > unint16_t nb_max_rxsegs
> > in rte_eth_dev_info structure to express the capablity.
> > Where the en and offset can define the max offset.
> >
> > Thoughts?
>
> Moreover, there might be implied a lot of various limitations - offsets might be not supported at all or
> have some requirements for alignment, the similar requirements might be applied to segment size
> (say, ask for some granularity). Currently it is not obvious how to report all nuances, and it is supposed
> the limitations of this kind must be documented in PMD chapter. As for mlx5 - it has no special
> limitations besides common requirements to the regular segments.

Reporting the limitation in the documentation will not help for the
generic applications.

>
> One more point - the split feature might be considered as just one of possible cases of using
> these segment descriptions, other features might impose other (unknown for now) limitations.
> If we see some of the features of such kind or other PMDs adopts the split feature - we'll try to find
> the common root and consider the way how to report it.

My only concern with that approach will be ABI break again if
something needs to exposed over rte_eth_dev_info().
IMO, if we featured needs to completed only when its capabilities are
exposed in a programmatic manner.
As of mlx5, if there not limitation then info
rte_eth_dev_info::rxsegs[x].len, offset etc as UINT16_MAX so
that application is aware of the state.

>
> With best regards, Slava
>
  
Andrew Rybchenko Oct. 15, 2020, 9:49 a.m. UTC | #6
On 10/15/20 10:43 AM, Slava Ovsiienko wrote:
> Hi, Jerin
> 
>> -----Original Message-----
>> From: Jerin Jacob <jerinjacobk@gmail.com>
>> Sent: Wednesday, October 14, 2020 21:57
>> To: Slava Ovsiienko <viacheslavo@nvidia.com>
>> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
>> <thomas@monjalon.net>; Stephen Hemminger
>> <stephen@networkplumber.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
>> Olivier Matz <olivier.matz@6wind.com>; Maxime Coquelin
>> <maxime.coquelin@redhat.com>; David Marchand
>> <david.marchand@redhat.com>; Andrew Rybchenko
>> <arybchenko@solarflare.com>
>> Subject: Re: [PATCH v6 1/6] ethdev: introduce Rx buffer split
>>
>> On Wed, Oct 14, 2020 at 11:42 PM Viacheslav Ovsiienko
>> <viacheslavo@nvidia.com> wrote:
>>>
>>> The DPDK datapath in the transmit direction is very flexible.
>>> An application can build the multi-segment packet and manages almost
>>> all data aspects - the memory pools where segments are allocated from,
>>> the segment lengths, the memory attributes like external buffers,
>>> registered for DMA, etc.
>>>
> 
> [..snip..]
> 
>>> For example, let's suppose we configured the Rx queue with the
>>> following segments:
>>>     seg0 - pool0, len0=14B, off0=2
>>>     seg1 - pool1, len1=20B, off1=128B
>>>     seg2 - pool2, len2=20B, off2=0B
>>>     seg3 - pool3, len3=512B, off3=0B
>>
>>
>> Sorry for chime in late. This API lookout looks good to me.
>> But, I am wondering how the application can know the capability or "limits" of
>> struct rte_eth_rxseg structure for the specific PMD. The other descriptor limit,
>> it's being exposed with struct rte_eth_dev_info::rx_desc_lim; If PMD can
>> support a specific pattern rather than returning the blanket error, the
>> application should know the limit.
>> IMO, it is better to add
>> struct rte_eth_rxseg *rxsegs;
>> unint16_t nb_max_rxsegs
>> in rte_eth_dev_info structure to express the capablity.
>> Where the en and offset can define the max offset.
>>
>> Thoughts?
> 
> Moreover, there might be implied a lot of various limitations - offsets might be not supported at all or
> have some requirements for alignment, the similar requirements might be applied to segment size
> (say, ask for some granularity). Currently it is not obvious how to report all nuances, and it is supposed
> the limitations of this kind must be documented in PMD chapter. As for mlx5 - it has no special
> limitations besides common requirements to the regular segments.
> 
> One more point - the split feature might be considered as just one of possible cases of using
> these segment descriptions, other features might impose other (unknown for now) limitations.
> If we see some of the features of such kind or other PMDs adopts the split feature - we'll try to find
> the common root and consider the way how to report it.

At least there are few simple limitations which are easy to
express:
 1. Maximum number of segments
 2. Possibility to use the last segment many times if required
    (I was suggesting to use scatter for it, but you rejected
     the idea - may be time to reconsider :) )
 3. Maximum offset
    Frankly speaking I'm not sure why it cannot be handled on
    PMD level (i.e. provide descriptors with offset taken into
    account or guarantee that HW mempool objects initialized
    correctly with required headroom). May be in some corner
    cases when the same HW mempool is shared by various
    segments with different offset requirements.
 4. Offset alignment
 5. Maximum/minimum length of a segment
 6. Length alignment

I realize that 3, 4 and 5 could be per segment number.
If it is really that complex, report common denominator
which is guaranteed to work. If we have no checks on ethdev
layer, application can ignore it if it knows better.
  
Andrew Rybchenko Oct. 15, 2020, 10:11 a.m. UTC | #7
On 10/14/20 9:11 PM, Viacheslav Ovsiienko wrote:
> The DPDK datapath in the transmit direction is very flexible.
> An application can build the multi-segment packet and manages
> almost all data aspects - the memory pools where segments
> are allocated from, the segment lengths, the memory attributes
> like external buffers, registered for DMA, etc.
> 
> In the receiving direction, the datapath is much less flexible,
> an application can only specify the memory pool to configure the
> receiving queue and nothing more. In order to extend receiving
> datapath capabilities it is proposed to add the way to provide
> extended information how to split the packets being received.
> 
> The following structure is introduced to specify the Rx packet
> segment:
> 
> struct rte_eth_rxseg {
>     struct rte_mempool *mp; /* memory pools to allocate segment from */
>     uint16_t length; /* segment maximal data length,
> 		       	configures "split point" */
>     uint16_t offset; /* data offset from beginning
> 		       	of mbuf data buffer */
>     uint32_t reserved; /* reserved field */
> };
> 
> The segment descriptions are added to the rte_eth_rxconf structure:
>    rx_seg - pointer the array of segment descriptions, each element
>              describes the memory pool, maximal data length, initial
>              data offset from the beginning of data buffer in mbuf.
> 	     This array allows to specify the different settings for
> 	     each segment in individual fashion.
>    rx_nseg - number of elements in the array
> 
> If the extended segment descriptions is provided with these new
> fields the mp parameter of the rte_eth_rx_queue_setup must be
> specified as NULL to avoid ambiguity.
> 
> There are two options to specifiy Rx buffer configuration:
> - mp is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is zero,
>   it is compatible configuraion, follows existing implementation,
>   provides single pool and no description for segment sizes
>   and offsets.
> - mp is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not
>   zero, it provides the extended configuration, individually for
>   each segment.
> 
> The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
> capabilities is introduced to present the way for PMD to report to
> application about supporting Rx packet split to configurable
> segments. Prior invoking the rte_eth_rx_queue_setup() routine
> application should check RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.
> 
> If the Rx queue is configured with new settings the packets being
> received will be split into multiple segments pushed to the mbufs
> with specified attributes. The PMD will split the received packets
> into multiple segments according to the specification in the
> description array:
> 
> - the first network buffer will be allocated from the memory pool,
>   specified in the first segment description element, the second
>   network buffer - from the pool in the second segment description
>   element and so on. If there is no enough elements to describe
>   the buffer for entire packet of maximal length the pool from the
>   last valid element will be used to allocate the buffers from for the
>   rest of segments
> 
> - the offsets from the segment description elements will provide
>   the data offset from the buffer beginning except the first mbuf -
>   for this one the offset is added to the RTE_PKTMBUF_HEADROOM to get
>   actual offset from the buffer beginning. If there is no enough
>   elements to describe the buffer for entire packet of maximal length
>   the offsets for the rest of segment will be supposed to be zero.
> 
> - the data length being received to each segment is limited  by the
>   length specified in the segment description element. The data
>   receiving starts with filling up the first mbuf data buffer, if the
>   specified maximal segment length is reached and there are data
>   remaining (packet is longer than buffer in the first mbuf) the
>   following data will be pushed to the next segment up to its own
>   maximal length. If the first two segments is not enough to store
>   all the packet remaining data  the next (third) segment will
>   be engaged and so on. If the length in the segment description
>   element is zero the actual buffer size will be deduced from
>   the appropriate memory pool properties. If there is no enough
>   elements to describe the buffer for entire packet of maximal
>   length the buffer size will be deduced from the pool of the last
>   valid element for the remaining segments.
> 
> For example, let's suppose we configured the Rx queue with the
> following segments:
>     seg0 - pool0, len0=14B, off0=2
>     seg1 - pool1, len1=20B, off1=128B
>     seg2 - pool2, len2=20B, off2=0B
>     seg3 - pool3, len3=512B, off3=0B
> 
> The packet 46 bytes long will look like the following:
>     seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>     seg1 - 20B long @ 128 in mbuf from pool1
>     seg2 - 12B long @ 0 in mbuf from pool2
> 
> The packet 1500 bytes long will look like the following:
>     seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>     seg1 - 20B @ 128 in mbuf from pool1
>     seg2 - 20B @ 0 in mbuf from pool2
>     seg3 - 512B @ 0 in mbuf from pool3
>     seg4 - 512B @ 0 in mbuf from pool3
>     seg5 - 422B @ 0 in mbuf from pool3
> 
> The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
> configured to support new buffer split feature (if rx_nseg
> is greater than one).
> 
> The new approach would allow splitting the ingress packets into
> multiple parts pushed to the memory with different attributes.
> For example, the packet headers can be pushed to the embedded
> data buffers within mbufs and the application data into
> the external buffers attached to mbufs allocated from the
> different memory pools. The memory attributes for the split
> parts may differ either - for example the application data
> may be pushed into the external memory located on the dedicated
> physical device, say GPU or NVMe. This would improve the DPDK
> receiving datapath flexibility with preserving compatibility
> with existing API.
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> ---
>  doc/guides/nics/features.rst           |  15 +++++
>  doc/guides/rel_notes/deprecation.rst   |   5 --
>  doc/guides/rel_notes/release_20_11.rst |   9 +++
>  lib/librte_ethdev/rte_ethdev.c         | 111 +++++++++++++++++++++++++--------
>  lib/librte_ethdev/rte_ethdev.h         |  62 +++++++++++++++++-
>  5 files changed, 171 insertions(+), 31 deletions(-)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index dd8c955..832ea3b 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -185,6 +185,21 @@ Supports receiving segmented mbufs.
>  * **[related]    eth_dev_ops**: ``rx_pkt_burst``.
>  
>  
> +.. _nic_features_buffer_split:
> +
> +Buffer Split on Rx
> +------------------
> +
> +Scatters the packets being received on specified boundaries to segmented mbufs.
> +
> +* **[uses]       rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> +* **[uses]       rte_eth_rxconf**: ``rx_conf.rx_seg, rx_conf.rx_nseg``.
> +* **[implements] datapath**: ``Buffer Split functionality``.
> +* **[provides]   rte_eth_dev_info**: ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> +* **[provides]   eth_dev_ops**: ``rxq_info_get:buffer_split``.
> +* **[related] API**: ``rte_eth_rx_queue_setup()``.
> +
> +
>  .. _nic_features_lro:
>  
>  LRO
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index 584e720..232cd54 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -138,11 +138,6 @@ Deprecation Notices
>    In 19.11 PMDs will still update the field even when the offload is not
>    enabled.
>  
> -* ethdev: Add new fields to ``rte_eth_rxconf`` to configure the receiving
> -  queues to split ingress packets into multiple segments according to the
> -  specified lengths into the buffers allocated from the specified
> -  memory pools. The backward compatibility to existing API is preserved.
> -
>  * ethdev: ``rx_descriptor_done`` dev_ops and ``rte_eth_rx_descriptor_done``
>    will be removed in 21.11.
>    Existing ``rte_eth_rx_descriptor_status`` and ``rte_eth_tx_descriptor_status``
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index bcc0fc2..bcc2479 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -60,6 +60,12 @@ New Features
>    Added the FEC API which provides functions for query FEC capabilities and
>    current FEC mode from device. Also, API for configuring FEC mode is also provided.
>  
> +* **Introduced extended buffer description for receiving.**
> +
> +  Added the extended Rx buffer description for Rx queue setup routine
> +  providing the individual settings for each Rx segment with maximal size,
> +  buffer offset and memory pool to allocate data buffers from.
> +
>  * **Updated Broadcom bnxt driver.**
>  
>    Updated the Broadcom bnxt driver with new features and improvements, including:
> @@ -253,6 +259,9 @@ API Changes
>    As the data of ``uint8_t`` will be truncated when queue number under
>    a TC is greater than 256.
>  
> +* ethdev: Added fields rx_seg and rx_nseg to rte_eth_rxconf structure
> +  to provide extended description of the receiving buffer.
> +
>  * vhost: Moved vDPA APIs from experimental to stable.
>  
>  * rawdev: Added a structure size parameter to the functions
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 892c246..96ecb91 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -105,6 +105,9 @@ struct rte_eth_xstats_name_off {
>  #define RTE_RX_OFFLOAD_BIT2STR(_name)	\
>  	{ DEV_RX_OFFLOAD_##_name, #_name }
>  
> +#define RTE_ETH_RX_OFFLOAD_BIT2STR(_name)	\
> +	{ RTE_ETH_RX_OFFLOAD_##_name, #_name }
> +
>  static const struct {
>  	uint64_t offload;
>  	const char *name;
> @@ -128,9 +131,11 @@ struct rte_eth_xstats_name_off {
>  	RTE_RX_OFFLOAD_BIT2STR(SCTP_CKSUM),
>  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> +	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
>  };
>  
>  #undef RTE_RX_OFFLOAD_BIT2STR
> +#undef RTE_ETH_RX_OFFLOAD_BIT2STR
>  
>  #define RTE_TX_OFFLOAD_BIT2STR(_name)	\
>  	{ DEV_TX_OFFLOAD_##_name, #_name }
> @@ -1784,38 +1789,94 @@ struct rte_eth_dev *
>  		return -EINVAL;
>  	}
>  
> -	if (mp == NULL) {
> -		RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> -		return -EINVAL;
> -	}
> -
>  	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup, -ENOTSUP);
>  
> -	/*
> -	 * Check the size of the mbuf data buffer.
> -	 * This value must be provided in the private data of the memory pool.
> -	 * First check that the memory pool has a valid private data.
> -	 */
>  	ret = rte_eth_dev_info_get(port_id, &dev_info);
>  	if (ret != 0)
>  		return ret;
>  
> -	if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
> -		RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> -			mp->name, (int)mp->private_data_size,
> -			(int)sizeof(struct rte_pktmbuf_pool_private));
> -		return -ENOSPC;
> -	}
> -	mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> +	if (mp) {

Please, compare vs NULL [1]

[1] https://doc.dpdk.org/guides/contributing/coding_style.html#null-pointers

> +		/* Single pool configuration check. */
> +		if (rx_conf->rx_seg || rx_conf->rx_nseg) {

Please, compare vs NULL and 0. IMHO, rx_nsegs check is sufficient. If it
is 0, nobody cares what is in rx_seg.

> +			RTE_ETHDEV_LOG(ERR,
> +				       "Ambiguous segment configuration\n");
> +			return -EINVAL;
> +		}
> +		/*
> +		 * Check the size of the mbuf data buffer, this value
> +		 * must be provided in the private data of the memory pool.
> +		 * First check that the memory pool(s) has a valid private data.
> +		 */
> +		if (mp->private_data_size <
> +				sizeof(struct rte_pktmbuf_pool_private)) {
> +			RTE_ETHDEV_LOG(ERR, "%s private_data_size %u < %u\n",
> +				mp->name, mp->private_data_size,
> +				(unsigned int)
> +				sizeof(struct rte_pktmbuf_pool_private));
> +			return -ENOSPC;
> +		}
> +		mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> +		if (mbp_buf_size < dev_info.min_rx_bufsize +
> +				   RTE_PKTMBUF_HEADROOM) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "%s mbuf_data_room_size %u < %u"
> +				       " (RTE_PKTMBUF_HEADROOM=%u +"
> +				       " min_rx_bufsize(dev)=%u)\n",

Do not split format string. It is not a problem that it is long.

> +				       mp->name, mbp_buf_size,
> +				       RTE_PKTMBUF_HEADROOM +
> +				       dev_info.min_rx_bufsize,
> +				       RTE_PKTMBUF_HEADROOM,
> +				       dev_info.min_rx_bufsize);
> +			return -EINVAL;
> +		}
> +	} else {
> +		const struct rte_eth_rxseg *rx_seg = rx_conf->rx_seg;
> +		uint16_t n_seg = rx_conf->rx_nseg;
> +		uint16_t seg_idx;
>  
> -	if (mbp_buf_size < dev_info.min_rx_bufsize + RTE_PKTMBUF_HEADROOM) {
> -		RTE_ETHDEV_LOG(ERR,
> -			"%s mbuf_data_room_size %d < %d (RTE_PKTMBUF_HEADROOM=%d + min_rx_bufsize(dev)=%d)\n",
> -			mp->name, (int)mbp_buf_size,
> -			(int)(RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize),
> -			(int)RTE_PKTMBUF_HEADROOM,
> -			(int)dev_info.min_rx_bufsize);
> -		return -EINVAL;
> +		/* Extended multi-segment configuration check. */
> +		if (!rx_conf->rx_seg || !rx_conf->rx_nseg) {

Please, compare vs NULL and 0

> +			RTE_ETHDEV_LOG(ERR,
> +				       "Memory pool is null and no"
> +				       " extended configuration provided\n");

Do not split format string. It is not a problem that it is long.


> +			return -EINVAL;
> +		}
> +		/*
> +		 * Check the sizes and offsets against buffer sizes
> +		 * for each segment specified in extended configuration.
> +		 */
> +		for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
> +			struct rte_mempool *mpl = rx_seg[seg_idx].mp;
> +			uint32_t length = rx_seg[seg_idx].length;
> +			uint32_t offset = rx_seg[seg_idx].offset;
> +			uint32_t head_room = seg_idx ? 0 : RTE_PKTMBUF_HEADROOM;

Why? Shouldn't it be in offset? IMHO too many offsets this way.

> +
> +			if (mpl == NULL) {
> +				RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
> +				return -EINVAL;
> +			}
> +			if (mpl->private_data_size <
> +				sizeof(struct rte_pktmbuf_pool_private)) {
> +				RTE_ETHDEV_LOG(ERR,
> +					       "%s private_data_size %u < %u\n",
> +					       mpl->name,
> +					       mpl->private_data_size,
> +					       (unsigned int)sizeof(struct
> +					       rte_pktmbuf_pool_private));
> +				return -ENOSPC;
> +			}
> +			mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> +			length = length ? length : (mbp_buf_size - head_room);

Compare length with 0.
What does ensure that mbp_buf_size is greater or equal to head_room size?

> +			if (mbp_buf_size < length + offset + head_room) {
> +				RTE_ETHDEV_LOG(ERR,
> +					"%s mbuf_data_room_size %u < %u"
> +					" (segment length=%u +"
> +					" segment offset=%u)\n",

Do not split format string. It is not a problem that it is long.


> +					mpl->name, mbp_buf_size,
> +					length + offset, length, offset);
> +				return -EINVAL;
> +			}
> +		}
>  	}
>  
>  	/* Use default specified by driver, if nb_rx_desc is zero */
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 5bcfbb8..e019f4a 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -970,6 +970,16 @@ struct rte_eth_txmode {
>  };
>  
>  /**
> + * A structure used to configure an RX packet segment to split.

RX -> Rx

> + */
> +struct rte_eth_rxseg {
> +	struct rte_mempool *mp; /**< Memory pool to allocate segment from. */
> +	uint16_t length; /**< Segment data length, configures split point. */
> +	uint16_t offset; /**< Data offset from beginning of mbuf data buffer. */
> +	uint32_t reserved; /**< Reserved field. */
> +};
> +
> +/**
>   * A structure used to configure an RX ring of an Ethernet port.
>   */
>  struct rte_eth_rxconf {
> @@ -977,6 +987,43 @@ struct rte_eth_rxconf {
>  	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
>  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> +	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	/**
> +	 * Points to the array of segment descriptions. Each array element
> +	 * describes the properties for each segment in the receiving
> +	 * buffer.
> +	 *
> +	 * If RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag is set in offloads field,
> +	 * the PMD will split the received packets into multiple segments
> +	 * according to the specification in the description array:
> +	 *
> +	 * - the first network buffer will be allocated from the memory pool,
> +	 *   specified in the first array element, the second buffer, from the
> +	 *   pool in the second element, and so on.
> +	 *
> +	 * - the offsets from the segment description elements specify
> +	 *   the data offset from the buffer beginning except the first mbuf.
> +	 *   For this one the offset is added with RTE_PKTMBUF_HEADROOM.
> +	 *
> +	 * - the lengthes in the elements define the maximal data amount
> +	 *   being received to each segment. The receiving starts with filling
> +	 *   up the first mbuf data buffer up to specified length. If the
> +	 *   there are data remaining (packet is longer than buffer in the first
> +	 *   mbuf) the following data will be pushed to the next segment
> +	 *   up to its own length, and so on.
> +	 *
> +	 * - If the length in the segment description element is zero
> +	 *   the actual buffer size will be deduced from the appropriate
> +	 *   memory pool properties.
> +	 *
> +	 * - if there is not enough elements to describe the buffer for entire
> +	 *   packet of maximal length the following parameters will be used
> +	 *   for the all remaining segments:
> +	 *     - pool from the last valid element
> +	 *     - the buffer size from this pool
> +	 *     - zero offset
> +	 */
> +	struct rte_eth_rxseg *rx_seg;
>  	/**
>  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1260,6 +1307,7 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_SCTP_CKSUM	0x00020000
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
> +#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
>  
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> @@ -2027,9 +2075,21 @@ int rte_eth_dev_configure(uint16_t port_id, uint16_t nb_rx_queue,
>   *   No need to repeat any bit in rx_conf->offloads which has already been
>   *   enabled in rte_eth_dev_configure() at port level. An offloading enabled
>   *   at port level can't be disabled at queue level.
> + *   The configuration structure also contains the pointer to the array
> + *   of the receiving buffer segment descriptions, see rx_seg and rx_nseg
> + *   fields, this extended configuration might be used by split offloads like
> + *   RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT. If mp_pool is not NULL,
> + *   the extended configuration fields must be set to NULL and zero.
>   * @param mb_pool
>   *   The pointer to the memory pool from which to allocate *rte_mbuf* network
> - *   memory buffers to populate each descriptor of the receive ring.
> + *   memory buffers to populate each descriptor of the receive ring. There are
> + *   two options to provide Rx buffer configuration:
> + *   - single pool:
> + *     mb_pool is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is 0.
> + *   - multiple segments description:
> + *     mb_pool is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not 0.
> + *     Taken only if flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is set in offloads.
> + *
>   * @return
>   *   - 0: Success, receive queue correctly set up.
>   *   - -EIO: if device is removed.
>
  
Thomas Monjalon Oct. 15, 2020, 10:19 a.m. UTC | #8
15/10/2020 12:11, Andrew Rybchenko:
> On 10/14/20 9:11 PM, Viacheslav Ovsiienko wrote:
> > +		/* Single pool configuration check. */
> > +		if (rx_conf->rx_seg || rx_conf->rx_nseg) {
> 
> Please, compare vs NULL and 0. IMHO, rx_nsegs check is sufficient. If it
> is 0, nobody cares what is in rx_seg.

Yes the pointer should not be a criteria.
Having more than zero items is enough to check.

[...]
> > +			RTE_ETHDEV_LOG(ERR,
> > +				       "%s mbuf_data_room_size %u < %u"
> > +				       " (RTE_PKTMBUF_HEADROOM=%u +"
> > +				       " min_rx_bufsize(dev)=%u)\n",
> 
> Do not split format string. It is not a problem that it is long.

The benefit of keeping format string on the same line is for "grepping"
the source code. But after a format specifier, I think we can split.
Who is grepping "< %u (RTE_PKTMBUF_HEADROOM" ?
I would just change the split on the second line after the %u.
  
Jerin Jacob Oct. 15, 2020, 10:27 a.m. UTC | #9
On Thu, Oct 15, 2020 at 2:57 PM Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> On Thu, Oct 15, 2020 at 1:13 PM Slava Ovsiienko <viacheslavo@nvidia.com> wrote:
> >
> > Hi, Jerin
>
> Hi Slava,
>
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Wednesday, October 14, 2020 21:57
> > > To: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
> > > <thomas@monjalon.net>; Stephen Hemminger
> > > <stephen@networkplumber.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > Olivier Matz <olivier.matz@6wind.com>; Maxime Coquelin
> > > <maxime.coquelin@redhat.com>; David Marchand
> > > <david.marchand@redhat.com>; Andrew Rybchenko
> > > <arybchenko@solarflare.com>
> > > Subject: Re: [PATCH v6 1/6] ethdev: introduce Rx buffer split
> > >
> > > On Wed, Oct 14, 2020 at 11:42 PM Viacheslav Ovsiienko
> > > <viacheslavo@nvidia.com> wrote:
> > > >
> > > > The DPDK datapath in the transmit direction is very flexible.
> > > > An application can build the multi-segment packet and manages almost
> > > > all data aspects - the memory pools where segments are allocated from,
> > > > the segment lengths, the memory attributes like external buffers,
> > > > registered for DMA, etc.
> > > >
> >
> > [..snip..]
> >
> > > > For example, let's suppose we configured the Rx queue with the
> > > > following segments:
> > > >     seg0 - pool0, len0=14B, off0=2
> > > >     seg1 - pool1, len1=20B, off1=128B
> > > >     seg2 - pool2, len2=20B, off2=0B
> > > >     seg3 - pool3, len3=512B, off3=0B
> > >
> > >
> > > Sorry for chime in late. This API lookout looks good to me.
> > > But, I am wondering how the application can know the capability or "limits" of
> > > struct rte_eth_rxseg structure for the specific PMD. The other descriptor limit,
> > > it's being exposed with struct rte_eth_dev_info::rx_desc_lim; If PMD can
> > > support a specific pattern rather than returning the blanket error, the
> > > application should know the limit.
> > > IMO, it is better to add
> > > struct rte_eth_rxseg *rxsegs;
> > > unint16_t nb_max_rxsegs
> > > in rte_eth_dev_info structure to express the capablity.
> > > Where the en and offset can define the max offset.
> > >
> > > Thoughts?
> >
> > Moreover, there might be implied a lot of various limitations - offsets might be not supported at all or
> > have some requirements for alignment, the similar requirements might be applied to segment size
> > (say, ask for some granularity). Currently it is not obvious how to report all nuances, and it is supposed
> > the limitations of this kind must be documented in PMD chapter. As for mlx5 - it has no special
> > limitations besides common requirements to the regular segments.
>
> Reporting the limitation in the documentation will not help for the
> generic applications.
>
> >
> > One more point - the split feature might be considered as just one of possible cases of using
> > these segment descriptions, other features might impose other (unknown for now) limitations.

Also , I agree that w will have multiple use cases with segment descriptors.
In order to make it future proof on the API definion is better to have
from:
struct rte_eth_rxseg {
   struct rte_mempool *mp; /**< Memory pool to allocate segment from. */
  uint16_t length; /**< Segment data length, configures split point. */
  uint16_t offset; /**< Data offset from beginning of mbuf data buffer. */
  uint32_t reserved; /**< Reserved field. */
};
something lime below:

struct rte_eth_rxseg {
    enum rte_eth_rxseg_mode mode ;
    union {
               struct rte_eth_rxseg_mode xxx {
                              struct rte_mempool *mp; /**< Memory pool
to allocate segment from. */
                              uint16_t length; /**< Segment data
length, configures split point. */
                               uint16_t offset; /**< Data offset from
beginning of mbuf data buffer. */
                               uint32_t reserved; /**< Reserved field. */
             }
}

Another mode, Marvell PMD has it(I believe Intel also) i.e
When we say:

seg0 - pool0, len0=2000B, off0=0
seg1 - pool1, len1=2001B, off1=0

packet size up to, 2000B goes to pool 0 and if is >=2001 goes to pool1.
I think, it is better to have mode param in rte_eth_rxseg for avoiding
ABI changes.(Just  like clean rte_flow APIs)

> > If we see some of the features of such kind or other PMDs adopts the split feature - we'll try to find
> > the common root and consider the way how to report it.
>
> My only concern with that approach will be ABI break again if
> something needs to exposed over rte_eth_dev_info().
> IMO, if we featured needs to completed only when its capabilities are
> exposed in a programmatic manner.
> As of mlx5, if there not limitation then info
> rte_eth_dev_info::rxsegs[x].len, offset etc as UINT16_MAX so
> that application is aware of the state.
>
> >
> > With best regards, Slava
> >
  
Slava Ovsiienko Oct. 15, 2020, 10:34 a.m. UTC | #10
Hi, Andrew

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Thursday, October 15, 2020 12:49
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; Jerin Jacob
> <jerinjacobk@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
> <thomas@monjalon.net>; Stephen Hemminger
> <stephen@networkplumber.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> Olivier Matz <olivier.matz@6wind.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; David Marchand
> <david.marchand@redhat.com>; Andrew Rybchenko
> <arybchenko@solarflare.com>
> Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: introduce Rx buffer split
> 
> On 10/15/20 10:43 AM, Slava Ovsiienko wrote:
> > Hi, Jerin
> >
> >> -----Original Message-----
> >> From: Jerin Jacob <jerinjacobk@gmail.com>
> >> Sent: Wednesday, October 14, 2020 21:57
> >> To: Slava Ovsiienko <viacheslavo@nvidia.com>
> >> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
> >> <thomas@monjalon.net>; Stephen Hemminger
> >> <stephen@networkplumber.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> >> Olivier Matz <olivier.matz@6wind.com>; Maxime Coquelin
> >> <maxime.coquelin@redhat.com>; David Marchand
> >> <david.marchand@redhat.com>; Andrew Rybchenko
> >> <arybchenko@solarflare.com>
> >> Subject: Re: [PATCH v6 1/6] ethdev: introduce Rx buffer split
> >>
> >> On Wed, Oct 14, 2020 at 11:42 PM Viacheslav Ovsiienko
> >> <viacheslavo@nvidia.com> wrote:
> >>>
> >>> The DPDK datapath in the transmit direction is very flexible.
> >>> An application can build the multi-segment packet and manages almost
> >>> all data aspects - the memory pools where segments are allocated
> >>> from, the segment lengths, the memory attributes like external
> >>> buffers, registered for DMA, etc.
> >>>
> >
> > [..snip..]
> >
> >>> For example, let's suppose we configured the Rx queue with the
> >>> following segments:
> >>>     seg0 - pool0, len0=14B, off0=2
> >>>     seg1 - pool1, len1=20B, off1=128B
> >>>     seg2 - pool2, len2=20B, off2=0B
> >>>     seg3 - pool3, len3=512B, off3=0B
> >>
> >>
> >> Sorry for chime in late. This API lookout looks good to me.
> >> But, I am wondering how the application can know the capability or
> >> "limits" of struct rte_eth_rxseg structure for the specific PMD. The
> >> other descriptor limit, it's being exposed with struct
> >> rte_eth_dev_info::rx_desc_lim; If PMD can support a specific pattern
> >> rather than returning the blanket error, the application should know the
> limit.
> >> IMO, it is better to add
> >> struct rte_eth_rxseg *rxsegs;
> >> unint16_t nb_max_rxsegs
> >> in rte_eth_dev_info structure to express the capablity.
> >> Where the en and offset can define the max offset.
> >>
> >> Thoughts?
> >
> > Moreover, there might be implied a lot of various limitations -
> > offsets might be not supported at all or have some requirements for
> > alignment, the similar requirements might be applied to segment size
> > (say, ask for some granularity). Currently it is not obvious how to
> > report all nuances, and it is supposed the limitations of this kind must be
> documented in PMD chapter. As for mlx5 - it has no special limitations besides
> common requirements to the regular segments.
> >
> > One more point - the split feature might be considered as just one of
> > possible cases of using these segment descriptions, other features might
> impose other (unknown for now) limitations.
> > If we see some of the features of such kind or other PMDs adopts the
> > split feature - we'll try to find the common root and consider the way how to
> report it.
> 
> At least there are few simple limitations which are easy to
> express:
>  1. Maximum number of segments
We have scatter capability and we do not report the maximal number of segments,
it is on PMD own. We could add the field to the rte_eth_dev_info, but not sure
whether we have something special to report there even for mlx5 case.


>  2. Possibility to use the last segment many times if required
>     (I was suggesting to use scatter for it, but you rejected
>      the idea - may be time to reconsider :) ) 

Mmm, sorry I do not follow, it might be I did not understand/missed your idea.
Some of the last segment attributes are used multiple times to scatter the rest
of the data in fashion very close to the existing scattering approach - at least,
pool and buffer size from this pool are used. The beginning of the packet
scattered according to the new descriptions, the rest of the packet -
according to the existing regular scattering with pool settings from
the last segment description.

 3. Maximum offset
>     Frankly speaking I'm not sure why it cannot be handled on
>     PMD level (i.e. provide descriptors with offset taken into
>     account or guarantee that HW mempool objects initialized
>     correctly with required headroom). May be in some corner
>     cases when the same HW mempool is shared by various
>     segments with different offset requirements.

HW offsets are beyond the feature scope, the offsets in the segment
description is supposed to be added to the native pool offsets (if any).

>  4. Offset alignment
>  5. Maximum/minimum length of a segment
>  6. Length alignment
In which form? Mask of lsbs ? 0 means no limitations ?

> 
> I realize that 3, 4 and 5 could be per segment number.
> If it is really that complex, report common denominator which is guaranteed to
> work. If we have no checks on ethdev layer, application can ignore it if it knows
> better.

Currently it is not clear at all what kind of limitations should be reported,
we could include all of mentioned/proposed ones, and no one will report there -
mlx5 has no any reasonable limitations to report for now.

Should we reserve some pointer field in the rte_eth_dev_info to report
the limitations? (Limitation description should contain variable size array,
depending on the number of segments, so pointer seems to be appropriate).
It would allow us to avoid ABI break, and present the limitation structure once it is defined.

With best regards, Slava
  
Slava Ovsiienko Oct. 15, 2020, 10:51 a.m. UTC | #11
Hi, Jerin

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, October 15, 2020 13:28
> To: Slava Ovsiienko <viacheslavo@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
> <thomas@monjalon.net>; Stephen Hemminger
> <stephen@networkplumber.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> Olivier Matz <olivier.matz@6wind.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; David Marchand
> <david.marchand@redhat.com>; Andrew Rybchenko
> <arybchenko@solarflare.com>
> Subject: Re: [PATCH v6 1/6] ethdev: introduce Rx buffer split
> 
[..snip..]
> 
> struct rte_eth_rxseg {
>     enum rte_eth_rxseg_mode mode ;
>     union {
>                struct rte_eth_rxseg_mode xxx {
>                               struct rte_mempool *mp; /**< Memory pool to allocate
> segment from. */
>                               uint16_t length; /**< Segment data length, configures split
> point. */
>                                uint16_t offset; /**< Data offset from beginning of mbuf data
> buffer. */
>                                uint32_t reserved; /**< Reserved field. */
>              }
> }

There is an array of rte_eth_rxseg. It would introduce multiple "enum rte_eth_rxseg_mode mode"
and would cause some ambiguity. About mode selection - please, see below.
Union seems to be good idea, let's adopt.

> 
> Another mode, Marvell PMD has it(I believe Intel also) i.e When we say:
> 
> seg0 - pool0, len0=2000B, off0=0
> seg1 - pool1, len1=2001B, off1=0
> 
> packet size up to, 2000B goes to pool 0 and if is >=2001 goes to pool1.
> I think, it is better to have mode param in rte_eth_rxseg for avoiding ABI
> changes.(Just  like clean rte_flow APIs)

It is supposed to choose with RTE_ETH_RX_OFFLOAD_xxx flags.
For packet sorting it should be something like this RTE_ETH_RX_OFFLOAD_SORT.
PMD reports it supports the feature, the flag is set in rx_conf->offloads
and rxseg structure is interpreted according to these flags.

Please, note, there is intentionally no check for RTE_ETH_RX_OFFLOAD_xxx
in rte_eth_dev_rx_queue_setup() - it should be done on PMD side.

> 
> > > If we see some of the features of such kind or other PMDs adopts the
> > > split feature - we'll try to find the common root and consider the way how
> to report it.
> >
> > My only concern with that approach will be ABI break again if
> > something needs to exposed over rte_eth_dev_info().

Let's reserve the pointer to struct rte_eth_rxseg_limitations
in the rte_eth_dev_info to avoid ABI break?

With best regards, Slava
  
Andrew Rybchenko Oct. 15, 2020, 11:09 a.m. UTC | #12
On 10/15/20 1:34 PM, Slava Ovsiienko wrote:
> Hi, Andrew
> 
>> -----Original Message-----
>> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>> Sent: Thursday, October 15, 2020 12:49
>> To: Slava Ovsiienko <viacheslavo@nvidia.com>; Jerin Jacob
>> <jerinjacobk@gmail.com>
>> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
>> <thomas@monjalon.net>; Stephen Hemminger
>> <stephen@networkplumber.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
>> Olivier Matz <olivier.matz@6wind.com>; Maxime Coquelin
>> <maxime.coquelin@redhat.com>; David Marchand
>> <david.marchand@redhat.com>; Andrew Rybchenko
>> <arybchenko@solarflare.com>
>> Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: introduce Rx buffer split
>>
>> On 10/15/20 10:43 AM, Slava Ovsiienko wrote:
>>> Hi, Jerin
>>>
>>>> -----Original Message-----
>>>> From: Jerin Jacob <jerinjacobk@gmail.com>
>>>> Sent: Wednesday, October 14, 2020 21:57
>>>> To: Slava Ovsiienko <viacheslavo@nvidia.com>
>>>> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
>>>> <thomas@monjalon.net>; Stephen Hemminger
>>>> <stephen@networkplumber.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
>>>> Olivier Matz <olivier.matz@6wind.com>; Maxime Coquelin
>>>> <maxime.coquelin@redhat.com>; David Marchand
>>>> <david.marchand@redhat.com>; Andrew Rybchenko
>>>> <arybchenko@solarflare.com>
>>>> Subject: Re: [PATCH v6 1/6] ethdev: introduce Rx buffer split
>>>>
>>>> On Wed, Oct 14, 2020 at 11:42 PM Viacheslav Ovsiienko
>>>> <viacheslavo@nvidia.com> wrote:
>>>>>
>>>>> The DPDK datapath in the transmit direction is very flexible.
>>>>> An application can build the multi-segment packet and manages almost
>>>>> all data aspects - the memory pools where segments are allocated
>>>>> from, the segment lengths, the memory attributes like external
>>>>> buffers, registered for DMA, etc.
>>>>>
>>>
>>> [..snip..]
>>>
>>>>> For example, let's suppose we configured the Rx queue with the
>>>>> following segments:
>>>>>     seg0 - pool0, len0=14B, off0=2
>>>>>     seg1 - pool1, len1=20B, off1=128B
>>>>>     seg2 - pool2, len2=20B, off2=0B
>>>>>     seg3 - pool3, len3=512B, off3=0B
>>>>
>>>>
>>>> Sorry for chime in late. This API lookout looks good to me.
>>>> But, I am wondering how the application can know the capability or
>>>> "limits" of struct rte_eth_rxseg structure for the specific PMD. The
>>>> other descriptor limit, it's being exposed with struct
>>>> rte_eth_dev_info::rx_desc_lim; If PMD can support a specific pattern
>>>> rather than returning the blanket error, the application should know the
>> limit.
>>>> IMO, it is better to add
>>>> struct rte_eth_rxseg *rxsegs;
>>>> unint16_t nb_max_rxsegs
>>>> in rte_eth_dev_info structure to express the capablity.
>>>> Where the en and offset can define the max offset.
>>>>
>>>> Thoughts?
>>>
>>> Moreover, there might be implied a lot of various limitations -
>>> offsets might be not supported at all or have some requirements for
>>> alignment, the similar requirements might be applied to segment size
>>> (say, ask for some granularity). Currently it is not obvious how to
>>> report all nuances, and it is supposed the limitations of this kind must be
>> documented in PMD chapter. As for mlx5 - it has no special limitations besides
>> common requirements to the regular segments.
>>>
>>> One more point - the split feature might be considered as just one of
>>> possible cases of using these segment descriptions, other features might
>> impose other (unknown for now) limitations.
>>> If we see some of the features of such kind or other PMDs adopts the
>>> split feature - we'll try to find the common root and consider the way how to
>> report it.
>>
>> At least there are few simple limitations which are easy to
>> express:
>>  1. Maximum number of segments
> We have scatter capability and we do not report the maximal number of segments,
> it is on PMD own. We could add the field to the rte_eth_dev_info, but not sure
> whether we have something special to report there even for mlx5 case.

There is always a limitation in programming and HW. Nothing is
unlimited. Limits could be high, but still exist.
Number of descriptors? Width of field in HW interface?
Maximum length of the config message to HW?
All above could limit it directly or indirectly.


>>  2. Possibility to use the last segment many times if required
>>     (I was suggesting to use scatter for it, but you rejected
>>      the idea - may be time to reconsider :) ) 
> 
> Mmm, sorry I do not follow, it might be I did not understand/missed your idea.
> Some of the last segment attributes are used multiple times to scatter the rest
> of the data in fashion very close to the existing scattering approach - at least,
> pool and buffer size from this pool are used. The beginning of the packet
> scattered according to the new descriptions, the rest of the packet -
> according to the existing regular scattering with pool settings from
> the last segment description.

I believe that the possibility to split into a fixed segments
(BUFFER_SPLIT) and possibility to use a mempool (just mp or
last segment) many times if a packet does not fit (SCATTER)
it is *different* features.
I can easily imagine HW which could do BUFFER_SPLIT to
fixed segments, but cannot use the last segment many times
(i.e. no classical SCATTER).

> 
>  3. Maximum offset
>>     Frankly speaking I'm not sure why it cannot be handled on
>>     PMD level (i.e. provide descriptors with offset taken into
>>     account or guarantee that HW mempool objects initialized
>>     correctly with required headroom). May be in some corner
>>     cases when the same HW mempool is shared by various
>>     segments with different offset requirements.
> 
> HW offsets are beyond the feature scope, the offsets in the segment
> description is supposed to be added to the native pool offsets (if any).

Are you saying that offsets are not passed to HW and just
handled by PMD to provide correct IOVA addresses to put
data to? If so, it is an implementation detail which is
specific to mlx5. If so, no specific limitations
except data room, size and offset consistency.
But it could be passed to a HW and it could be, for example,
just 8 bits for the value.

> 
>>  4. Offset alignment
>>  5. Maximum/minimum length of a segment
>>  6. Length alignment
> In which form? Mask of lsbs ? 0 means no limitations ?

log2, i.e. 0 => 1 (no limitations) 1 => 2 (even only),
6 => 64 (64-byte cache line aligned) etc.

> 
>>
>> I realize that 3, 4 and 5 could be per segment number.
>> If it is really that complex, report common denominator which is guaranteed to
>> work. If we have no checks on ethdev layer, application can ignore it if it knows
>> better.
> 
> Currently it is not clear at all what kind of limitations should be reported,
> we could include all of mentioned/proposed ones, and no one will report there -
> mlx5 has no any reasonable limitations to report for now.
> 
> Should we reserve some pointer field in the rte_eth_dev_info to report
> the limitations? (Limitation description should contain variable size array,
> depending on the number of segments, so pointer seems to be appropriate).
> It would allow us to avoid ABI break, and present the limitation structure once it is defined.

I will let other ethdev maintainers to make a decision here.
My vote would be to report limitations mentioned above.
It looks like Jerin is also interested in limitations
reporting. Not sure if my form looks OK or no.
  
Jerin Jacob Oct. 15, 2020, 11:26 a.m. UTC | #13
On Thu, Oct 15, 2020 at 4:21 PM Slava Ovsiienko <viacheslavo@nvidia.com> wrote:
>
> Hi, Jerin
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, October 15, 2020 13:28
> > To: Slava Ovsiienko <viacheslavo@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
> > <thomas@monjalon.net>; Stephen Hemminger
> > <stephen@networkplumber.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > Olivier Matz <olivier.matz@6wind.com>; Maxime Coquelin
> > <maxime.coquelin@redhat.com>; David Marchand
> > <david.marchand@redhat.com>; Andrew Rybchenko
> > <arybchenko@solarflare.com>
> > Subject: Re: [PATCH v6 1/6] ethdev: introduce Rx buffer split
> >
> [..snip..]
> >
> > struct rte_eth_rxseg {
> >     enum rte_eth_rxseg_mode mode ;
> >     union {
> >                struct rte_eth_rxseg_mode xxx {
> >                               struct rte_mempool *mp; /**< Memory pool to allocate
> > segment from. */
> >                               uint16_t length; /**< Segment data length, configures split
> > point. */
> >                                uint16_t offset; /**< Data offset from beginning of mbuf data
> > buffer. */
> >                                uint32_t reserved; /**< Reserved field. */
> >              }
> > }
>
> There is an array of rte_eth_rxseg. It would introduce multiple "enum rte_eth_rxseg_mode mode"
> and would cause some ambiguity. About mode selection - please, see below.
> Union seems to be good idea, let's adopt.

Ack. Let's take the only union concept.

>
> >
> > Another mode, Marvell PMD has it(I believe Intel also) i.e When we say:
> >
> > seg0 - pool0, len0=2000B, off0=0
> > seg1 - pool1, len1=2001B, off1=0
> >
> > packet size up to, 2000B goes to pool 0 and if is >=2001 goes to pool1.
> > I think, it is better to have mode param in rte_eth_rxseg for avoiding ABI
> > changes.(Just  like clean rte_flow APIs)
>
> It is supposed to choose with RTE_ETH_RX_OFFLOAD_xxx flags.
> For packet sorting it should be something like this RTE_ETH_RX_OFFLOAD_SORT.
> PMD reports it supports the feature, the flag is set in rx_conf->offloads
> and rxseg structure is interpreted according to these flags.

Works for me.

>
> Please, note, there is intentionally no check for RTE_ETH_RX_OFFLOAD_xxx
> in rte_eth_dev_rx_queue_setup() - it should be done on PMD side.
>
> >
> > > > If we see some of the features of such kind or other PMDs adopts the
> > > > split feature - we'll try to find the common root and consider the way how
> > to report it.
> > >
> > > My only concern with that approach will be ABI break again if
> > > something needs to exposed over rte_eth_dev_info().
>
> Let's reserve the pointer to struct rte_eth_rxseg_limitations
> in the rte_eth_dev_info to avoid ABI break?

Works for me. If we add an additional reserved field.

Due to RC1 time constraint, I am OK to leave it as a reserved filed and fill
meat when it is required if other ethdev maintainers are OK.
I will be required for feature complete.



>
> With best regards, Slava
  
Ferruh Yigit Oct. 15, 2020, 11:36 a.m. UTC | #14
On 10/15/2020 12:26 PM, Jerin Jacob wrote:

<...>

>>>>> If we see some of the features of such kind or other PMDs adopts the
>>>>> split feature - we'll try to find the common root and consider the way how
>>> to report it.
>>>>
>>>> My only concern with that approach will be ABI break again if
>>>> something needs to exposed over rte_eth_dev_info().
>>
>> Let's reserve the pointer to struct rte_eth_rxseg_limitations
>> in the rte_eth_dev_info to avoid ABI break?
> 
> Works for me. If we add an additional reserved field.
> 
> Due to RC1 time constraint, I am OK to leave it as a reserved filed and fill
> meat when it is required if other ethdev maintainers are OK.
> I will be required for feature complete.
> 

Sounds good to me.
  
Slava Ovsiienko Oct. 15, 2020, 11:49 a.m. UTC | #15
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Thursday, October 15, 2020 14:37
> To: Jerin Jacob <jerinjacobk@gmail.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
> <thomas@monjalon.net>; Stephen Hemminger
> <stephen@networkplumber.org>; Olivier Matz <olivier.matz@6wind.com>;
> Maxime Coquelin <maxime.coquelin@redhat.com>; David Marchand
> <david.marchand@redhat.com>; Andrew Rybchenko
> <arybchenko@solarflare.com>
> Subject: Re: [PATCH v6 1/6] ethdev: introduce Rx buffer split
> 
> On 10/15/2020 12:26 PM, Jerin Jacob wrote:
> 
> <...>
> 
> >>>>> If we see some of the features of such kind or other PMDs adopts
> >>>>> the split feature - we'll try to find the common root and consider
> >>>>> the way how
> >>> to report it.
> >>>>
> >>>> My only concern with that approach will be ABI break again if
> >>>> something needs to exposed over rte_eth_dev_info().
> >>
> >> Let's reserve the pointer to struct rte_eth_rxseg_limitations in the
> >> rte_eth_dev_info to avoid ABI break?
> >
> > Works for me. If we add an additional reserved field.
> >
> > Due to RC1 time constraint, I am OK to leave it as a reserved filed
> > and fill meat when it is required if other ethdev maintainers are OK.
> > I will be required for feature complete.
> >
> 
> Sounds good to me.

OK, let's introduce the pointer in the rte_eth_dev_info and 
define struct rte_eth_rxseg_limitations as experimental.
Will it be allowed to update this one later (after 20.11)? 
Is ABI break is allowed for the case?

With best regards, Slava
  
Thomas Monjalon Oct. 15, 2020, 12:49 p.m. UTC | #16
15/10/2020 13:49, Slava Ovsiienko:
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> > On 10/15/2020 12:26 PM, Jerin Jacob wrote:
> > 
> > <...>
> > 
> > >>>>> If we see some of the features of such kind or other PMDs adopts
> > >>>>> the split feature - we'll try to find the common root and consider
> > >>>>> the way how
> > >>> to report it.
> > >>>>
> > >>>> My only concern with that approach will be ABI break again if
> > >>>> something needs to exposed over rte_eth_dev_info().
> > >>
> > >> Let's reserve the pointer to struct rte_eth_rxseg_limitations in the
> > >> rte_eth_dev_info to avoid ABI break?
> > >
> > > Works for me. If we add an additional reserved field.
> > >
> > > Due to RC1 time constraint, I am OK to leave it as a reserved filed
> > > and fill meat when it is required if other ethdev maintainers are OK.
> > > I will be required for feature complete.
> > >
> > 
> > Sounds good to me.

OK for me.

> OK, let's introduce the pointer in the rte_eth_dev_info and 
> define struct rte_eth_rxseg_limitations as experimental.
> Will it be allowed to update this one later (after 20.11)? 
> Is ABI break is allowed for the case?

If it is experimental, you can change it at anytime.

Ideally, we could try to have a first version of the limitations
during 20.11-rc2.
  
Andrew Rybchenko Oct. 15, 2020, 1:07 p.m. UTC | #17
On 10/15/20 3:49 PM, Thomas Monjalon wrote:
> 15/10/2020 13:49, Slava Ovsiienko:
>> From: Ferruh Yigit <ferruh.yigit@intel.com>
>>> On 10/15/2020 12:26 PM, Jerin Jacob wrote:
>>>
>>> <...>
>>>
>>>>>>>> If we see some of the features of such kind or other PMDs adopts
>>>>>>>> the split feature - we'll try to find the common root and consider
>>>>>>>> the way how
>>>>>> to report it.
>>>>>>>
>>>>>>> My only concern with that approach will be ABI break again if
>>>>>>> something needs to exposed over rte_eth_dev_info().
>>>>>
>>>>> Let's reserve the pointer to struct rte_eth_rxseg_limitations in the
>>>>> rte_eth_dev_info to avoid ABI break?
>>>>
>>>> Works for me. If we add an additional reserved field.
>>>>
>>>> Due to RC1 time constraint, I am OK to leave it as a reserved filed
>>>> and fill meat when it is required if other ethdev maintainers are OK.
>>>> I will be required for feature complete.
>>>>
>>>
>>> Sounds good to me.
> 
> OK for me.

OK as well, but I dislike the idea with pointer in dev_info.
It sounds like it breaks existing practice.
We should either reserve enough space or simply add
dedicated API call to report Rx seg capabilities.

> 
>> OK, let's introduce the pointer in the rte_eth_dev_info and 
>> define struct rte_eth_rxseg_limitations as experimental.
>> Will it be allowed to update this one later (after 20.11)? 
>> Is ABI break is allowed for the case?
> 
> If it is experimental, you can change it at anytime.
> 
> Ideally, we could try to have a first version of the limitations
> during 20.11-rc2.

Yes, please.
  
Slava Ovsiienko Oct. 15, 2020, 1:57 p.m. UTC | #18
> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Thursday, October 15, 2020 16:07
> To: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Ferruh Yigit
> <ferruh.yigit@intel.com>; Jerin Jacob <jerinjacobk@gmail.com>; Slava
> Ovsiienko <viacheslavo@nvidia.com>; Andrew Rybchenko
> <arybchenko@solarflare.com>
> Cc: dpdk-dev <dev@dpdk.org>; Stephen Hemminger
> <stephen@networkplumber.org>; Olivier Matz <olivier.matz@6wind.com>;
> Maxime Coquelin <maxime.coquelin@redhat.com>; David Marchand
> <david.marchand@redhat.com>
> Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: introduce Rx buffer split
> 
> On 10/15/20 3:49 PM, Thomas Monjalon wrote:
> > 15/10/2020 13:49, Slava Ovsiienko:
> >> From: Ferruh Yigit <ferruh.yigit@intel.com>
> >>> On 10/15/2020 12:26 PM, Jerin Jacob wrote:
> >>>
> >>> <...>
> >>>
> >>>>>>>> If we see some of the features of such kind or other PMDs
> >>>>>>>> adopts the split feature - we'll try to find the common root
> >>>>>>>> and consider the way how
> >>>>>> to report it.
> >>>>>>>
> >>>>>>> My only concern with that approach will be ABI break again if
> >>>>>>> something needs to exposed over rte_eth_dev_info().
> >>>>>
> >>>>> Let's reserve the pointer to struct rte_eth_rxseg_limitations in
> >>>>> the rte_eth_dev_info to avoid ABI break?
> >>>>
> >>>> Works for me. If we add an additional reserved field.
> >>>>
> >>>> Due to RC1 time constraint, I am OK to leave it as a reserved filed
> >>>> and fill meat when it is required if other ethdev maintainers are OK.
> >>>> I will be required for feature complete.
> >>>>
> >>>
> >>> Sounds good to me.
> >
> > OK for me.
> 
> OK as well, but I dislike the idea with pointer in dev_info.
> It sounds like it breaks existing practice.

Moreover, if we are going to have multiple features using Rx segmentation
we should provide multiple structures with limitations - at least, one  per feature.

> We should either reserve enough space or simply add dedicated API call to
> report Rx seg capabilities.
> 
It seems we are trying to embrace everything in very generic way 😊
Just curious - how did we managed to survive without limitations on Tx direction?
No one told us how many segments PMD supports on Tx, what is the limitations
for offsets and alignments, it seems there is no limits for tx segment size at all.
How could it happen? Tx limitations do not exist? Just no one cared about the Tx limitations?

As for Rx limitations - there are no reasonable ones for now. We'll invent
the way to report the limitations (and it seems to be unbalanced - we should provide
the same to Tx), the next step is to provide at least one PMD using that,
and in this way to make mlx5 PMD to report silly values - "I have no reasonable
limitations beyond meaningful buffer size under pool_buf_size/UINT16_MAX)".

IMO, If some HW does not support arbitrary split (suppose, it is not common case,
most of HW is very flexible about specifying Rx buffers) the BUFFER_SPLIT feature
should not be advertised at all, because it would be not very useful - application
is intended to work over specific protocol, it knows where it wants to set split point
(defined by packet format). Hence,  application is not so interested about offsets,
alignments, etc - it just checks whether PMD provides requested split points or not.

That's why just simple documenting was initially intended, there are just no
a lot of limitations expected, likewise Tx direction shows that.

Yes, generally speaking, there are no doubts it would be nice to report the limitations, but:
- not expected to have many (documenting can the few exceptions)
- no nice way is found how to report - pointer? API?
- complicated to present for various features (variable size array, multiple features)
- not known which limitations are actually needed, just some theoretical ones

So, we see the large white area, should we invent something not well-defined to cover one,
or let's wait for actual request to check limitations that can't be handled by documenting
and internal PMD checking/validation?

With best regards, Slava
  
Slava Ovsiienko Oct. 15, 2020, 2:39 p.m. UTC | #19
Hi, Andrew

> >> At least there are few simple limitations which are easy to
> >> express:
> >>  1. Maximum number of segments
> > We have scatter capability and we do not report the maximal number of
> > segments, it is on PMD own. We could add the field to the
> > rte_eth_dev_info, but not sure whether we have something special to report
> there even for mlx5 case.
> 
> There is always a limitation in programming and HW. Nothing is unlimited.
> Limits could be high, but still exist.
> Number of descriptors? Width of field in HW interface?
> Maximum length of the config message to HW?
> All above could limit it directly or indirectly.

None of above is applicable to mlx5 buffer split feature - it just adjusts the Rx buffer pointers
and segment sizes, no anything beyond generic limitation - the queue descriptor numbers
and mbuf buffer size. Suppose the most of HW by other vendors is capable to support
buffer split feature with similar generic limitations.

> 
> >>  2. Possibility to use the last segment many times if required
> >>     (I was suggesting to use scatter for it, but you rejected
> >>      the idea - may be time to reconsider :) )
> >
> > Mmm, sorry I do not follow, it might be I did not understand/missed your
> idea.
> > Some of the last segment attributes are used multiple times to scatter
> > the rest of the data in fashion very close to the existing scattering
> > approach - at least, pool and buffer size from this pool are used. The
> > beginning of the packet scattered according to the new descriptions,
> > the rest of the packet - according to the existing regular scattering
> > with pool settings from the last segment description.
> 
> I believe that the possibility to split into a fixed segments
> (BUFFER_SPLIT) and possibility to use a mempool (just mp or last segment)
> many times if a packet does not fit (SCATTER) it is *different* features.

Sorry, what do you mean "use mempool many times"? Allocate multiple
mbufs from the same mempool and build the chain of them? 

We have SCATTER offload and many PMDs advertise that. 
Scattering is actually the split, the split happens on some well-defined points
to the mbufs from the same pool. BUFFER_SPLIT just extends SCATTER
capabilities by providing the split point arbitrary settings and multiple
pools.

> I can easily imagine HW which could do BUFFER_SPLIT to fixed segments, but
> cannot use the last segment many times (i.e. no classical SCATTER).

Sorry, what do you mean "BUFFER_SPLIT to fixed segments" ?
This new offload BUFFER_SPLIT  is intended to push data to flexible segments,
potentially allocated from the different pools. The HW can be constrained
with pool number (say it supports some pool alloc/free hardware accelerator
for single pool only), in this case it will not be able to support BUFFER_SPLIT
in multiple pool config, but using the single pool does not arise the problem.

It seems I missed something, could you, please, provide an example,
how would you like to see the usage last segment many times for BUFFER_SPLIT?
How the packet should be split, in mbufs with what (last segment inherited) attributes?

> 
> >
> >  3. Maximum offset
> >>     Frankly speaking I'm not sure why it cannot be handled on
> >>     PMD level (i.e. provide descriptors with offset taken into
> >>     account or guarantee that HW mempool objects initialized
> >>     correctly with required headroom). May be in some corner
> >>     cases when the same HW mempool is shared by various
> >>     segments with different offset requirements.
> >
> > HW offsets are beyond the feature scope, the offsets in the segment
> > description is supposed to be added to the native pool offsets (if any).
> 
> Are you saying that offsets are not passed to HW and just handled by PMD to
> provide correct IOVA addresses to put data to? If so, it is an implementation
> detail which is specific to mlx5. If so, no specific limitations except data room,
> size and offset consistency.
> But it could be passed to a HW and it could be, for example, just 8 bits for the
> value.

Yes, it could. But there should be other vendors be involved, not known for now
who is going to support BUFFER_SPLIT and in which way. We should not invent
some theoretical limitations and merge the dead code. And, please note -
Tx segmentation has been living for 10 years successfully without any limitations,
no one cares about, there is no any request to report. Likewise is expected for Rx.

> 
> >
> >>  4. Offset alignment
> >>  5. Maximum/minimum length of a segment  6. Length alignment
> > In which form? Mask of lsbs ? 0 means no limitations ?
> 
> log2, i.e. 0 => 1 (no limitations) 1 => 2 (even only),
> 6 => 64 (64-byte cache line aligned) etc.
> 

Yes, possible option.
> >
> >>
> >> I realize that 3, 4 and 5 could be per segment number.
> >> If it is really that complex, report common denominator which is
> >> guaranteed to work. If we have no checks on ethdev layer, application
> >> can ignore it if it knows better
> >
> > Currently it is not clear at all what kind of limitations should be
> > reported, we could include all of mentioned/proposed ones, and no one
> > will report there -
> > mlx5 has no any reasonable limitations to report for now.
> >
> > Should we reserve some pointer field in the rte_eth_dev_info to report
> > the limitations? (Limitation description should contain variable size
> > array, depending on the number of segments, so pointer seems to be
> appropriate).
> > It would allow us to avoid ABI break, and present the limitation structure
> once it is defined.
> 
> I will let other ethdev maintainers to make a decision here.
> My vote would be to report limitations mentioned above.
> It looks like Jerin is also interested in limitations reporting. Not sure if my form
> looks OK or no.

For now I tend to think we could reserve some pointer for BUFFER_SPLIT limitations and that's it.
Reporting some silly generic limitations from mlx5 means introducing the dead code in my opinion.
If we'll see the actual request from applications to check and handle limitations (actually applications
are very limited in this matter - they expect the split point to be set at very strong defined place
of the packet format).
  
Slava Ovsiienko Oct. 15, 2020, 8:22 p.m. UTC | #20
Hi, 

Evening update:
- addressed code comments
- provided the union of segmentation description  with dedicated feature structures according Jerin's proposal
- added the reporting of split limitation

With best regards, Slava

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Thursday, October 15, 2020 16:07
> To: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Ferruh Yigit
> <ferruh.yigit@intel.com>; Jerin Jacob <jerinjacobk@gmail.com>; Slava
> Ovsiienko <viacheslavo@nvidia.com>; Andrew Rybchenko
> <arybchenko@solarflare.com>
> Cc: dpdk-dev <dev@dpdk.org>; Stephen Hemminger
> <stephen@networkplumber.org>; Olivier Matz <olivier.matz@6wind.com>;
> Maxime Coquelin <maxime.coquelin@redhat.com>; David Marchand
> <david.marchand@redhat.com>
> Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: introduce Rx buffer split
> 
> On 10/15/20 3:49 PM, Thomas Monjalon wrote:
> > 15/10/2020 13:49, Slava Ovsiienko:
> >> From: Ferruh Yigit <ferruh.yigit@intel.com>
> >>> On 10/15/2020 12:26 PM, Jerin Jacob wrote:
> >>>
> >>> <...>
> >>>
> >>>>>>>> If we see some of the features of such kind or other PMDs
> >>>>>>>> adopts the split feature - we'll try to find the common root
> >>>>>>>> and consider the way how
> >>>>>> to report it.
> >>>>>>>
> >>>>>>> My only concern with that approach will be ABI break again if
> >>>>>>> something needs to exposed over rte_eth_dev_info().
> >>>>>
> >>>>> Let's reserve the pointer to struct rte_eth_rxseg_limitations in
> >>>>> the rte_eth_dev_info to avoid ABI break?
> >>>>
> >>>> Works for me. If we add an additional reserved field.
> >>>>
> >>>> Due to RC1 time constraint, I am OK to leave it as a reserved filed
> >>>> and fill meat when it is required if other ethdev maintainers are OK.
> >>>> I will be required for feature complete.
> >>>>
> >>>
> >>> Sounds good to me.
> >
> > OK for me.
> 
> OK as well, but I dislike the idea with pointer in dev_info.
> It sounds like it breaks existing practice.
> We should either reserve enough space or simply add dedicated API call to
> report Rx seg capabilities.
> 
> >
> >> OK, let's introduce the pointer in the rte_eth_dev_info and define
> >> struct rte_eth_rxseg_limitations as experimental.
> >> Will it be allowed to update this one later (after 20.11)?
> >> Is ABI break is allowed for the case?
> >
> > If it is experimental, you can change it at anytime.
> >
> > Ideally, we could try to have a first version of the limitations
> > during 20.11-rc2.
> 
> Yes, please.
  

Patch

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index dd8c955..832ea3b 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -185,6 +185,21 @@  Supports receiving segmented mbufs.
 * **[related]    eth_dev_ops**: ``rx_pkt_burst``.
 
 
+.. _nic_features_buffer_split:
+
+Buffer Split on Rx
+------------------
+
+Scatters the packets being received on specified boundaries to segmented mbufs.
+
+* **[uses]       rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
+* **[uses]       rte_eth_rxconf**: ``rx_conf.rx_seg, rx_conf.rx_nseg``.
+* **[implements] datapath**: ``Buffer Split functionality``.
+* **[provides]   rte_eth_dev_info**: ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
+* **[provides]   eth_dev_ops**: ``rxq_info_get:buffer_split``.
+* **[related] API**: ``rte_eth_rx_queue_setup()``.
+
+
 .. _nic_features_lro:
 
 LRO
diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 584e720..232cd54 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -138,11 +138,6 @@  Deprecation Notices
   In 19.11 PMDs will still update the field even when the offload is not
   enabled.
 
-* ethdev: Add new fields to ``rte_eth_rxconf`` to configure the receiving
-  queues to split ingress packets into multiple segments according to the
-  specified lengths into the buffers allocated from the specified
-  memory pools. The backward compatibility to existing API is preserved.
-
 * ethdev: ``rx_descriptor_done`` dev_ops and ``rte_eth_rx_descriptor_done``
   will be removed in 21.11.
   Existing ``rte_eth_rx_descriptor_status`` and ``rte_eth_tx_descriptor_status``
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index bcc0fc2..bcc2479 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -60,6 +60,12 @@  New Features
   Added the FEC API which provides functions for query FEC capabilities and
   current FEC mode from device. Also, API for configuring FEC mode is also provided.
 
+* **Introduced extended buffer description for receiving.**
+
+  Added the extended Rx buffer description for Rx queue setup routine
+  providing the individual settings for each Rx segment with maximal size,
+  buffer offset and memory pool to allocate data buffers from.
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
@@ -253,6 +259,9 @@  API Changes
   As the data of ``uint8_t`` will be truncated when queue number under
   a TC is greater than 256.
 
+* ethdev: Added fields rx_seg and rx_nseg to rte_eth_rxconf structure
+  to provide extended description of the receiving buffer.
+
 * vhost: Moved vDPA APIs from experimental to stable.
 
 * rawdev: Added a structure size parameter to the functions
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 892c246..96ecb91 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -105,6 +105,9 @@  struct rte_eth_xstats_name_off {
 #define RTE_RX_OFFLOAD_BIT2STR(_name)	\
 	{ DEV_RX_OFFLOAD_##_name, #_name }
 
+#define RTE_ETH_RX_OFFLOAD_BIT2STR(_name)	\
+	{ RTE_ETH_RX_OFFLOAD_##_name, #_name }
+
 static const struct {
 	uint64_t offload;
 	const char *name;
@@ -128,9 +131,11 @@  struct rte_eth_xstats_name_off {
 	RTE_RX_OFFLOAD_BIT2STR(SCTP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
+#undef RTE_ETH_RX_OFFLOAD_BIT2STR
 
 #define RTE_TX_OFFLOAD_BIT2STR(_name)	\
 	{ DEV_TX_OFFLOAD_##_name, #_name }
@@ -1784,38 +1789,94 @@  struct rte_eth_dev *
 		return -EINVAL;
 	}
 
-	if (mp == NULL) {
-		RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
-		return -EINVAL;
-	}
-
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup, -ENOTSUP);
 
-	/*
-	 * Check the size of the mbuf data buffer.
-	 * This value must be provided in the private data of the memory pool.
-	 * First check that the memory pool has a valid private data.
-	 */
 	ret = rte_eth_dev_info_get(port_id, &dev_info);
 	if (ret != 0)
 		return ret;
 
-	if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
-		RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
-			mp->name, (int)mp->private_data_size,
-			(int)sizeof(struct rte_pktmbuf_pool_private));
-		return -ENOSPC;
-	}
-	mbp_buf_size = rte_pktmbuf_data_room_size(mp);
+	if (mp) {
+		/* Single pool configuration check. */
+		if (rx_conf->rx_seg || rx_conf->rx_nseg) {
+			RTE_ETHDEV_LOG(ERR,
+				       "Ambiguous segment configuration\n");
+			return -EINVAL;
+		}
+		/*
+		 * Check the size of the mbuf data buffer, this value
+		 * must be provided in the private data of the memory pool.
+		 * First check that the memory pool(s) has a valid private data.
+		 */
+		if (mp->private_data_size <
+				sizeof(struct rte_pktmbuf_pool_private)) {
+			RTE_ETHDEV_LOG(ERR, "%s private_data_size %u < %u\n",
+				mp->name, mp->private_data_size,
+				(unsigned int)
+				sizeof(struct rte_pktmbuf_pool_private));
+			return -ENOSPC;
+		}
+		mbp_buf_size = rte_pktmbuf_data_room_size(mp);
+		if (mbp_buf_size < dev_info.min_rx_bufsize +
+				   RTE_PKTMBUF_HEADROOM) {
+			RTE_ETHDEV_LOG(ERR,
+				       "%s mbuf_data_room_size %u < %u"
+				       " (RTE_PKTMBUF_HEADROOM=%u +"
+				       " min_rx_bufsize(dev)=%u)\n",
+				       mp->name, mbp_buf_size,
+				       RTE_PKTMBUF_HEADROOM +
+				       dev_info.min_rx_bufsize,
+				       RTE_PKTMBUF_HEADROOM,
+				       dev_info.min_rx_bufsize);
+			return -EINVAL;
+		}
+	} else {
+		const struct rte_eth_rxseg *rx_seg = rx_conf->rx_seg;
+		uint16_t n_seg = rx_conf->rx_nseg;
+		uint16_t seg_idx;
 
-	if (mbp_buf_size < dev_info.min_rx_bufsize + RTE_PKTMBUF_HEADROOM) {
-		RTE_ETHDEV_LOG(ERR,
-			"%s mbuf_data_room_size %d < %d (RTE_PKTMBUF_HEADROOM=%d + min_rx_bufsize(dev)=%d)\n",
-			mp->name, (int)mbp_buf_size,
-			(int)(RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize),
-			(int)RTE_PKTMBUF_HEADROOM,
-			(int)dev_info.min_rx_bufsize);
-		return -EINVAL;
+		/* Extended multi-segment configuration check. */
+		if (!rx_conf->rx_seg || !rx_conf->rx_nseg) {
+			RTE_ETHDEV_LOG(ERR,
+				       "Memory pool is null and no"
+				       " extended configuration provided\n");
+			return -EINVAL;
+		}
+		/*
+		 * Check the sizes and offsets against buffer sizes
+		 * for each segment specified in extended configuration.
+		 */
+		for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
+			struct rte_mempool *mpl = rx_seg[seg_idx].mp;
+			uint32_t length = rx_seg[seg_idx].length;
+			uint32_t offset = rx_seg[seg_idx].offset;
+			uint32_t head_room = seg_idx ? 0 : RTE_PKTMBUF_HEADROOM;
+
+			if (mpl == NULL) {
+				RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
+				return -EINVAL;
+			}
+			if (mpl->private_data_size <
+				sizeof(struct rte_pktmbuf_pool_private)) {
+				RTE_ETHDEV_LOG(ERR,
+					       "%s private_data_size %u < %u\n",
+					       mpl->name,
+					       mpl->private_data_size,
+					       (unsigned int)sizeof(struct
+					       rte_pktmbuf_pool_private));
+				return -ENOSPC;
+			}
+			mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
+			length = length ? length : (mbp_buf_size - head_room);
+			if (mbp_buf_size < length + offset + head_room) {
+				RTE_ETHDEV_LOG(ERR,
+					"%s mbuf_data_room_size %u < %u"
+					" (segment length=%u +"
+					" segment offset=%u)\n",
+					mpl->name, mbp_buf_size,
+					length + offset, length, offset);
+				return -EINVAL;
+			}
+		}
 	}
 
 	/* Use default specified by driver, if nb_rx_desc is zero */
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 5bcfbb8..e019f4a 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -970,6 +970,16 @@  struct rte_eth_txmode {
 };
 
 /**
+ * A structure used to configure an RX packet segment to split.
+ */
+struct rte_eth_rxseg {
+	struct rte_mempool *mp; /**< Memory pool to allocate segment from. */
+	uint16_t length; /**< Segment data length, configures split point. */
+	uint16_t offset; /**< Data offset from beginning of mbuf data buffer. */
+	uint32_t reserved; /**< Reserved field. */
+};
+
+/**
  * A structure used to configure an RX ring of an Ethernet port.
  */
 struct rte_eth_rxconf {
@@ -977,6 +987,43 @@  struct rte_eth_rxconf {
 	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
+	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Points to the array of segment descriptions. Each array element
+	 * describes the properties for each segment in the receiving
+	 * buffer.
+	 *
+	 * If RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag is set in offloads field,
+	 * the PMD will split the received packets into multiple segments
+	 * according to the specification in the description array:
+	 *
+	 * - the first network buffer will be allocated from the memory pool,
+	 *   specified in the first array element, the second buffer, from the
+	 *   pool in the second element, and so on.
+	 *
+	 * - the offsets from the segment description elements specify
+	 *   the data offset from the buffer beginning except the first mbuf.
+	 *   For this one the offset is added with RTE_PKTMBUF_HEADROOM.
+	 *
+	 * - the lengthes in the elements define the maximal data amount
+	 *   being received to each segment. The receiving starts with filling
+	 *   up the first mbuf data buffer up to specified length. If the
+	 *   there are data remaining (packet is longer than buffer in the first
+	 *   mbuf) the following data will be pushed to the next segment
+	 *   up to its own length, and so on.
+	 *
+	 * - If the length in the segment description element is zero
+	 *   the actual buffer size will be deduced from the appropriate
+	 *   memory pool properties.
+	 *
+	 * - if there is not enough elements to describe the buffer for entire
+	 *   packet of maximal length the following parameters will be used
+	 *   for the all remaining segments:
+	 *     - pool from the last valid element
+	 *     - the buffer size from this pool
+	 *     - zero offset
+	 */
+	struct rte_eth_rxseg *rx_seg;
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1260,6 +1307,7 @@  struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_SCTP_CKSUM	0x00020000
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
+#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
@@ -2027,9 +2075,21 @@  int rte_eth_dev_configure(uint16_t port_id, uint16_t nb_rx_queue,
  *   No need to repeat any bit in rx_conf->offloads which has already been
  *   enabled in rte_eth_dev_configure() at port level. An offloading enabled
  *   at port level can't be disabled at queue level.
+ *   The configuration structure also contains the pointer to the array
+ *   of the receiving buffer segment descriptions, see rx_seg and rx_nseg
+ *   fields, this extended configuration might be used by split offloads like
+ *   RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT. If mp_pool is not NULL,
+ *   the extended configuration fields must be set to NULL and zero.
  * @param mb_pool
  *   The pointer to the memory pool from which to allocate *rte_mbuf* network
- *   memory buffers to populate each descriptor of the receive ring.
+ *   memory buffers to populate each descriptor of the receive ring. There are
+ *   two options to provide Rx buffer configuration:
+ *   - single pool:
+ *     mb_pool is not NULL, rx_conf.rx_seg is NULL, rx_conf.rx_nseg is 0.
+ *   - multiple segments description:
+ *     mb_pool is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not 0.
+ *     Taken only if flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is set in offloads.
+ *
  * @return
  *   - 0: Success, receive queue correctly set up.
  *   - -EIO: if device is removed.