[v8,1/3] ethdev: introduce protocol hdr based buffer split

Message ID 20220601135059.958882-2-wenxuanx.wu@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Andrew Rybchenko
Series ethdev: introduce protocol type based header split |

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Wu, WenxuanX June 1, 2022, 1:50 p.m. UTC
  From: Wenxuan Wu <wenxuanx.wu@intel.com>

Currently, Rx buffer split supports length based split. With Rx queue
offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT enabled and Rx packet segment
configured, PMD will be able to split the received packets into
multiple segments.

However, length based buffer split is not suitable for NICs that do split
based on protocol headers. Given an arbitrarily variable length in an Rx
packet segment, it is almost impossible to pass a fixed protocol header to
the PMD. Besides, due to tunneling, the composition of a packet varies,
which makes the situation even worse.

This patch extends the current buffer split to support protocol header based
buffer split. A new proto_hdr field is introduced in the reserved field
of the rte_eth_rxseg_split structure to specify the protocol header. The
proto_hdr field defines the split position of a packet: splitting always
happens after the protocol header defined in the Rx packet segment. When the
Rx queue offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is enabled and the
corresponding protocol header is configured, the PMD will split the ingress
packets into multiple segments.

struct rte_eth_rxseg_split {

        struct rte_mempool *mp; /* memory pools to allocate segment from */
        uint16_t length; /* segment maximal data length,
                            configures "split point" */
        uint16_t offset; /* data offset from beginning
                            of mbuf data buffer */
        uint32_t proto_hdr; /* inner/outer L2/L3/L4 protocol header,
			       configures "split point" */
    };

Both inner and outer L2/L3/L4 level protocol header split can be supported.
The corresponding protocol header capabilities are RTE_PTYPE_L2_ETHER,
RTE_PTYPE_L3_IPV4, RTE_PTYPE_L3_IPV6, RTE_PTYPE_L4_TCP, RTE_PTYPE_L4_UDP,
RTE_PTYPE_L4_SCTP, RTE_PTYPE_INNER_L2_ETHER, RTE_PTYPE_INNER_L3_IPV4,
RTE_PTYPE_INNER_L3_IPV6, RTE_PTYPE_INNER_L4_TCP, RTE_PTYPE_INNER_L4_UDP,
RTE_PTYPE_INNER_L4_SCTP.

For example, let's suppose we configured the Rx queue with the
following segments:
    seg0 - pool0, proto_hdr0=RTE_PTYPE_L3_IPV4, off0=2B
    seg1 - pool1, proto_hdr1=RTE_PTYPE_L4_UDP, off1=128B
    seg2 - pool2, off2=0B

A packet consisting of MAC_IPV4_UDP_PAYLOAD will be split as
follows:
    seg0 - ipv4 header @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
    seg1 - udp header @ 128 in mbuf from pool1
    seg2 - payload @ 0 in mbuf from pool2

Now buffer split can be configured in two modes. For length based
buffer split, the mp, length, and offset fields in the Rx packet segment
should be configured, while the proto_hdr field should not be configured.
For protocol header based buffer split, the mp, offset, and proto_hdr
fields in the Rx packet segment should be configured, while the length
field should not be configured.
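As a rough sketch of the example configuration above (using illustrative stand-in types and ptype values rather than the real DPDK headers, since the proto_hdr field only exists with this patch applied):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-ins; the real RTE_PTYPE_* encodings differ. */
#define PTYPE_UNKNOWN  0x0u
#define PTYPE_L3_IPV4  0x10u
#define PTYPE_L4_UDP   0x200u

struct mempool;                 /* placeholder for struct rte_mempool */

struct rxseg_split {            /* mirrors struct rte_eth_rxseg_split */
	struct mempool *mp;     /* memory pool to allocate segment from */
	uint16_t length;        /* length based split point, 0 when unused */
	uint16_t offset;        /* data offset from start of mbuf data buffer */
	uint32_t proto_hdr;     /* protocol header split point, UNKNOWN when unused */
};

/* Protocol header based split: length stays 0, proto_hdr is set.
 * In real code the pools come from rte_pktmbuf_pool_create() and the
 * array is passed via rte_eth_rxconf.rx_seg / rx_nseg to
 * rte_eth_rx_queue_setup(). */
static const struct rxseg_split segs[3] = {
	{ NULL, 0, 2,   PTYPE_L3_IPV4 },  /* seg0: split after IPv4 header */
	{ NULL, 0, 128, PTYPE_L4_UDP  },  /* seg1: split after UDP header  */
	{ NULL, 0, 0,   PTYPE_UNKNOWN },  /* seg2: remaining payload       */
};

/* A segment is protocol based iff proto_hdr is set and length is 0. */
static int seg_is_proto_based(const struct rxseg_split *s)
{
	return s->proto_hdr != PTYPE_UNKNOWN && s->length == 0;
}
```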

The split limitations imposed by the underlying PMD are reported in the
rte_eth_dev_info->rx_seg_capa field. The memory attributes of the split
parts may also differ, e.g., DPDK memory vs. external memory.

Signed-off-by: Xuan Ding <xuan.ding@intel.com>
Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
Signed-off-by: Wenxuan Wu <wenxuanx.wu@intel.com>
Reviewed-by: Qi Zhang <qi.z.zhang@intel.com>
Acked-by: Ray Kinsella <mdr@ashroe.eu>
---
 lib/ethdev/rte_ethdev.c | 40 +++++++++++++++++++++++++++++++++-------
 lib/ethdev/rte_ethdev.h | 28 +++++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 8 deletions(-)
  

Comments

Andrew Rybchenko June 2, 2022, 1:20 p.m. UTC | #1
Is it the right one since it is listed in patchwork?

On 6/1/22 16:50, wenxuanx.wu@intel.com wrote:
> From: Wenxuan Wu <wenxuanx.wu@intel.com>
> 
> Currently, Rx buffer split supports length based split. With Rx queue
> offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT enabled and Rx packet segment
> configured, PMD will be able to split the received packets into
> multiple segments.
> 
> However, length based buffer split is not suitable for NICs that do split
> based on protocol headers. Given a arbitrarily variable length in Rx packet

a -> an

> segment, it is almost impossible to pass a fixed protocol header to PMD.
> Besides, the existence of tunneling results in the composition of a packet
> is various, which makes the situation even worse.
> 
> This patch extends current buffer split to support protocol header based
> buffer split. A new proto_hdr field is introduced in the reserved field
> of rte_eth_rxseg_split structure to specify protocol header. The proto_hdr
> field defines the split position of packet, splitting will always happens
> after the protocol header defined in the Rx packet segment. When Rx queue
> offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is enabled and corresponding
> protocol header is configured, PMD will split the ingress packets into
> multiple segments.
> 
> struct rte_eth_rxseg_split {
> 
>          struct rte_mempool *mp; /* memory pools to allocate segment from */
>          uint16_t length; /* segment maximal data length,
>                              configures "split point" */
>          uint16_t offset; /* data offset from beginning
>                              of mbuf data buffer */
>          uint32_t proto_hdr; /* inner/outer L2/L3/L4 protocol header,
> 			       configures "split point" */
>      };
> 
> Both inner and outer L2/L3/L4 level protocol header split can be supported.
> Corresponding protocol header capability is RTE_PTYPE_L2_ETHER,
> RTE_PTYPE_L3_IPV4, RTE_PTYPE_L3_IPV6, RTE_PTYPE_L4_TCP, RTE_PTYPE_L4_UDP,
> RTE_PTYPE_L4_SCTP, RTE_PTYPE_INNER_L2_ETHER, RTE_PTYPE_INNER_L3_IPV4,
> RTE_PTYPE_INNER_L3_IPV6, RTE_PTYPE_INNER_L4_TCP, RTE_PTYPE_INNER_L4_UDP,
> RTE_PTYPE_INNER_L4_SCTP.
> 
> For example, let's suppose we configured the Rx queue with the
> following segments:
>      seg0 - pool0, proto_hdr0=RTE_PTYPE_L3_IPV4, off0=2B
>      seg1 - pool1, proto_hdr1=RTE_PTYPE_L4_UDP, off1=128B
>      seg2 - pool2, off1=0B
> 
> The packet consists of MAC_IPV4_UDP_PAYLOAD will be split like
> following:
>      seg0 - ipv4 header @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>      seg1 - udp header @ 128 in mbuf from pool1
>      seg2 - payload @ 0 in mbuf from pool2

It must be defined how ICMPv4 packets will be split in such a case,
and how UDP over IPv6 will be split.
> 
> Now buffet split can be configured in two modes. For length based
> buffer split, the mp, length, offset field in Rx packet segment should
> be configured, while the proto_hdr field should not be configured.
> For protocol header based buffer split, the mp, offset, proto_hdr field
> in Rx packet segment should be configured, while the length field should
> not be configured.
> 
> The split limitations imposed by underlying PMD is reported in the
> rte_eth_dev_info->rx_seg_capa field. The memory attributes for the split
> parts may differ either, dpdk memory and external memory, respectively.
> 
> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
> Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
> Signed-off-by: Wenxuan Wu <wenxuanx.wu@intel.com>
> Reviewed-by: Qi Zhang <qi.z.zhang@intel.com>
> Acked-by: Ray Kinsella <mdr@ashroe.eu>
> ---
>   lib/ethdev/rte_ethdev.c | 40 +++++++++++++++++++++++++++++++++-------
>   lib/ethdev/rte_ethdev.h | 28 +++++++++++++++++++++++++++-
>   2 files changed, 60 insertions(+), 8 deletions(-)
> 
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 29a3d80466..fbd55cdd9d 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -1661,6 +1661,7 @@ rte_eth_rx_queue_check_split(const struct rte_eth_rxseg_split *rx_seg,
>   		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
>   		uint32_t length = rx_seg[seg_idx].length;
>   		uint32_t offset = rx_seg[seg_idx].offset;
> +		uint32_t proto_hdr = rx_seg[seg_idx].proto_hdr;
>   
>   		if (mpl == NULL) {
>   			RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
> @@ -1694,13 +1695,38 @@ rte_eth_rx_queue_check_split(const struct rte_eth_rxseg_split *rx_seg,
>   		}
>   		offset += seg_idx != 0 ? 0 : RTE_PKTMBUF_HEADROOM;
>   		*mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> -		length = length != 0 ? length : *mbp_buf_size;
> -		if (*mbp_buf_size < length + offset) {
> -			RTE_ETHDEV_LOG(ERR,
> -				       "%s mbuf_data_room_size %u < %u (segment length=%u + segment offset=%u)\n",
> -				       mpl->name, *mbp_buf_size,
> -				       length + offset, length, offset);
> -			return -EINVAL;
> +		if (proto_hdr == RTE_PTYPE_UNKNOWN) {
> +			/* Split at fixed length. */
> +			length = length != 0 ? length : *mbp_buf_size;
> +			if (*mbp_buf_size < length + offset) {
> +				RTE_ETHDEV_LOG(ERR,
> +					"%s mbuf_data_room_size %u < %u (segment length=%u + segment offset=%u)\n",
> +					mpl->name, *mbp_buf_size,
> +					length + offset, length, offset);
> +				return -EINVAL;
> +			}
> +		} else {
> +			/* Split after specified protocol header. */
> +			if (!(proto_hdr & RTE_BUFFER_SPLIT_PROTO_HDR_MASK)) {

The condition looks suspicious. It will be true if proto_hdr has no
single bit from the mask. I guess it is not the intent.
I guess the condition should be
   proto_hdr & ~RTE_BUFFER_SPLIT_PROTO_HDR_MASK
i.e. there are unsupported bits in proto_hdr

IMHO we need an extra field in dev_info to report supported protocols to
split on. Or a new API to get an array, similar to ptype get.
Maybe a new API is a better choice, to not overload dev_info and to
be more flexible in reporting.
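The reviewer's point about the mask condition can be illustrated with a small self-contained check (the bit values below are illustrative stand-ins, not the real RTE_PTYPE_* encodings):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in bits, not real RTE_PTYPE_* values. */
#define PT_L3_IPV4  0x1u
#define PT_L4_UDP   0x2u
#define SPLIT_MASK  (PT_L3_IPV4 | PT_L4_UDP)
#define PT_BOGUS    0x80000000u   /* a bit outside the supported mask */

/* Patch's condition: accepts as long as ANY supported bit is present,
 * so an unsupported bit combined with a supported one slips through. */
static int patch_check_ok(uint32_t proto_hdr)
{
	return (proto_hdr & SPLIT_MASK) != 0;
}

/* Reviewer's suggested condition: rejects ANY unsupported bit. */
static int strict_check_ok(uint32_t proto_hdr)
{
	return (proto_hdr & ~SPLIT_MASK) == 0;
}
```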

> +				RTE_ETHDEV_LOG(ERR,
> +					"Protocol header %u not supported)\n",
> +					proto_hdr);

I think it would be useful to log unsupported bits only, if we say so.

> +				return -EINVAL;
> +			}
> +
> +			if (length != 0) {
> +				RTE_ETHDEV_LOG(ERR, "segment length should be set to zero in protocol header "
> +					       "based buffer split\n");
> +				return -EINVAL;
> +			}
> +
> +			if (*mbp_buf_size < offset) {
> +				RTE_ETHDEV_LOG(ERR,
> +						"%s mbuf_data_room_size %u < %u segment offset)\n",
> +						mpl->name, *mbp_buf_size,
> +						offset);
> +				return -EINVAL;
> +			}
>   		}
>   	}
>   	return 0;
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04cff8ee10..0cd9dd6cc0 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1187,6 +1187,9 @@ struct rte_eth_txmode {
>    *   mbuf) the following data will be pushed to the next segment
>    *   up to its own length, and so on.
>    *
> + * - The proto_hdrs in the elements define the split position of
> + *   received packets.
> + *
>    * - If the length in the segment description element is zero
>    *   the actual buffer size will be deduced from the appropriate
>    *   memory pool properties.
> @@ -1197,14 +1200,37 @@ struct rte_eth_txmode {
>    *     - pool from the last valid element
>    *     - the buffer size from this pool
>    *     - zero offset
> + *
> + * - Length based buffer split:
> + *     - mp, length, offset should be configured.
> + *     - The proto_hdr field should not be configured.
> + *
> + * - Protocol header based buffer split:
> + *     - mp, offset, proto_hdr should be configured.
> + *     - The length field should not be configured.
>    */
>   struct rte_eth_rxseg_split {
>   	struct rte_mempool *mp; /**< Memory pool to allocate segment from. */
>   	uint16_t length; /**< Segment data length, configures split point. */
>   	uint16_t offset; /**< Data offset from beginning of mbuf data buffer. */
> -	uint32_t reserved; /**< Reserved field. */
> +	uint32_t proto_hdr; /**< Inner/outer L2/L3/L4 protocol header, configures split point. */
>   };
>   
> +/* Buffer split protocol header capability. */
> +#define RTE_BUFFER_SPLIT_PROTO_HDR_MASK ( \
> +	RTE_PTYPE_L2_ETHER | \
> +	RTE_PTYPE_L3_IPV4 | \
> +	RTE_PTYPE_L3_IPV6 | \
> +	RTE_PTYPE_L4_TCP | \
> +	RTE_PTYPE_L4_UDP | \
> +	RTE_PTYPE_L4_SCTP | \
> +	RTE_PTYPE_INNER_L2_ETHER | \
> +	RTE_PTYPE_INNER_L3_IPV4 | \
> +	RTE_PTYPE_INNER_L3_IPV6 | \
> +	RTE_PTYPE_INNER_L4_TCP | \
> +	RTE_PTYPE_INNER_L4_UDP | \
> +	RTE_PTYPE_INNER_L4_SCTP)
> +
>   /**
>    * @warning
>    * @b EXPERIMENTAL: this structure may change without prior notice.
  
Ding, Xuan June 3, 2022, 4:30 p.m. UTC | #2
Hi Andrew,

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Thursday, June 2, 2022 9:21 PM
> To: Wu, WenxuanX <wenxuanx.wu@intel.com>; thomas@monjalon.net; Li,
> Xiaoyun <xiaoyun.li@intel.com>; ferruh.yigit@xilinx.com; Singh, Aman Deep
> <aman.deep.singh@intel.com>; dev@dpdk.org; Zhang, Yuying
> <yuying.zhang@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> jerinjacobk@gmail.com
> Cc: stephen@networkplumber.org; Ding, Xuan <xuan.ding@intel.com>;
> Wang, YuanX <yuanx.wang@intel.com>; Ray Kinsella <mdr@ashroe.eu>
> Subject: Re: [PATCH v8 1/3] ethdev: introduce protocol hdr based buffer split
> 
> Is it the right one since it is listed in patchwork?

Yes, it is.

> 
> On 6/1/22 16:50, wenxuanx.wu@intel.com wrote:
> > From: Wenxuan Wu <wenxuanx.wu@intel.com>
> >
> > Currently, Rx buffer split supports length based split. With Rx queue
> > offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT enabled and Rx packet
> segment
> > configured, PMD will be able to split the received packets into
> > multiple segments.
> >
> > However, length based buffer split is not suitable for NICs that do
> > split based on protocol headers. Given a arbitrarily variable length
> > in Rx packet
> 
> a -> an

Thanks for your catch, will fix it in next version.

> 
> > segment, it is almost impossible to pass a fixed protocol header to PMD.
> > Besides, the existence of tunneling results in the composition of a
> > packet is various, which makes the situation even worse.
> >
> > This patch extends current buffer split to support protocol header
> > based buffer split. A new proto_hdr field is introduced in the
> > reserved field of rte_eth_rxseg_split structure to specify protocol
> > header. The proto_hdr field defines the split position of packet,
> > splitting will always happens after the protocol header defined in the
> > Rx packet segment. When Rx queue offload
> > RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is enabled and corresponding
> protocol
> > header is configured, PMD will split the ingress packets into multiple
> segments.
> >
> > struct rte_eth_rxseg_split {
> >
> >          struct rte_mempool *mp; /* memory pools to allocate segment from
> */
> >          uint16_t length; /* segment maximal data length,
> >                              configures "split point" */
> >          uint16_t offset; /* data offset from beginning
> >                              of mbuf data buffer */
> >          uint32_t proto_hdr; /* inner/outer L2/L3/L4 protocol header,
> > 			       configures "split point" */
> >      };
> >
> > Both inner and outer L2/L3/L4 level protocol header split can be supported.
> > Corresponding protocol header capability is RTE_PTYPE_L2_ETHER,
> > RTE_PTYPE_L3_IPV4, RTE_PTYPE_L3_IPV6, RTE_PTYPE_L4_TCP,
> > RTE_PTYPE_L4_UDP, RTE_PTYPE_L4_SCTP, RTE_PTYPE_INNER_L2_ETHER,
> > RTE_PTYPE_INNER_L3_IPV4, RTE_PTYPE_INNER_L3_IPV6,
> > RTE_PTYPE_INNER_L4_TCP, RTE_PTYPE_INNER_L4_UDP,
> RTE_PTYPE_INNER_L4_SCTP.
> >
> > For example, let's suppose we configured the Rx queue with the
> > following segments:
> >      seg0 - pool0, proto_hdr0=RTE_PTYPE_L3_IPV4, off0=2B
> >      seg1 - pool1, proto_hdr1=RTE_PTYPE_L4_UDP, off1=128B
> >      seg2 - pool2, off1=0B
> >
> > The packet consists of MAC_IPV4_UDP_PAYLOAD will be split like
> > following:
> >      seg0 - ipv4 header @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from
> pool0
> >      seg1 - udp header @ 128 in mbuf from pool1
> >      seg2 - payload @ 0 in mbuf from pool2
> 
> It must be defined how ICMPv4 packets will be split in such case.
> And how UDP over IPv6 will be split.

The ICMP header type is missing; I will define the expected split behavior and
add it in the next version, thanks for your catch.

In fact, protocol header based buffer split depends on the driver's parsing result.
As long as the driver can recognize the packet type, I think there is no difference
between UDP over IPv4 and UDP over IPv6?

> >
> > Now buffet split can be configured in two modes. For length based
> > buffer split, the mp, length, offset field in Rx packet segment should
> > be configured, while the proto_hdr field should not be configured.
> > For protocol header based buffer split, the mp, offset, proto_hdr
> > field in Rx packet segment should be configured, while the length
> > field should not be configured.
> >
> > The split limitations imposed by underlying PMD is reported in the
> > rte_eth_dev_info->rx_seg_capa field. The memory attributes for the
> > split parts may differ either, dpdk memory and external memory,
> respectively.
> >
> > Signed-off-by: Xuan Ding <xuan.ding@intel.com>
> > Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
> > Signed-off-by: Wenxuan Wu <wenxuanx.wu@intel.com>
> > Reviewed-by: Qi Zhang <qi.z.zhang@intel.com>
> > Acked-by: Ray Kinsella <mdr@ashroe.eu>
> > ---
> >   lib/ethdev/rte_ethdev.c | 40 +++++++++++++++++++++++++++++++++-------
> >   lib/ethdev/rte_ethdev.h | 28 +++++++++++++++++++++++++++-
> >   2 files changed, 60 insertions(+), 8 deletions(-)
> >
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 29a3d80466..fbd55cdd9d 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -1661,6 +1661,7 @@ rte_eth_rx_queue_check_split(const struct
> rte_eth_rxseg_split *rx_seg,
> >   		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
> >   		uint32_t length = rx_seg[seg_idx].length;
> >   		uint32_t offset = rx_seg[seg_idx].offset;
> > +		uint32_t proto_hdr = rx_seg[seg_idx].proto_hdr;
> >
> >   		if (mpl == NULL) {
> >   			RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
> @@ -1694,13
> > +1695,38 @@ rte_eth_rx_queue_check_split(const struct
> rte_eth_rxseg_split *rx_seg,
> >   		}
> >   		offset += seg_idx != 0 ? 0 : RTE_PKTMBUF_HEADROOM;
> >   		*mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> > -		length = length != 0 ? length : *mbp_buf_size;
> > -		if (*mbp_buf_size < length + offset) {
> > -			RTE_ETHDEV_LOG(ERR,
> > -				       "%s mbuf_data_room_size %u < %u
> (segment length=%u + segment offset=%u)\n",
> > -				       mpl->name, *mbp_buf_size,
> > -				       length + offset, length, offset);
> > -			return -EINVAL;
> > +		if (proto_hdr == RTE_PTYPE_UNKNOWN) {
> > +			/* Split at fixed length. */
> > +			length = length != 0 ? length : *mbp_buf_size;
> > +			if (*mbp_buf_size < length + offset) {
> > +				RTE_ETHDEV_LOG(ERR,
> > +					"%s mbuf_data_room_size %u < %u
> (segment length=%u + segment offset=%u)\n",
> > +					mpl->name, *mbp_buf_size,
> > +					length + offset, length, offset);
> > +				return -EINVAL;
> > +			}
> > +		} else {
> > +			/* Split after specified protocol header. */
> > +			if (!(proto_hdr &
> RTE_BUFFER_SPLIT_PROTO_HDR_MASK)) {
> 
> The condition looks suspicious. It will be true if proto_hdr has no single bit
> from the mask. I guess it is not the intent.

Actually, it is the intent... Here the mask is used to check whether proto_hdr
belongs to the inner/outer L2/L3/L4 capability we defined. Which proto_hdr
values are supported by the NIC will be checked in the PMD later.

> I guess the condition should be
>    proto_hdr & ~RTE_BUFFER_SPLIT_PROTO_HDR_MASK i.e. there is
> unsupported bits in proto_hdr
> 
> IMHO we need extra field in dev_info to report supported protocols to split
> on. Or a new API to get an array similar to ptype get.
> May be a new API is a better choice to not overload dev_info and to be more
> flexible in reporting.

Thanks for your suggestion.
Here I hope to confirm the intent of the dev_info field or API exposing the driver's supported proto_hdr values.
Is it for the proto_hdr check in rte_eth_rx_queue_check_split()?
If so, could we just check in the lib whether the configured proto_hdrs belong to L2/L3/L4,
and check the capability in the PMD? This is what the current design does.

Actually, I have another question: do we need an API or dev_info field to expose which buffer split mode
the driver supports, i.e. length based or proto_hdr based? Because they require different fields
to be configured in the Rx packet segment.

Hope to get your insights. :)

> 
> > +				RTE_ETHDEV_LOG(ERR,
> > +					"Protocol header %u not
> supported)\n",
> > +					proto_hdr);
> 
> I think it would be useful to log unsupported bits only, if we say so.

The same as above.
Thanks again for your time.

Regards,
Xuan

> 
> > +				return -EINVAL;
> > +			}
> > +
> > +			if (length != 0) {
> > +				RTE_ETHDEV_LOG(ERR, "segment length
> should be set to zero in protocol header "
> > +					       "based buffer split\n");
> > +				return -EINVAL;
> > +			}
> > +
> > +			if (*mbp_buf_size < offset) {
> > +				RTE_ETHDEV_LOG(ERR,
> > +						"%s
> mbuf_data_room_size %u < %u segment offset)\n",
> > +						mpl->name, *mbp_buf_size,
> > +						offset);
> > +				return -EINVAL;
> > +			}
> >   		}
> >   	}
> >   	return 0;
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04cff8ee10..0cd9dd6cc0 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1187,6 +1187,9 @@ struct rte_eth_txmode {
> >    *   mbuf) the following data will be pushed to the next segment
> >    *   up to its own length, and so on.
> >    *
> > + * - The proto_hdrs in the elements define the split position of
> > + *   received packets.
> > + *
> >    * - If the length in the segment description element is zero
> >    *   the actual buffer size will be deduced from the appropriate
> >    *   memory pool properties.
> > @@ -1197,14 +1200,37 @@ struct rte_eth_txmode {
> >    *     - pool from the last valid element
> >    *     - the buffer size from this pool
> >    *     - zero offset
> > + *
> > + * - Length based buffer split:
> > + *     - mp, length, offset should be configured.
> > + *     - The proto_hdr field should not be configured.
> > + *
> > + * - Protocol header based buffer split:
> > + *     - mp, offset, proto_hdr should be configured.
> > + *     - The length field should not be configured.
> >    */
> >   struct rte_eth_rxseg_split {
> >   	struct rte_mempool *mp; /**< Memory pool to allocate segment
> from. */
> >   	uint16_t length; /**< Segment data length, configures split point. */
> >   	uint16_t offset; /**< Data offset from beginning of mbuf data buffer.
> */
> > -	uint32_t reserved; /**< Reserved field. */
> > +	uint32_t proto_hdr; /**< Inner/outer L2/L3/L4 protocol header,
> > +configures split point. */
> >   };
> >
> > +/* Buffer split protocol header capability. */ #define
> > +RTE_BUFFER_SPLIT_PROTO_HDR_MASK ( \
> > +	RTE_PTYPE_L2_ETHER | \
> > +	RTE_PTYPE_L3_IPV4 | \
> > +	RTE_PTYPE_L3_IPV6 | \
> > +	RTE_PTYPE_L4_TCP | \
> > +	RTE_PTYPE_L4_UDP | \
> > +	RTE_PTYPE_L4_SCTP | \
> > +	RTE_PTYPE_INNER_L2_ETHER | \
> > +	RTE_PTYPE_INNER_L3_IPV4 | \
> > +	RTE_PTYPE_INNER_L3_IPV6 | \
> > +	RTE_PTYPE_INNER_L4_TCP | \
> > +	RTE_PTYPE_INNER_L4_UDP | \
> > +	RTE_PTYPE_INNER_L4_SCTP)
> > +
> >   /**
> >    * @warning
> >    * @b EXPERIMENTAL: this structure may change without prior notice.
  
Andrew Rybchenko June 4, 2022, 2:25 p.m. UTC | #3
On 6/3/22 19:30, Ding, Xuan wrote:
> Hi Andrew,
> 
>> -----Original Message-----
>> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>> Sent: Thursday, June 2, 2022 9:21 PM
>> To: Wu, WenxuanX <wenxuanx.wu@intel.com>; thomas@monjalon.net; Li,
>> Xiaoyun <xiaoyun.li@intel.com>; ferruh.yigit@xilinx.com; Singh, Aman Deep
>> <aman.deep.singh@intel.com>; dev@dpdk.org; Zhang, Yuying
>> <yuying.zhang@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
>> jerinjacobk@gmail.com
>> Cc: stephen@networkplumber.org; Ding, Xuan <xuan.ding@intel.com>;
>> Wang, YuanX <yuanx.wang@intel.com>; Ray Kinsella <mdr@ashroe.eu>
>> Subject: Re: [PATCH v8 1/3] ethdev: introduce protocol hdr based buffer split
>>
>> Is it the right one since it is listed in patchwork?
> 
> Yes, it is.
> 
>>
>> On 6/1/22 16:50, wenxuanx.wu@intel.com wrote:
>>> From: Wenxuan Wu <wenxuanx.wu@intel.com>
>>>
>>> Currently, Rx buffer split supports length based split. With Rx queue
>>> offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT enabled and Rx packet
>> segment
>>> configured, PMD will be able to split the received packets into
>>> multiple segments.
>>>
>>> However, length based buffer split is not suitable for NICs that do
>>> split based on protocol headers. Given a arbitrarily variable length
>>> in Rx packet
>>
>> a -> an
> 
> Thanks for your catch, will fix it in next version.
> 
>>
>>> segment, it is almost impossible to pass a fixed protocol header to PMD.
>>> Besides, the existence of tunneling results in the composition of a
>>> packet is various, which makes the situation even worse.
>>>
>>> This patch extends current buffer split to support protocol header
>>> based buffer split. A new proto_hdr field is introduced in the
>>> reserved field of rte_eth_rxseg_split structure to specify protocol
>>> header. The proto_hdr field defines the split position of packet,
>>> splitting will always happens after the protocol header defined in the
>>> Rx packet segment. When Rx queue offload
>>> RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is enabled and corresponding
>> protocol
>>> header is configured, PMD will split the ingress packets into multiple
>> segments.
>>>
>>> struct rte_eth_rxseg_split {
>>>
>>>           struct rte_mempool *mp; /* memory pools to allocate segment from
>> */
>>>           uint16_t length; /* segment maximal data length,
>>>                               configures "split point" */
>>>           uint16_t offset; /* data offset from beginning
>>>                               of mbuf data buffer */
>>>           uint32_t proto_hdr; /* inner/outer L2/L3/L4 protocol header,
>>> 			       configures "split point" */
>>>       };
>>>
>>> Both inner and outer L2/L3/L4 level protocol header split can be supported.
>>> Corresponding protocol header capability is RTE_PTYPE_L2_ETHER,
>>> RTE_PTYPE_L3_IPV4, RTE_PTYPE_L3_IPV6, RTE_PTYPE_L4_TCP,
>>> RTE_PTYPE_L4_UDP, RTE_PTYPE_L4_SCTP, RTE_PTYPE_INNER_L2_ETHER,
>>> RTE_PTYPE_INNER_L3_IPV4, RTE_PTYPE_INNER_L3_IPV6,
>>> RTE_PTYPE_INNER_L4_TCP, RTE_PTYPE_INNER_L4_UDP,
>> RTE_PTYPE_INNER_L4_SCTP.
>>>
>>> For example, let's suppose we configured the Rx queue with the
>>> following segments:
>>>       seg0 - pool0, proto_hdr0=RTE_PTYPE_L3_IPV4, off0=2B
>>>       seg1 - pool1, proto_hdr1=RTE_PTYPE_L4_UDP, off1=128B
>>>       seg2 - pool2, off1=0B
>>>
>>> The packet consists of MAC_IPV4_UDP_PAYLOAD will be split like
>>> following:
>>>       seg0 - ipv4 header @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from
>> pool0
>>>       seg1 - udp header @ 128 in mbuf from pool1
>>>       seg2 - payload @ 0 in mbuf from pool2
>>
>> It must be defined how ICMPv4 packets will be split in such case.
>> And how UDP over IPv6 will be split.
> 
> The ICMP header type is missed, I will define the expected split behavior and
> add it in next version, thanks for your catch.
> 
> In fact, the buffer split based on protocol header depends on the driver parsing result.
> As long as driver can recognize this packet type, I think there is no difference between
> UDP over IPV4 and UDP over IPV6?

We can bind it to ptypes recognized by the HW+driver, but I can
easily imagine a case where the HW has no means to report the recognized
packet type (i.e. ptype get returns an empty list), but could still
split on it.
Also, nobody guarantees that there is no difference between UDP over IPv4
and IPv6 recognition and split. IPv6 can have a number of extension
headers which may be non-trivial to hop over in HW. So, HW could
recognize IPv6, but not the protocols after it.
It is also a very interesting question how to define a protocol split
for IPv6 plus extension headers. Where to stop?

> 
>>>
>>> Now buffet split can be configured in two modes. For length based
>>> buffer split, the mp, length, offset field in Rx packet segment should
>>> be configured, while the proto_hdr field should not be configured.
>>> For protocol header based buffer split, the mp, offset, proto_hdr
>>> field in Rx packet segment should be configured, while the length
>>> field should not be configured.
>>>
>>> The split limitations imposed by underlying PMD is reported in the
>>> rte_eth_dev_info->rx_seg_capa field. The memory attributes for the
>>> split parts may differ either, dpdk memory and external memory,
>> respectively.
>>>
>>> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
>>> Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
>>> Signed-off-by: Wenxuan Wu <wenxuanx.wu@intel.com>
>>> Reviewed-by: Qi Zhang <qi.z.zhang@intel.com>
>>> Acked-by: Ray Kinsella <mdr@ashroe.eu>
>>> ---
>>>    lib/ethdev/rte_ethdev.c | 40 +++++++++++++++++++++++++++++++++-------
>>>    lib/ethdev/rte_ethdev.h | 28 +++++++++++++++++++++++++++-
>>>    2 files changed, 60 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
>>> 29a3d80466..fbd55cdd9d 100644
>>> --- a/lib/ethdev/rte_ethdev.c
>>> +++ b/lib/ethdev/rte_ethdev.c
>>> @@ -1661,6 +1661,7 @@ rte_eth_rx_queue_check_split(const struct
>> rte_eth_rxseg_split *rx_seg,
>>>    		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
>>>    		uint32_t length = rx_seg[seg_idx].length;
>>>    		uint32_t offset = rx_seg[seg_idx].offset;
>>> +		uint32_t proto_hdr = rx_seg[seg_idx].proto_hdr;
>>>
>>>    		if (mpl == NULL) {
>>>    			RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
>> @@ -1694,13
>>> +1695,38 @@ rte_eth_rx_queue_check_split(const struct
>> rte_eth_rxseg_split *rx_seg,
>>>    		}
>>>    		offset += seg_idx != 0 ? 0 : RTE_PKTMBUF_HEADROOM;
>>>    		*mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
>>> -		length = length != 0 ? length : *mbp_buf_size;
>>> -		if (*mbp_buf_size < length + offset) {
>>> -			RTE_ETHDEV_LOG(ERR,
>>> -				       "%s mbuf_data_room_size %u < %u
>> (segment length=%u + segment offset=%u)\n",
>>> -				       mpl->name, *mbp_buf_size,
>>> -				       length + offset, length, offset);
>>> -			return -EINVAL;
>>> +		if (proto_hdr == RTE_PTYPE_UNKNOWN) {
>>> +			/* Split at fixed length. */
>>> +			length = length != 0 ? length : *mbp_buf_size;
>>> +			if (*mbp_buf_size < length + offset) {
>>> +				RTE_ETHDEV_LOG(ERR,
>>> +					"%s mbuf_data_room_size %u < %u
>> (segment length=%u + segment offset=%u)\n",
>>> +					mpl->name, *mbp_buf_size,
>>> +					length + offset, length, offset);
>>> +				return -EINVAL;
>>> +			}
>>> +		} else {
>>> +			/* Split after specified protocol header. */
>>> +			if (!(proto_hdr &
>> RTE_BUFFER_SPLIT_PROTO_HDR_MASK)) {
>>
>> The condition looks suspicious. It will be true if proto_hdr has no single bit
>> from the mask. I guess it is not the intent.
> 
> Actually it is the intent... Here the mask is used to check if proto_hdr
> belongs to the inner/outer L2/L3/L4 capability we defined. And which
> proto_hdr is supported by the NIC will be checked in the PMD later.

Frankly speaking, I see no value in such an incomplete check if
we still rely on the driver. I simply see no reason to oblige the
driver to support one of these protocols.

> 
>> I guess the condition should be
>>     proto_hdr & ~RTE_BUFFER_SPLIT_PROTO_HDR_MASK i.e. there is
>> unsupported bits in proto_hdr
>>
>> IMHO we need extra field in dev_info to report supported protocols to split
>> on. Or a new API to get an array similar to ptype get.
>> May be a new API is a better choice to not overload dev_info and to be more
>> flexible in reporting.
> 
> Thanks for your suggestion.
> Here I hope to confirm the intent of the dev_info or API to expose the supported proto_hdr of the driver.
> Is it for the proto_hdr check in rte_eth_rx_queue_check_split()?
> If so, could we just check whether the configured proto_hdrs belong to L2/L3/L4 in the lib, and check the
> capability in the PMD? This is what the current design does.

Look. Application needs to know what to expect from eth device.
It should know which protocols it can split on. Of course we can
enforce application to use try-fail approach which would make sense
if we have dedicated API to request Rx buffer split, but since it
is done via Rx queue configuration, it could be tricky for the application
to realize which part of the configuration is wrong. It could simply
result in too many retries with different configurations.

I.e. the information should be used by ethdev to validate the request and
the information should be used by the application to understand what is
supported.

> 
> Actually I have another question: do we need an API or dev_info to expose which buffer split the driver supports,
> i.e. length based or proto_hdr based? Because they require different fields to be configured
> in the Rx packet segment.

See above. If dedicated API return -ENOTSUP or empty set of supported
protocols to split on, the answer is clear.

> 
> Hope to get your insights. :)
> 
>>
>>> +				RTE_ETHDEV_LOG(ERR,
>>> +					"Protocol header %u not
>> supported)\n",
>>> +					proto_hdr);
>>
>> I think it would be useful to log unsupported bits only, if we say so.
> 
> The same as above.
> Thanks again for your time.
> 
> Regards,
> Xuan
  
Ding, Xuan June 7, 2022, 10:13 a.m. UTC | #4
Hi Andrew,

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Saturday, June 4, 2022 10:26 PM
> To: Ding, Xuan <xuan.ding@intel.com>; Wu, WenxuanX
> <wenxuanx.wu@intel.com>; thomas@monjalon.net; Li, Xiaoyun
> <xiaoyun.li@intel.com>; ferruh.yigit@xilinx.com; Singh, Aman Deep
> <aman.deep.singh@intel.com>; dev@dpdk.org; Zhang, Yuying
> <yuying.zhang@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> jerinjacobk@gmail.com
> Cc: stephen@networkplumber.org; Wang, YuanX <yuanx.wang@intel.com>;
> Ray Kinsella <mdr@ashroe.eu>
> Subject: Re: [PATCH v8 1/3] ethdev: introduce protocol hdr based buffer split
> 
> On 6/3/22 19:30, Ding, Xuan wrote:
> > Hi Andrew,
> >
> >> -----Original Message-----
> >> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> >> Sent: Thursday, June 2, 2022 9:21 PM
> >> To: Wu, WenxuanX <wenxuanx.wu@intel.com>; thomas@monjalon.net;
> Li,
> >> Xiaoyun <xiaoyun.li@intel.com>; ferruh.yigit@xilinx.com; Singh, Aman
> >> Deep <aman.deep.singh@intel.com>; dev@dpdk.org; Zhang, Yuying
> >> <yuying.zhang@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> >> jerinjacobk@gmail.com
> >> Cc: stephen@networkplumber.org; Ding, Xuan <xuan.ding@intel.com>;
> >> Wang, YuanX <yuanx.wang@intel.com>; Ray Kinsella <mdr@ashroe.eu>
> >> Subject: Re: [PATCH v8 1/3] ethdev: introduce protocol hdr based
> >> buffer split
> >>
> >> Is it the right one since it is listed in patchwork?
> >
> > Yes, it is.
> >
> >>
> >> On 6/1/22 16:50, wenxuanx.wu@intel.com wrote:
> >>> From: Wenxuan Wu <wenxuanx.wu@intel.com>
> >>>
> >>> Currently, Rx buffer split supports length based split. With Rx
> >>> queue offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT enabled and Rx
> packet
> >> segment
> >>> configured, PMD will be able to split the received packets into
> >>> multiple segments.
> >>>
> >>> However, length based buffer split is not suitable for NICs that do
> >>> split based on protocol headers. Given a arbitrarily variable length
> >>> in Rx packet
> >>
> >> a -> an
> >
> > Thanks for your catch, will fix it in next version.
> >
> >>
> >>> segment, it is almost impossible to pass a fixed protocol header to PMD.
> >>> Besides, the existence of tunneling results in the composition of a
> >>> packet is various, which makes the situation even worse.
> >>>
> >>> This patch extends current buffer split to support protocol header
> >>> based buffer split. A new proto_hdr field is introduced in the
> >>> reserved field of rte_eth_rxseg_split structure to specify protocol
> >>> header. The proto_hdr field defines the split position of packet,
> >>> splitting will always happens after the protocol header defined in
> >>> the Rx packet segment. When Rx queue offload
> >>> RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is enabled and corresponding
> >> protocol
> >>> header is configured, PMD will split the ingress packets into
> >>> multiple
> >> segments.
> >>>
> >>> struct rte_eth_rxseg_split {
> >>>
> >>>           struct rte_mempool *mp; /* memory pools to allocate
> >>> segment from
> >> */
> >>>           uint16_t length; /* segment maximal data length,
> >>>                               configures "split point" */
> >>>           uint16_t offset; /* data offset from beginning
> >>>                               of mbuf data buffer */
> >>>           uint32_t proto_hdr; /* inner/outer L2/L3/L4 protocol header,
> >>> 			       configures "split point" */
> >>>       };
> >>>
> >>> Both inner and outer L2/L3/L4 level protocol header split can be
> supported.
> >>> Corresponding protocol header capability is RTE_PTYPE_L2_ETHER,
> >>> RTE_PTYPE_L3_IPV4, RTE_PTYPE_L3_IPV6, RTE_PTYPE_L4_TCP,
> >>> RTE_PTYPE_L4_UDP, RTE_PTYPE_L4_SCTP, RTE_PTYPE_INNER_L2_ETHER,
> >>> RTE_PTYPE_INNER_L3_IPV4, RTE_PTYPE_INNER_L3_IPV6,
> >>> RTE_PTYPE_INNER_L4_TCP, RTE_PTYPE_INNER_L4_UDP,
> >> RTE_PTYPE_INNER_L4_SCTP.
> >>>
> >>> For example, let's suppose we configured the Rx queue with the
> >>> following segments:
> >>>       seg0 - pool0, proto_hdr0=RTE_PTYPE_L3_IPV4, off0=2B
> >>>       seg1 - pool1, proto_hdr1=RTE_PTYPE_L4_UDP, off1=128B
> >>>       seg2 - pool2, off1=0B
> >>>
> >>> The packet consists of MAC_IPV4_UDP_PAYLOAD will be split like
> >>> following:
> >>>       seg0 - ipv4 header @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from
> >> pool0
> >>>       seg1 - udp header @ 128 in mbuf from pool1
> >>>       seg2 - payload @ 0 in mbuf from pool2
> >>
> >> It must be defined how ICMPv4 packets will be split in such case.
> >> And how UDP over IPv6 will be split.
> >
> > The ICMP header type is missed, I will define the expected split
> > behavior and add it in next version, thanks for your catch.

I have a question here. Since ICMP packets are mainly used to check
network connectivity, is it necessary for us to split ICMP packets?
And I found there is no RTE_PTYPE for ICMP.

> >
> > In fact, the buffer split based on protocol header depends on the driver
> parsing result.
> > As long as driver can recognize this packet type, I think there is no
> > difference between UDP over IPV4 and UDP over IPV6?
> 
> We can bind it to ptypes recognized by the HW+driver, but I can easily
> imagine the case when HW has no means to report recognized packet type
> (i.e. ptype get returns empty list), but still could split on it.

I get your point. But if one ptype cannot be recognized by the HW+driver, is it still necessary for
us to do the split? The main purpose of buffer split is to separate header and payload. Although we
add split for various protocol headers now, we should focus on the ptypes that can be recognized.

> Also, nobody guarantees that there is no difference in UDP over IPv4 vs
> IPv6 recognition and split. IPv6 could have a number of extension headers
> which could be not that trivial to hop in HW. So, HW could recognize IPv6,
> but not protocols after it.
> Also it is very interesting question how to define protocol split for IPv6 plus
> extension headers. Where to stop?

The extension headers you mentioned are indeed an interesting question.
On our device, the split point would be the end of the extension headers. As
above, the main purpose of buffer split is to separate header and payload.
Even in rte_flow we don't list all of the extension headers, so we can't cope with
all the IPv6 extension headers.

For the IPv6 extension headers, what if we treat the IPv6 header and its extension
headers as one layer? Because 99% of cases will not require a separate extension
header segment.

Hope to get your insights.

> 
> >
> >>>
> >>> Now buffer split can be configured in two modes. For length based
> >>> buffer split, the mp, length, offset field in Rx packet segment
> >>> should be configured, while the proto_hdr field should not be configured.
> >>> For protocol header based buffer split, the mp, offset, proto_hdr
> >>> field in Rx packet segment should be configured, while the length
> >>> field should not be configured.
> >>>
> >>> The split limitations imposed by underlying PMD is reported in the
> >>> rte_eth_dev_info->rx_seg_capa field. The memory attributes for the
> >>> split parts may differ either, dpdk memory and external memory,
> >> respectively.
> >>>
> >>> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
> >>> Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
> >>> Signed-off-by: Wenxuan Wu <wenxuanx.wu@intel.com>
> >>> Reviewed-by: Qi Zhang <qi.z.zhang@intel.com>
> >>> Acked-by: Ray Kinsella <mdr@ashroe.eu>
> >>> ---
> >>>    lib/ethdev/rte_ethdev.c | 40 +++++++++++++++++++++++++++++++++---
> ----
> >>>    lib/ethdev/rte_ethdev.h | 28 +++++++++++++++++++++++++++-
> >>>    2 files changed, 60 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> >>> 29a3d80466..fbd55cdd9d 100644
> >>> --- a/lib/ethdev/rte_ethdev.c
> >>> +++ b/lib/ethdev/rte_ethdev.c
> >>> @@ -1661,6 +1661,7 @@ rte_eth_rx_queue_check_split(const struct
> >> rte_eth_rxseg_split *rx_seg,
> >>>    		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
> >>>    		uint32_t length = rx_seg[seg_idx].length;
> >>>    		uint32_t offset = rx_seg[seg_idx].offset;
> >>> +		uint32_t proto_hdr = rx_seg[seg_idx].proto_hdr;
> >>>
> >>>    		if (mpl == NULL) {
> >>>    			RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
> >> @@ -1694,13
> >>> +1695,38 @@ rte_eth_rx_queue_check_split(const struct
> >> rte_eth_rxseg_split *rx_seg,
> >>>    		}
> >>>    		offset += seg_idx != 0 ? 0 : RTE_PKTMBUF_HEADROOM;
> >>>    		*mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> >>> -		length = length != 0 ? length : *mbp_buf_size;
> >>> -		if (*mbp_buf_size < length + offset) {
> >>> -			RTE_ETHDEV_LOG(ERR,
> >>> -				       "%s mbuf_data_room_size %u < %u
> >> (segment length=%u + segment offset=%u)\n",
> >>> -				       mpl->name, *mbp_buf_size,
> >>> -				       length + offset, length, offset);
> >>> -			return -EINVAL;
> >>> +		if (proto_hdr == RTE_PTYPE_UNKNOWN) {
> >>> +			/* Split at fixed length. */
> >>> +			length = length != 0 ? length : *mbp_buf_size;
> >>> +			if (*mbp_buf_size < length + offset) {
> >>> +				RTE_ETHDEV_LOG(ERR,
> >>> +					"%s mbuf_data_room_size %u < %u
> >> (segment length=%u + segment offset=%u)\n",
> >>> +					mpl->name, *mbp_buf_size,
> >>> +					length + offset, length, offset);
> >>> +				return -EINVAL;
> >>> +			}
> >>> +		} else {
> >>> +			/* Split after specified protocol header. */
> >>> +			if (!(proto_hdr &
> >> RTE_BUFFER_SPLIT_PROTO_HDR_MASK)) {
> >>
> >> The condition looks suspicious. It will be true if proto_hdr has no
> >> single bit from the mask. I guess it is not the intent.
> >
> > Actually it is the intent... Here the mask is used to check if
> > proto_hdr belongs to the inner/outer L2/L3/L4 capability we defined.
> > And which proto_hdr is supported by the NIC will be checked in the PMD
> later.

I need to make a correction here. You are right, I introduced a bug in the previous implementation.

> 
> Frankly speaking I see no value in such incomplete check if we still rely on
> driver. I simply see no reason to oblige the driver to support one of these
> protocols.

With an API, we can get the driver's capabilities first, and do the check in ethdev later.
In this way, we can finish the checks at once. Please see v9.

> 
> >
> >> I guess the condition should be
> >>     proto_hdr & ~RTE_BUFFER_SPLIT_PROTO_HDR_MASK i.e. there is
> >> unsupported bits in proto_hdr
> >>
> >> IMHO we need extra field in dev_info to report supported protocols to
> >> split on. Or a new API to get an array similar to ptype get.
> >> May be a new API is a better choice to not overload dev_info and to
> >> be more flexible in reporting.
> >
> > Thanks for your suggestion.
> > Here I hope to confirm the intent of dev_info or API to expose the
> supported proto_hdr of driver.
> > Is it for the pro_hdr check in the rte_eth_rx_queue_check_split()?
> > If so, could we just check whether pro_hdrs configured belongs to
> > L2/L3/L4 in lib, and check the capability in PMD? This is what the current
> design does.
> 
> Look. Application needs to know what to expect from eth device.
> It should know which protocols it can split on. Of course we can enforce
> application to use try-fail approach which would make sense if we have
> dedicated API to request Rx buffer split, but since it is done via Rx queue
> configuration, it could be tricky for application to realize which part of the
> configuration is wrong. It could simply result in too many retries with
> different configurations.

Agree. To avoid the unnecessary try-fails, I will add a new API in dev_ops,
please see v9.

> 
> I.e. the information should be used by ethdev to validate request and the
> information should be used by the application to understand what is
> supported.
> 
> >
> > Actually I have another question: do we need an API or dev_info to expose
> which buffer split the driver supports.
> > i.e. length based or proto_hdr based. Because it requires different
> > fields to be configured in RX packet segment.
> 
> See above. If dedicated API return -ENOTSUP or empty set of supported
> protocols to split on, the answer is clear.

I get your point. I totally agree with your idea of a new API.
Through the API, the logic will look like:
ret = rte_get_supported_buffer_split_proto();
	if (ret == -ENOTSUP)
		Check length based buffer split
	else
		Check proto based buffer split

We don't need to care about the irrelevant fields of the respective
buffer split mode anymore.

BTW, could you help to review the deprecation notice for header split?
If it gets acked, I will start the deprecation in 22.11.

Thanks,
Xuan

> 
> >
> > Hope to get your insights. :)
> >
> >>
> >>> +				RTE_ETHDEV_LOG(ERR,
> >>> +					"Protocol header %u not
> >> supported)\n",
> >>> +					proto_hdr);
> >>
> >> I think it would be useful to log unsupported bits only, if we say so.
> >
> > The same as above.
> > Thanks again for your time.
> >
> > Regards,
> > Xuan
  
Andrew Rybchenko June 7, 2022, 10:48 a.m. UTC | #5
On 6/7/22 13:13, Ding, Xuan wrote:
> Hi Andrew,
> 
>>
>> On 6/3/22 19:30, Ding, Xuan wrote:
>>> Hi Andrew,
>>>
>>>>
>>>> Is it the right one since it is listed in patchwork?
>>>
>>> Yes, it is.
>>>
>>>>
>>>> On 6/1/22 16:50, wenxuanx.wu@intel.com wrote:
>>>>> From: Wenxuan Wu <wenxuanx.wu@intel.com>
>>>>>
>>>>> Currently, Rx buffer split supports length based split. With Rx
>>>>> queue offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT enabled and Rx
>> packet
>>>> segment
>>>>> configured, PMD will be able to split the received packets into
>>>>> multiple segments.
>>>>>
>>>>> However, length based buffer split is not suitable for NICs that do
>>>>> split based on protocol headers. Given a arbitrarily variable length
>>>>> in Rx packet
>>>>
>>>> a -> an
>>>
>>> Thanks for your catch, will fix it in next version.
>>>
>>>>
>>>>> segment, it is almost impossible to pass a fixed protocol header to PMD.
>>>>> Besides, the existence of tunneling results in the composition of a
>>>>> packet is various, which makes the situation even worse.
>>>>>
>>>>> This patch extends current buffer split to support protocol header
>>>>> based buffer split. A new proto_hdr field is introduced in the
>>>>> reserved field of rte_eth_rxseg_split structure to specify protocol
>>>>> header. The proto_hdr field defines the split position of packet,
>>>>> splitting will always happens after the protocol header defined in
>>>>> the Rx packet segment. When Rx queue offload
>>>>> RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is enabled and corresponding
>>>> protocol
>>>>> header is configured, PMD will split the ingress packets into
>>>>> multiple
>>>> segments.
>>>>>
>>>>> struct rte_eth_rxseg_split {
>>>>>
>>>>>            struct rte_mempool *mp; /* memory pools to allocate
>>>>> segment from
>>>> */
>>>>>            uint16_t length; /* segment maximal data length,
>>>>>                                configures "split point" */
>>>>>            uint16_t offset; /* data offset from beginning
>>>>>                                of mbuf data buffer */
>>>>>            uint32_t proto_hdr; /* inner/outer L2/L3/L4 protocol header,
>>>>> 			       configures "split point" */
>>>>>        };
>>>>>
>>>>> Both inner and outer L2/L3/L4 level protocol header split can be
>> supported.
>>>>> Corresponding protocol header capability is RTE_PTYPE_L2_ETHER,
>>>>> RTE_PTYPE_L3_IPV4, RTE_PTYPE_L3_IPV6, RTE_PTYPE_L4_TCP,
>>>>> RTE_PTYPE_L4_UDP, RTE_PTYPE_L4_SCTP, RTE_PTYPE_INNER_L2_ETHER,
>>>>> RTE_PTYPE_INNER_L3_IPV4, RTE_PTYPE_INNER_L3_IPV6,
>>>>> RTE_PTYPE_INNER_L4_TCP, RTE_PTYPE_INNER_L4_UDP,
>>>> RTE_PTYPE_INNER_L4_SCTP.
>>>>>
>>>>> For example, let's suppose we configured the Rx queue with the
>>>>> following segments:
>>>>>        seg0 - pool0, proto_hdr0=RTE_PTYPE_L3_IPV4, off0=2B
>>>>>        seg1 - pool1, proto_hdr1=RTE_PTYPE_L4_UDP, off1=128B
>>>>>        seg2 - pool2, off1=0B
>>>>>
>>>>> The packet consists of MAC_IPV4_UDP_PAYLOAD will be split like
>>>>> following:
>>>>>        seg0 - ipv4 header @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from
>>>> pool0
>>>>>        seg1 - udp header @ 128 in mbuf from pool1
>>>>>        seg2 - payload @ 0 in mbuf from pool2
>>>>
>>>> It must be defined how ICMPv4 packets will be split in such case.
>>>> And how UDP over IPv6 will be split.
>>>
>>> The ICMP header type is missed, I will define the expected split
>>> behavior and add it in next version, thanks for your catch.
> 
> I have a question here. Since ICMP packets are mainly used to check the
> connectivity of network, is it necessary for us to split ICMP packets?
> And I found there is no RTE_PTYPE for ICMP.

I'm not saying that we should split on ICMP. I'm just saying that we
must define what happens for packets which do not match the split
specification. Does it split on the longest match, with everything else
put in the last buffer (which one?)? E.g. we configure split on ETH-IPv4-TCP.
What happens with ETH-IPv4-UDP? ETH-IPv6?

>>>
>>> In fact, the buffer split based on protocol header depends on the driver
>> parsing result.
>>> As long as driver can recognize this packet type, I think there is no
>>> difference between UDP over IPV4 and UDP over IPV6?
>>
>> We can bind it to ptypes recognized by the HW+driver, but I can easily
>> imagine the case when HW has no means to report recognized packet type
>> (i.e. ptype get returns empty list), but still could split on it.
> 
> Get your point. But if one ptype cannot be recognized by HW+driver, is it still necessary for
> us to do the split? The main purpose of buffer split is to split header and payload. Although we
> add split for various protocol headers now, we should focus the ptype can be recognized.

Recognition and reporting are separate things. A packet type could be
recognized, but the HW may have no means to report it to the driver.
ptype_get is about reporting.

> 
>> Also, nobody guarantees that there is no different in UDP over IPv4 vs
>> IPv6 recognition and split. IPv6 could have a number of extension headers
>> which could be not that trivial to hop in HW. So, HW could recognize IPv6,
>> but not protocols after it.
>> Also it is very interesting question how to define protocol split for IPv6 plus
>> extension headers. Where to stop?
> 
> The extension header you mentioned is indeed an interesting question.
> On our device, the stop would be the end of extension header. The same as
> above, the main purpose of buffers split is for header and payload.
> Even rte_flow, we don't list all of the extension headers. So we can't cope with
> all the IPV6 extension headers.

Again, we must define the behaviour. The application needs to know what
to expect.

> 
> For IPV6 extension headers, what if we treat the IPV6 header and extension
> header as one layer? Because 99% of cases will not require a separate extension
> header.

I'd like to highlight that it is not "an extension header". It is
"extension headers" (plural). I'm not sure that we can say that,
in order to split on IPv6, HW must support *all* (even future)
extension headers.

> 
> Hope to get your insights.

Unfortunately I have no solutions. Just questions to be answered...

> 
>>
>>>
>>>>>
>>>>> Now buffer split can be configured in two modes. For length based
>>>>> buffer split, the mp, length, offset field in Rx packet segment
>>>>> should be configured, while the proto_hdr field should not be configured.
>>>>> For protocol header based buffer split, the mp, offset, proto_hdr
>>>>> field in Rx packet segment should be configured, while the length
>>>>> field should not be configured.
>>>>>
>>>>> The split limitations imposed by underlying PMD is reported in the
>>>>> rte_eth_dev_info->rx_seg_capa field. The memory attributes for the
>>>>> split parts may differ either, dpdk memory and external memory,
>>>> respectively.
>>>>>
>>>>> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
>>>>> Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
>>>>> Signed-off-by: Wenxuan Wu <wenxuanx.wu@intel.com>
>>>>> Reviewed-by: Qi Zhang <qi.z.zhang@intel.com>
>>>>> Acked-by: Ray Kinsella <mdr@ashroe.eu>
>>>>> ---
>>>>>     lib/ethdev/rte_ethdev.c | 40 +++++++++++++++++++++++++++++++++---
>> ----
>>>>>     lib/ethdev/rte_ethdev.h | 28 +++++++++++++++++++++++++++-
>>>>>     2 files changed, 60 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
>>>>> 29a3d80466..fbd55cdd9d 100644
>>>>> --- a/lib/ethdev/rte_ethdev.c
>>>>> +++ b/lib/ethdev/rte_ethdev.c
>>>>> @@ -1661,6 +1661,7 @@ rte_eth_rx_queue_check_split(const struct
>>>> rte_eth_rxseg_split *rx_seg,
>>>>>     		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
>>>>>     		uint32_t length = rx_seg[seg_idx].length;
>>>>>     		uint32_t offset = rx_seg[seg_idx].offset;
>>>>> +		uint32_t proto_hdr = rx_seg[seg_idx].proto_hdr;
>>>>>
>>>>>     		if (mpl == NULL) {
>>>>>     			RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
>>>> @@ -1694,13
>>>>> +1695,38 @@ rte_eth_rx_queue_check_split(const struct
>>>> rte_eth_rxseg_split *rx_seg,
>>>>>     		}
>>>>>     		offset += seg_idx != 0 ? 0 : RTE_PKTMBUF_HEADROOM;
>>>>>     		*mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
>>>>> -		length = length != 0 ? length : *mbp_buf_size;
>>>>> -		if (*mbp_buf_size < length + offset) {
>>>>> -			RTE_ETHDEV_LOG(ERR,
>>>>> -				       "%s mbuf_data_room_size %u < %u
>>>> (segment length=%u + segment offset=%u)\n",
>>>>> -				       mpl->name, *mbp_buf_size,
>>>>> -				       length + offset, length, offset);
>>>>> -			return -EINVAL;
>>>>> +		if (proto_hdr == RTE_PTYPE_UNKNOWN) {
>>>>> +			/* Split at fixed length. */
>>>>> +			length = length != 0 ? length : *mbp_buf_size;
>>>>> +			if (*mbp_buf_size < length + offset) {
>>>>> +				RTE_ETHDEV_LOG(ERR,
>>>>> +					"%s mbuf_data_room_size %u < %u
>>>> (segment length=%u + segment offset=%u)\n",
>>>>> +					mpl->name, *mbp_buf_size,
>>>>> +					length + offset, length, offset);
>>>>> +				return -EINVAL;
>>>>> +			}
>>>>> +		} else {
>>>>> +			/* Split after specified protocol header. */
>>>>> +			if (!(proto_hdr &
>>>> RTE_BUFFER_SPLIT_PROTO_HDR_MASK)) {
>>>>
>>>> The condition looks suspicious. It will be true if proto_hdr has no
>>>> single bit from the mask. I guess it is not the intent.
>>>
>>> Actually it is the intent... Here the mask is used to check if
>>> proto_hdr belongs to the inner/outer L2/L3/L4 capability we defined.
>>> And which proto_hdr is supported by the NIC will be checked in the PMD
>> later.
> 
> Need to correct here. You are right, I made a bug in previous implementation.
> 
>>
>> Frankly speaking I see no value in such incomplete check if we still rely on
>> driver. I simply see no reason to oblige the driver to support one of these
>> protocols.
> 
> With API, we can get the driver's capabilities first, and do the check in ethdev later.
> In this way, we can finish the checks in once. Please see v9.
> 
>>
>>>
>>>> I guess the condition should be
>>>>      proto_hdr & ~RTE_BUFFER_SPLIT_PROTO_HDR_MASK i.e. there is
>>>> unsupported bits in proto_hdr
>>>>
>>>> IMHO we need extra field in dev_info to report supported protocols to
>>>> split on. Or a new API to get an array similar to ptype get.
>>>> May be a new API is a better choice to not overload dev_info and to
>>>> be more flexible in reporting.
>>>
>>> Thanks for your suggestion.
>>> Here I hope to confirm the intent of dev_info or API to expose the
>> supported proto_hdr of driver.
>>> Is it for the pro_hdr check in the rte_eth_rx_queue_check_split()?
>>> If so, could we just check whether pro_hdrs configured belongs to
>>> L2/L3/L4 in lib, and check the capability in PMD? This is what the current
>> design does.
>>
>> Look. Application needs to know what to expect from eth device.
>> It should know which protocols it can split on. Of course we can enforce
>> application to use try-fail approach which would make sense if we have
>> dedicated API to request Rx buffer split, but since it is done via Rx queue
>> configuration, it could be tricky for application to realize which part of the
>> configuration is wrong. It could simply result in a too many retries with
>> different configuration.
> 
> Agree. To avoid the unnecessary try-fails, I will add a new API in dev_ops,
> please see v9.
> 
>>
>> I.e. the information should be used by ethdev to validate request and the
>> information should be ued by the application to understand what is
>> supported.
>>
>>>
>>> Actually I have another question, do we need a API or dev_info to expose
>> which buffer split the driver supports.
>>> i.e. length based or proto_hdr based. Because it requires different
>>> fields to be configured in RX packet segment.
>>
>> See above. If dedicated API return -ENOTSUP or empty set of supported
>> protocols to split on, the answer is clear.
> 
> Get your point. I totally agree with your idea of a new API.
> Through the API, the logic will look like:
> ret = rte_get_supported_buffer_split_proto();
> 	if (ret == -ENOTSUP)
> 		Check length based buffer split
> 	else
> 		Checked proto based buffer split
> 
> We don't need to care about the irrelevant fields in respective
> buffer split anymore.
> 
> BTW, could you help to review the deprecation notice for header split?
> If it gets acked, I will start the deprecation in 22.11.

Since the feature is definitely dead, I'd use a faster track:
deprecate in 22.07 and remove in 22.11.

> 
> Thanks,
> Xuan
> 
>>
>>>
>>> Hope to get your insights. :)
>>>
>>>>
>>>>> +				RTE_ETHDEV_LOG(ERR,
>>>>> +					"Protocol header %u not
>>>> supported)\n",
>>>>> +					proto_hdr);
>>>>
>>>> I think it would be useful to log unsupported bits only, if we say so.
>>>
>>> The same as above.
>>> Thanks again for your time.
>>>
>>> Regards,
>>> Xuan
  
Ding, Xuan June 10, 2022, 3:04 p.m. UTC | #6
Hi Andrew,

Sorry for the late response, please see replies inline.

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Tuesday, June 7, 2022 6:49 PM
> To: Ding, Xuan <xuan.ding@intel.com>; Wu, WenxuanX
> <wenxuanx.wu@intel.com>; thomas@monjalon.net; Li, Xiaoyun
> <xiaoyun.li@intel.com>; ferruh.yigit@xilinx.com; Singh, Aman Deep
> <aman.deep.singh@intel.com>; dev@dpdk.org; Zhang, Yuying
> <yuying.zhang@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> jerinjacobk@gmail.com
> Cc: stephen@networkplumber.org; Wang, YuanX <yuanx.wang@intel.com>;
> Ray Kinsella <mdr@ashroe.eu>
> Subject: Re: [PATCH v8 1/3] ethdev: introduce protocol hdr based buffer split
> 
> On 6/7/22 13:13, Ding, Xuan wrote:
> > Hi Andrew,
> >
> >>
> >> On 6/3/22 19:30, Ding, Xuan wrote:
> >>> Hi Andrew,
> >>>
> >>>>
> >>>> Is it the right one since it is listed in patchwork?
> >>>
> >>> Yes, it is.
> >>>
> >>>>
> >>>> On 6/1/22 16:50, wenxuanx.wu@intel.com wrote:
> >>>>> From: Wenxuan Wu <wenxuanx.wu@intel.com>
> >>>>>
> >>>>> Currently, Rx buffer split supports length based split. With Rx
> >>>>> queue offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT enabled and Rx
> >> packet
> >>>> segment
> >>>>> configured, PMD will be able to split the received packets into
> >>>>> multiple segments.
> >>>>>
> >>>>> However, length based buffer split is not suitable for NICs that
> >>>>> do split based on protocol headers. Given a arbitrarily variable
> >>>>> length in Rx packet
> >>>>
> >>>> a -> an
> >>>
> >>> Thanks for your catch, will fix it in next version.
> >>>
> >>>>
> >>>>> segment, it is almost impossible to pass a fixed protocol header to
> PMD.
> >>>>> Besides, the existence of tunneling results in the composition of
> >>>>> a packet is various, which makes the situation even worse.
> >>>>>
> >>>>> This patch extends current buffer split to support protocol header
> >>>>> based buffer split. A new proto_hdr field is introduced in the
> >>>>> reserved field of rte_eth_rxseg_split structure to specify
> >>>>> protocol header. The proto_hdr field defines the split position of
> >>>>> packet, splitting will always happens after the protocol header
> >>>>> defined in the Rx packet segment. When Rx queue offload
> >>>>> RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is enabled and corresponding
> >>>> protocol
> >>>>> header is configured, PMD will split the ingress packets into
> >>>>> multiple
> >>>> segments.
> >>>>>
> >>>>> struct rte_eth_rxseg_split {
> >>>>>
> >>>>>            struct rte_mempool *mp; /* memory pools to allocate
> >>>>> segment from
> >>>> */
> >>>>>            uint16_t length; /* segment maximal data length,
> >>>>>                                configures "split point" */
> >>>>>            uint16_t offset; /* data offset from beginning
> >>>>>                                of mbuf data buffer */
> >>>>>            uint32_t proto_hdr; /* inner/outer L2/L3/L4 protocol header,
> >>>>> 			       configures "split point" */
> >>>>>        };
> >>>>>
> >>>>> Both inner and outer L2/L3/L4 level protocol header split can be
> >> supported.
> >>>>> Corresponding protocol header capability is RTE_PTYPE_L2_ETHER,
> >>>>> RTE_PTYPE_L3_IPV4, RTE_PTYPE_L3_IPV6, RTE_PTYPE_L4_TCP,
> >>>>> RTE_PTYPE_L4_UDP, RTE_PTYPE_L4_SCTP,
> RTE_PTYPE_INNER_L2_ETHER,
> >>>>> RTE_PTYPE_INNER_L3_IPV4, RTE_PTYPE_INNER_L3_IPV6,
> >>>>> RTE_PTYPE_INNER_L4_TCP, RTE_PTYPE_INNER_L4_UDP,
> >>>> RTE_PTYPE_INNER_L4_SCTP.
> >>>>>
> >>>>> For example, let's suppose we configured the Rx queue with the
> >>>>> following segments:
> >>>>>        seg0 - pool0, proto_hdr0=RTE_PTYPE_L3_IPV4, off0=2B
> >>>>>        seg1 - pool1, proto_hdr1=RTE_PTYPE_L4_UDP, off1=128B
> >>>>>        seg2 - pool2, off1=0B
> >>>>>
> >>>>> The packet consists of MAC_IPV4_UDP_PAYLOAD will be split like
> >>>>> following:
> >>>>>        seg0 - ipv4 header @ RTE_PKTMBUF_HEADROOM + 2 in mbuf
> from
> >>>> pool0
> >>>>>        seg1 - udp header @ 128 in mbuf from pool1
> >>>>>        seg2 - payload @ 0 in mbuf from pool2
> >>>>
> >>>> It must be defined how ICMPv4 packets will be split in such case.
> >>>> And how UDP over IPv6 will be split.
> >>>
> >>> The ICMP header type is missed, I will define the expected split
> >>> behavior and add it in next version, thanks for your catch.
> >
> > I have a question here. Since ICMP packets are mainly used to check
> > the connectivity of the network, is it necessary for us to split ICMP packets?
> > And I found there is no RTE_PTYPE for ICMP.
> 
> I'm not saying that we should split on ICMP. I'm just saying that we must
> define the behaviour for packets which do not match the split
> specification. Does it split on the longest match, with everything else put
> in the (which?) last buffer? E.g. we configure split on ETH-IPv4-TCP.
> What happens with ETH-IPv4-UDP? ETH-IPv6?

I get your point. Firstly, our device only supports splitting packets into two
segments, so there will be an exact match for the configured protocol header.
Back to this question, for the set of proto_hdrs configured, there are two
possible behaviors:
1. The aggressive way is to split on the longest match you mentioned. E.g. if we
configure split on ETH-IPV4-TCP and receive ETH-IPV4-UDP or ETH-IPV6, the device
can still split on ETH-IPV4 or ETH.
2. A more conservative way is to split only when the packet exactly matches the
protocol headers in the Rx packet segment. In the above situation, no split is
done for ETH-IPV4-UDP and ETH-IPV6.

I prefer the second behavior, because the split usually targets the innermost
header and payload; if the packet does not match, the remaining headers have no
actual value.
What do you think?
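The two candidate behaviours can be contrasted with a small standalone sketch. This is illustrative C only: the ptype bits and the helper names (longest_match, split_points) are hypothetical, not DPDK code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only (not DPDK code): hypothetical ptype bits
 * used to contrast the two candidate split behaviours.
 */
#define PT_ETH  0x1u
#define PT_IPV4 0x2u
#define PT_IPV6 0x4u
#define PT_TCP  0x8u
#define PT_UDP  0x10u

/* Return how many leading configured headers the packet matches. */
static size_t
longest_match(const uint32_t *cfg, size_t n_cfg,
	      const uint32_t *pkt, size_t n_pkt)
{
	size_t i = 0;

	while (i < n_cfg && i < n_pkt && cfg[i] == pkt[i])
		i++;
	return i;
}

/*
 * Behaviour 1 (aggressive): split after the longest matching prefix.
 * Behaviour 2 (conservative): split only on a full match, else no split.
 * Returns the number of headers split off (0 means no split).
 */
static size_t
split_points(const uint32_t *cfg, size_t n_cfg,
	     const uint32_t *pkt, size_t n_pkt, int conservative)
{
	size_t m = longest_match(cfg, n_cfg, pkt, n_pkt);

	if (conservative)
		return m == n_cfg ? m : 0;
	return m;
}
```

With split configured on ETH-IPV4-TCP, an ETH-IPV4-UDP packet yields no split in the conservative mode but a split after IPv4 in the aggressive mode.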

> 
> >>>
> >>> In fact, the buffer split based on protocol header depends on the
> >>> driver
> >> parsing result.
> >>> As long as driver can recognize this packet type, I think there is
> >>> no difference between UDP over IPV4 and UDP over IPV6?
> >>
> >> We can bind it to ptypes recognized by the HW+driver, but I can
> >> easily imagine the case when HW has no means to report recognized
> >> packet type (i.e. ptype get returns empty list), but still could split on it.
> >
> > I get your point. But if one ptype cannot be recognized by the HW+driver,
> > is it still necessary for us to do the split? The main purpose of buffer
> > split is to separate header and payload. Although we add split for various
> > protocol headers now, we should focus on the ptypes that can be recognized.
> 
> Recognition and reporting are separate things. It could be recognized, but the
> HW can have no means to report it to the driver. ptype_get is about reporting.
> 
> >
> >> Also, nobody guarantees that there is no difference in UDP over IPv4
> >> vs
> >> IPv6 recognition and split. IPv6 could have a number of extension
> >> headers which could be not that trivial to hop in HW. So, HW could
> >> recognize IPv6, but not protocols after it.
> >> Also it is very interesting question how to define protocol split for
> >> IPv6 plus extension headers. Where to stop?
> >
> > The extension header you mentioned is indeed an interesting question.
> > On our device, the stop would be the end of the extension headers. The same
> > as above, the main purpose of buffer split is to separate header and payload.
> > Even in rte_flow, we don't list all of the extension headers, so we can't
> > cope with all the IPV6 extension headers.
> 
> Again, we must define the behaviour. Application needs to know what to
> expect.

Now I understand the behavior needs to be defined clearly for the application.
The application needs a clear expectation for each function call.

> 
> >
> > For IPV6 extension headers, what if we treat the IPV6 header and
> > extension header as one layer? Because 99% of cases will not require a
> > separate extension header.
> 
> I'd like to highlight that it is not "an extension header". It is 'extension
> headers' (plural). I'm not sure that we can say that in order to split on IPv6
> HW must support *all* (even future) extension headers.

Yes, I'm also referring to extension headers (plural) here.

Whether it's one or more layers of headers for IPV6, as long as the driver can
recognize the end of the IPV6 extension headers, we split after the IPV6 layer
once. E.g. for ETH-IPV6-IPV6 extension-UDP-payload, we split it into
ETH-IPV6-IPV6-extension, UDP and payload with Rx segments RTE_PTYPE_IPV6 and
RTE_PTYPE_UDP configured.

This is the behavior I hope to define for IPV6 extension headers (also our
device's behavior). We don't specify the IPV6 extension header in the Rx
segment. If IPV6 split is configured, the IPV6 header and extension headers
(if present) will be treated as one layer.
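Treating "IPv6 plus extension headers" as one layer amounts to walking the Next Header chain until a non-extension header is reached. The sketch below is illustrative only, not driver code: header lengths are passed in directly rather than decoded from the headers, and only the common extension header types (IANA Next Header values) are checked.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of "IPv6 + extension headers as one layer" (illustrative, not
 * driver code): walk the Next Header chain after the fixed IPv6 header
 * until a non-extension header is reached, returning the split offset.
 * Real extension headers encode their own length; here the lengths are
 * supplied by the caller for simplicity.
 */
#define IP6_HDR_LEN 40

/* Common IPv6 extension headers (IANA Next Header values). */
static int is_ext_hdr(uint8_t nh)
{
	return nh == 0 /* hop-by-hop */ || nh == 43 /* routing */ ||
	       nh == 44 /* fragment */ || nh == 60 /* dest options */;
}

/*
 * nh_chain[i]: type of the i-th header after the fixed IPv6 header;
 * len[i]: its length in bytes. Returns the offset (from the start of
 * the IPv6 header) at which the "IPv6 layer" ends, i.e. the split point.
 */
static size_t
ipv6_layer_end(const uint8_t *nh_chain, const size_t *len, size_t n)
{
	size_t off = IP6_HDR_LEN;
	size_t i;

	for (i = 0; i < n && is_ext_hdr(nh_chain[i]); i++)
		off += len[i];
	return off;
}
```

So for ETH-IPV6-fragment-UDP the split after the IPv6 layer falls 48 bytes into the IPv6 header (40 B fixed header plus an 8 B fragment header), while plain ETH-IPV6-UDP splits after 40 bytes.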

> 
> >
> > Hope to get your insights.
> 
> Unfortunately I have no solutions. Just questions to be answered...
> 
> >
> >>
> >>>
> >>>>>
> >>>>> Now buffer split can be configured in two modes. For length based
> >>>>> buffer split, the mp, length, offset field in Rx packet segment
> >>>>> should be configured, while the proto_hdr field should not be
> configured.
> >>>>> For protocol header based buffer split, the mp, offset, proto_hdr
> >>>>> field in Rx packet segment should be configured, while the length
> >>>>> field should not be configured.
> >>>>>
> >>>>> The split limitations imposed by underlying PMD is reported in the
> >>>>> rte_eth_dev_info->rx_seg_capa field. The memory attributes for the
> >>>>> split parts may differ either, dpdk memory and external memory,
> >>>> respectively.
> >>>>>
> >>>>> Signed-off-by: Xuan Ding <xuan.ding@intel.com>
> >>>>> Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
> >>>>> Signed-off-by: Wenxuan Wu <wenxuanx.wu@intel.com>
> >>>>> Reviewed-by: Qi Zhang <qi.z.zhang@intel.com>
> >>>>> Acked-by: Ray Kinsella <mdr@ashroe.eu>
> >>>>> ---
> >>>>>     lib/ethdev/rte_ethdev.c | 40
> >>>>> +++++++++++++++++++++++++++++++++---
> >> ----
> >>>>>     lib/ethdev/rte_ethdev.h | 28 +++++++++++++++++++++++++++-
> >>>>>     2 files changed, 60 insertions(+), 8 deletions(-)
> >>>>>
> >>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> >>>>> index 29a3d80466..fbd55cdd9d 100644
> >>>>> --- a/lib/ethdev/rte_ethdev.c
> >>>>> +++ b/lib/ethdev/rte_ethdev.c
> >>>>> @@ -1661,6 +1661,7 @@ rte_eth_rx_queue_check_split(const struct
> >>>> rte_eth_rxseg_split *rx_seg,
> >>>>>     		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
> >>>>>     		uint32_t length = rx_seg[seg_idx].length;
> >>>>>     		uint32_t offset = rx_seg[seg_idx].offset;
> >>>>> +		uint32_t proto_hdr = rx_seg[seg_idx].proto_hdr;
> >>>>>
> >>>>>     		if (mpl == NULL) {
> >>>>>     			RTE_ETHDEV_LOG(ERR, "null mempool
> pointer\n");
> >>>> @@ -1694,13
> >>>>> +1695,38 @@ rte_eth_rx_queue_check_split(const struct
> >>>> rte_eth_rxseg_split *rx_seg,
> >>>>>     		}
> >>>>>     		offset += seg_idx != 0 ? 0 :
> RTE_PKTMBUF_HEADROOM;
> >>>>>     		*mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> >>>>> -		length = length != 0 ? length : *mbp_buf_size;
> >>>>> -		if (*mbp_buf_size < length + offset) {
> >>>>> -			RTE_ETHDEV_LOG(ERR,
> >>>>> -				       "%s mbuf_data_room_size %u < %u
> >>>> (segment length=%u + segment offset=%u)\n",
> >>>>> -				       mpl->name, *mbp_buf_size,
> >>>>> -				       length + offset, length, offset);
> >>>>> -			return -EINVAL;
> >>>>> +		if (proto_hdr == RTE_PTYPE_UNKNOWN) {
> >>>>> +			/* Split at fixed length. */
> >>>>> +			length = length != 0 ? length : *mbp_buf_size;
> >>>>> +			if (*mbp_buf_size < length + offset) {
> >>>>> +				RTE_ETHDEV_LOG(ERR,
> >>>>> +					"%s
> mbuf_data_room_size %u < %u
> >>>> (segment length=%u + segment offset=%u)\n",
> >>>>> +					mpl->name, *mbp_buf_size,
> >>>>> +					length + offset, length, offset);
> >>>>> +				return -EINVAL;
> >>>>> +			}
> >>>>> +		} else {
> >>>>> +			/* Split after specified protocol header. */
> >>>>> +			if (!(proto_hdr &
> >>>> RTE_BUFFER_SPLIT_PROTO_HDR_MASK)) {
> >>>>
> >>>> The condition looks suspicious. It will be true if proto_hdr has no
> >>>> single bit from the mask. I guess it is not the intent.
> >>>
> >>> Actually it is the intent... Here the mask is used to check if
> >>> proto_hdr belongs to the inner/outer L2/L3/L4 capability we defined.
> >>> And which proto_hdr is supported by the NIC will be checked in the
> >>> PMD later.
> >
> > Need to correct here. You are right, I made a bug in previous
> implementation.
> >
> >>
> >> Frankly speaking, I see no value in such an incomplete check if we still
> >> rely on the driver. I simply see no reason to oblige the driver to
> >> support one of these protocols.
> >
> > With the API, we can get the driver's capabilities first and do the check
> > in ethdev later.
> > In this way, we can finish the checks at once. Please see v9.
> >
> >>
> >>>
> >>>> I guess the condition should be
> >>>>      proto_hdr & ~RTE_BUFFER_SPLIT_PROTO_HDR_MASK i.e. there is
> >>>> unsupported bits in proto_hdr
> >>>>
> >>>> IMHO we need extra field in dev_info to report supported protocols
> >>>> to split on. Or a new API to get an array similar to ptype get.
> >>>> May be a new API is a better choice to not overload dev_info and to
> >>>> be more flexible in reporting.
> >>>
> >>> Thanks for your suggestion.
> >>> Here I hope to confirm the intent of the dev_info or API to expose the
> >>> supported proto_hdrs of the driver.
> >>> Is it for the proto_hdr check in rte_eth_rx_queue_check_split()?
> >>> If so, could we just check whether the configured proto_hdrs belong to
> >>> L2/L3/L4 in the lib, and check the capability in the PMD? This is what
> >>> the current design does.
> >>
> >> Look. Application needs to know what to expect from eth device.
> >> It should know which protocols it can split on. Of course we can
> >> enforce application to use try-fail approach which would make sense
> >> if we have dedicated API to request Rx buffer split, but since it is
> >> done via Rx queue configuration, it could be tricky for application
> >> to realize which part of the configuration is wrong. It could simply
> >> result in too many retries with different configurations.
> >
> > Agree. To avoid the unnecessary try-fails, I will add a new API in
> > dev_ops, please see v9.
> >
> >>
> >> I.e. the information should be used by ethdev to validate the request,
> >> and the information should be used by the application to understand
> >> what is supported.
> >>
> >>>
> >>> Actually I have another question: do we need an API or dev_info to
> >>> expose which buffer split the driver supports,
> >>> i.e. length based or proto_hdr based? Because they require different
> >>> fields to be configured in the Rx packet segment.
> >>
> >> See above. If dedicated API return -ENOTSUP or empty set of supported
> >> protocols to split on, the answer is clear.
> >
> > I get your point. I totally agree with your idea of a new API.
> > Through the API, the logic will look like:
> > ret = rte_get_supported_buffer_split_proto();
> > 	if (ret == -ENOTSUP)
> > 		Check length based buffer split
> > 	else
> > 		Check proto based buffer split
> >
> > We don't need to care about the irrelevant fields in respective buffer
> > split anymore.
> >
> > BTW, could you help to review the deprecation notice for header split?
> > If it gets acked, I will start the deprecation in 22.11.
> 
> Since the feature is definitely dead, I'd use a faster track:
> deprecate in 22.07 and remove in 22.11.

Does this mean I don't need to do the removal work? I'm also willing to help.
A deprecation notice has already been sent in 22.07.
http://patchwork.dpdk.org/project/dpdk/patch/20220523142016.44451-1-xuan.ding@intel.com/

Thanks,
Xuan

> 
> >
> > Thanks,
> > Xuan
> >
> >>
> >>>
> >>> Hope to get your insights. :)
> >>>
> >>>>
> >>>>> +				RTE_ETHDEV_LOG(ERR,
> >>>>> +					"Protocol header %u not
> >>>> supported)\n",
> >>>>> +					proto_hdr);
> >>>>
> >>>> I think it would be useful to log unsupported bits only, if we say so.
> >>>
> >>> The same as above.
> >>> Thanks again for your time.
> >>>
> >>> Regards,
> >>> Xuan
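Before the patch itself, the mask-check issue debated in this thread can be shown with a small standalone example. The mask bits and helper names here are simplified stand-ins, not the real RTE_PTYPE values or DPDK code: a proto_hdr combining one supported and one unsupported bit slips past the posted `!(proto_hdr & MASK)` test but is caught by the suggested `proto_hdr & ~MASK` test.

```c
#include <stdint.h>

/*
 * Standalone illustration of the two candidate validity checks, with a
 * simplified two-bit mask instead of the real RTE_PTYPE-based mask.
 */
#define MASK (0x1u | 0x2u)	/* the supported protocol header bits */

/* Posted check: true only when proto_hdr contains NO supported bit. */
static int lacks_supported_bit(uint32_t proto_hdr)
{
	return !(proto_hdr & MASK);
}

/* Suggested check: true when proto_hdr contains ANY unsupported bit. */
static int has_unsupported_bit(uint32_t proto_hdr)
{
	return (proto_hdr & ~MASK) != 0;
}
```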
  

Patch

diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 29a3d80466..fbd55cdd9d 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -1661,6 +1661,7 @@  rte_eth_rx_queue_check_split(const struct rte_eth_rxseg_split *rx_seg,
 		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
 		uint32_t length = rx_seg[seg_idx].length;
 		uint32_t offset = rx_seg[seg_idx].offset;
+		uint32_t proto_hdr = rx_seg[seg_idx].proto_hdr;
 
 		if (mpl == NULL) {
 			RTE_ETHDEV_LOG(ERR, "null mempool pointer\n");
@@ -1694,13 +1695,38 @@  rte_eth_rx_queue_check_split(const struct rte_eth_rxseg_split *rx_seg,
 		}
 		offset += seg_idx != 0 ? 0 : RTE_PKTMBUF_HEADROOM;
 		*mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
-		length = length != 0 ? length : *mbp_buf_size;
-		if (*mbp_buf_size < length + offset) {
-			RTE_ETHDEV_LOG(ERR,
-				       "%s mbuf_data_room_size %u < %u (segment length=%u + segment offset=%u)\n",
-				       mpl->name, *mbp_buf_size,
-				       length + offset, length, offset);
-			return -EINVAL;
+		if (proto_hdr == RTE_PTYPE_UNKNOWN) {
+			/* Split at fixed length. */
+			length = length != 0 ? length : *mbp_buf_size;
+			if (*mbp_buf_size < length + offset) {
+				RTE_ETHDEV_LOG(ERR,
+					"%s mbuf_data_room_size %u < %u (segment length=%u + segment offset=%u)\n",
+					mpl->name, *mbp_buf_size,
+					length + offset, length, offset);
+				return -EINVAL;
+			}
+		} else {
+			/* Split after specified protocol header. */
+			if (!(proto_hdr & RTE_BUFFER_SPLIT_PROTO_HDR_MASK)) {
+				RTE_ETHDEV_LOG(ERR,
+					"Protocol header %u not supported)\n",
+					proto_hdr);
+				return -EINVAL;
+			}
+
+			if (length != 0) {
+				RTE_ETHDEV_LOG(ERR, "segment length should be set to zero in protocol header "
+					       "based buffer split\n");
+				return -EINVAL;
+			}
+
+			if (*mbp_buf_size < offset) {
+				RTE_ETHDEV_LOG(ERR,
+						"%s mbuf_data_room_size %u < %u segment offset)\n",
+						mpl->name, *mbp_buf_size,
+						offset);
+				return -EINVAL;
+			}
 		}
 	}
 	return 0;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04cff8ee10..0cd9dd6cc0 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1187,6 +1187,9 @@  struct rte_eth_txmode {
  *   mbuf) the following data will be pushed to the next segment
  *   up to its own length, and so on.
  *
+ * - The proto_hdrs in the elements define the split position of
+ *   received packets.
+ *
  * - If the length in the segment description element is zero
  *   the actual buffer size will be deduced from the appropriate
  *   memory pool properties.
@@ -1197,14 +1200,37 @@  struct rte_eth_txmode {
  *     - pool from the last valid element
  *     - the buffer size from this pool
  *     - zero offset
+ *
+ * - Length based buffer split:
+ *     - mp, length, offset should be configured.
+ *     - The proto_hdr field should not be configured.
+ *
+ * - Protocol header based buffer split:
+ *     - mp, offset, proto_hdr should be configured.
+ *     - The length field should not be configured.
  */
 struct rte_eth_rxseg_split {
 	struct rte_mempool *mp; /**< Memory pool to allocate segment from. */
 	uint16_t length; /**< Segment data length, configures split point. */
 	uint16_t offset; /**< Data offset from beginning of mbuf data buffer. */
-	uint32_t reserved; /**< Reserved field. */
+	uint32_t proto_hdr; /**< Inner/outer L2/L3/L4 protocol header, configures split point. */
 };
 
+/* Buffer split protocol header capability. */
+#define RTE_BUFFER_SPLIT_PROTO_HDR_MASK ( \
+	RTE_PTYPE_L2_ETHER | \
+	RTE_PTYPE_L3_IPV4 | \
+	RTE_PTYPE_L3_IPV6 | \
+	RTE_PTYPE_L4_TCP | \
+	RTE_PTYPE_L4_UDP | \
+	RTE_PTYPE_L4_SCTP | \
+	RTE_PTYPE_INNER_L2_ETHER | \
+	RTE_PTYPE_INNER_L3_IPV4 | \
+	RTE_PTYPE_INNER_L3_IPV6 | \
+	RTE_PTYPE_INNER_L4_TCP | \
+	RTE_PTYPE_INNER_L4_UDP | \
+	RTE_PTYPE_INNER_L4_SCTP)
+
 /**
  * @warning
  * @b EXPERIMENTAL: this structure may change without prior notice.