[dpdk-dev,v6,1/6] ethdev: add Tx preparation

Message ID 1476457519-6840-2-git-send-email-tomaszx.kulasek@intel.com (mailing list archive)
State Superseded, archived

Commit Message

Tomasz Kulasek Oct. 14, 2016, 3:05 p.m. UTC
  Added API for `rte_eth_tx_prep`

uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
	struct rte_mbuf **tx_pkts, uint16_t nb_pkts)

Added fields to the `struct rte_eth_desc_lim`:

	uint16_t nb_seg_max;
		/**< Max number of segments per whole packet. */

	uint16_t nb_mtu_seg_max;
		/**< Max number of segments per one MTU */

Created `rte_pkt.h` header with commonly used functions:

int rte_validate_tx_offload(struct rte_mbuf *m)
	to validate general requirements for Tx offload set in a packet,
	such as flag completeness. In the current implementation this
	function is called optionally, when RTE_LIBRTE_ETHDEV_DEBUG is
	enabled.

int rte_phdr_cksum_fix(struct rte_mbuf *m)
	to fix the pseudo-header checksum for TSO and non-TSO TCP/UDP
	packets before hardware Tx checksum offload:
	 - for non-TSO TCP/UDP packets the full pseudo-header checksum is
	   computed and set;
	 - for TSO the IP payload length is not included.
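
For illustration, a minimal usage sketch (application side; the
handle_invalid_packet() helper is hypothetical) calls the preparation
step right before the burst:

	uint16_t nb_prep, nb_sent;

	nb_prep = rte_eth_tx_prep(port_id, queue_id, pkts, nb_pkts);
	if (nb_prep < nb_pkts) {
		/* pkts[nb_prep] failed validation; rte_errno says why */
		handle_invalid_packet(pkts[nb_prep]);
	}
	nb_sent = rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);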


PERFORMANCE TESTS
-----------------

This feature was tested with a modified csum engine from test-pmd.

The packet checksum preparation was moved from the application to the
Tx preparation step placed before the burst.

We may expect some overhead caused by:
1) using an additional callback before the burst,
2) rescanning the burst,
3) additional condition checking (packet validation),
4) worse optimization (e.g. packet data access, etc.)

We tested it using the ixgbe Tx preparation implementation with some
parts disabled, to obtain comparable information about the impact of
the different parts of the implementation.

IMPACT:

1) For an unimplemented Tx preparation callback the performance impact
   is negligible.
2) For the packet condition check without checksum modifications
   (nb_segs, available offloads, etc.) the measured rate is
   14626628/14252168 (~2.62% drop).
3) For full support in the ixgbe driver (point 2 plus packet checksum
   initialization) it is 14060924/13588094 (~3.48% drop).


Signed-off-by: Tomasz Kulasek <tomaszx.kulasek@intel.com>
---
 config/common_base            |    1 +
 lib/librte_ether/rte_ethdev.h |   85 +++++++++++++++++++++++++
 lib/librte_mbuf/rte_mbuf.h    |    9 +++
 lib/librte_net/Makefile       |    3 +-
 lib/librte_net/rte_pkt.h      |  137 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 234 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_net/rte_pkt.h
  

Comments

Olivier Matz Oct. 18, 2016, 2:57 p.m. UTC | #1
Hi Tomasz,

I think the principle of tx_prep() is good; it may for instance help to
remove the function virtio_tso_fix_cksum() from the virtio driver, and
maybe even change the mbuf TSO/cksum API.

I have some questions/comments below, I'm sorry it comes very late.

On 10/14/2016 05:05 PM, Tomasz Kulasek wrote:
> Added API for `rte_eth_tx_prep`
> 
> uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
> 	struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> 
> Added fields to the `struct rte_eth_desc_lim`:
> 
> 	uint16_t nb_seg_max;
> 		/**< Max number of segments per whole packet. */
> 
> 	uint16_t nb_mtu_seg_max;
> 		/**< Max number of segments per one MTU */

Not sure I understand the second one. Is this for the TSO case?

Is it a usual limitation in different network hardware?
Can this info be retrieved/used by the application?

> 
> Created `rte_pkt.h` header with commonly used functions:
> 
> int rte_validate_tx_offload(struct rte_mbuf *m)
> 	to validate general requirements for Tx offload set in a packet,
> 	such as flag completeness. In the current implementation this
> 	function is called optionally, when RTE_LIBRTE_ETHDEV_DEBUG is
> 	enabled.
> 
> int rte_phdr_cksum_fix(struct rte_mbuf *m)
> 	to fix the pseudo-header checksum for TSO and non-TSO TCP/UDP
> 	packets before hardware Tx checksum offload:
> 	 - for non-TSO TCP/UDP packets the full pseudo-header checksum is
> 	   computed and set;
> 	 - for TSO the IP payload length is not included.

Why not in rte_net.h?


> [...]
>  
> @@ -2816,6 +2825,82 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>  	return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts);
>  }
>  
> +/**
> + * Process a burst of output packets on a transmit queue of an Ethernet device.
> + *
> + * The rte_eth_tx_prep() function is invoked to prepare output packets to be
> + * transmitted on the output queue *queue_id* of the Ethernet device designated
> + * by its *port_id*.
> + * The *nb_pkts* parameter is the number of packets to be prepared which are
> + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
> + * allocated from a pool created with rte_pktmbuf_pool_create().
> + * For each packet to send, the rte_eth_tx_prep() function performs
> + * the following operations:
> + *
> + * - Check if packet meets devices requirements for tx offloads.

Do you mean hardware requirements?
Can the application be aware of these requirements? I mean capability
flags, or something in dev_infos?

Maybe the comment could be more precise?

> + * - Check limitations about number of segments.
> + *
> + * - Check additional requirements when debug is enabled.

What kind of additional requirements?

> + *
> + * - Update and/or reset required checksums when tx offload is set for packet.
> + *

By reading this, I think it may not be clear to the user what should
be set in the mbuf. In the mbuf API, it is said:

 * TCP segmentation offload. To enable this offload feature for a
 * packet to be transmitted on hardware supporting TSO:
 *  - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
 *    PKT_TX_TCP_CKSUM)
 *  - set the flag PKT_TX_IPV4 or PKT_TX_IPV6
 *  - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP checksum
 *    to 0 in the packet
 *  - fill the mbuf offload information: l2_len, l3_len, l4_len, tso_segsz
 *  - calculate the pseudo header checksum without taking ip_len in account,
 *    and set it in the TCP header. Refer to rte_ipv4_phdr_cksum() and
 *    rte_ipv6_phdr_cksum() that can be used as helpers.


If I understand well, using tx_prep(), the user will have to do the
same except writing the IP checksum to 0, and without setting the
TCP pseudo header checksum, right?
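
For concreteness, the per-packet setup would then look roughly like
this (IPv4/TCP, purely illustrative; mss stands for the TCP MSS):

	m->ol_flags |= PKT_TX_TCP_SEG | PKT_TX_IPV4 | PKT_TX_IP_CKSUM;
	m->l2_len = sizeof(struct ether_hdr);
	m->l3_len = sizeof(struct ipv4_hdr);
	m->l4_len = sizeof(struct tcp_hdr);
	m->tso_segsz = mss;
	/* no zeroing of the IP checksum and no pseudo-header checksum
	 * here: rte_eth_tx_prep() would take care of both */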


> + * The rte_eth_tx_prep() function returns the number of packets ready to be
> + * sent. A return value equal to *nb_pkts* means that all packets are valid and
> + * ready to be sent.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The index of the transmit queue through which output packets must be
> + *   sent.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_pkts
> + *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
> + *   which contain the output packets.
> + * @param nb_pkts
> + *   The maximum number of packets to process.
> + * @return
> + *   The number of packets correct and ready to be sent. The return value can be
> + *   less than the value of the *tx_pkts* parameter when some packet doesn't
> + *   meet devices requirements with rte_errno set appropriately.
> + */

Can we add the constraint that invalid packets are left untouched?

I think most of the time there will be a software fallback in that
case, so it would be good to ensure that this function does not change
the flags or the packet data.

Another thing that could be interesting for the caller is to know the
reason of the failure. Maybe the different errno types could be detailed
here. For instance:
- EINVAL: offload flags are not correctly set (i.e. would fail whatever
  the hardware)
- ENOTSUP: the offload feature is not supported by the hardware
- ...
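
The caller could then do something like this (a hypothetical sketch;
do_sw_offload() is application-defined):

	uint16_t nb_prep = rte_eth_tx_prep(port_id, queue_id, pkts, nb_pkts);

	if (nb_prep < nb_pkts) {
		if (rte_errno == ENOTSUP) {
			/* offload unsupported by this port: software fallback */
			do_sw_offload(pkts[nb_prep]);
		} else {
			/* inconsistent offload request: drop the packet */
			rte_pktmbuf_free(pkts[nb_prep]);
		}
	}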

> +
> +#ifdef RTE_ETHDEV_TX_PREP
> +
> +static inline uint16_t
> +rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **tx_pkts,
> +		uint16_t nb_pkts)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (!dev->tx_pkt_prep)
> +		return nb_pkts;
> +
> +#ifdef RTE_LIBRTE_ETHDEV_DEBUG
> +	if (queue_id >= dev->data->nb_tx_queues) {
> +		RTE_PMD_DEBUG_TRACE("Invalid TX queue_id=%d\n", queue_id);
> +		rte_errno = -EINVAL;
> +		return 0;
> +	}
> +#endif

Why check the queue_id but not the port_id?

Maybe the API comment should also be modified to say that the port_id
has to be valid, because most ethdev functions do the check.

> +
> +	return (*dev->tx_pkt_prep)(dev->data->tx_queues[queue_id],
> +			tx_pkts, nb_pkts);
> +}
> +
> +#else
> +
> +static inline uint16_t
> +rte_eth_tx_prep(__rte_unused uint8_t port_id, __rte_unused uint16_t queue_id,
> +		__rte_unused struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> +{
> +	return nb_pkts;
> +}
> +
> +#endif
> +

nit: I wonder if the #else part should be inside the function instead
(with the proper RTE_SET_USED()), it would avoid defining the
prototype twice.

>  typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
>  		void *userdata);
>  
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 109e666..cfd6284 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -276,6 +276,15 @@ extern "C" {
>   */
>  #define PKT_TX_OUTER_IPV4   (1ULL << 59)
>  
> +#define PKT_TX_OFFLOAD_MASK (    \
> +		PKT_TX_IP_CKSUM |        \
> +		PKT_TX_L4_MASK |         \
> +		PKT_TX_OUTER_IP_CKSUM |  \
> +		PKT_TX_TCP_SEG |         \
> +		PKT_TX_QINQ_PKT |        \
> +		PKT_TX_VLAN_PKT |        \
> +		PKT_TX_TUNNEL_MASK)
> +

Could you add an API comment?

> --- /dev/null
> +++ b/lib/librte_net/rte_pkt.h
> 
> [...]
> +/**
> + * Validate general requirements for tx offload in packet.
> + */

The API comment does not have the usual format.

> +static inline int
> +rte_validate_tx_offload(struct rte_mbuf *m)

should be const struct rte_mbuf *m

> +{
> +	uint64_t ol_flags = m->ol_flags;
> +
> +	/* Does packet set any of available offloads? */
> +	if (!(ol_flags & PKT_TX_OFFLOAD_MASK))
> +		return 0;
> +
> +	/* IP checksum can be counted only for IPv4 packet */
> +	if ((ol_flags & PKT_TX_IP_CKSUM) && (ol_flags & PKT_TX_IPV6))
> +		return -EINVAL;
> +
> +	if (ol_flags & (PKT_TX_L4_MASK | PKT_TX_TCP_SEG))
> +		/* IP type not set */
> +		if (!(ol_flags & (PKT_TX_IPV4 | PKT_TX_IPV6)))
> +			return -EINVAL;
> +
> +	if (ol_flags & PKT_TX_TCP_SEG)
> +		/* PKT_TX_IP_CKSUM offload not set for IPv4 TSO packet */
> +		if ((m->tso_segsz == 0) ||
> +				((ol_flags & PKT_TX_IPV4) && !(ol_flags & PKT_TX_IP_CKSUM)))
> +			return -EINVAL;
> +
> +	/* PKT_TX_OUTER_IP_CKSUM set for non outer IPv4 packet. */
> +	if ((ol_flags & PKT_TX_OUTER_IP_CKSUM) && !(ol_flags & PKT_TX_OUTER_IPV4))
> +		return -EINVAL;
> +
> +	return 0;
> +}

It looks like this function is only used when RTE_LIBRTE_ETHDEV_DEBUG
is set.

I'd say this function should go in rte_mbuf.h, because it's purely
related to mbuf flags, and does not rely on packet data or network
headers.


> +
> +/**
> + * Fix pseudo header checksum for TSO and non-TSO tcp/udp packets before
> + * hardware tx checksum.
> + * For non-TSO tcp/udp packets full pseudo-header checksum is counted and set.
> + * For TSO the IP payload length is not included.
> + */

The API comment should be fixed.

> +static inline int
> +rte_phdr_cksum_fix(struct rte_mbuf *m)
> +{
> +	struct ipv4_hdr *ipv4_hdr;
> +	struct ipv6_hdr *ipv6_hdr;
> +	struct tcp_hdr *tcp_hdr;
> +	struct udp_hdr *udp_hdr;
> +	uint64_t ol_flags = m->ol_flags;
> +	uint64_t inner_l3_offset = m->l2_len;
> +
> +	if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> +		inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> +
> +	if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
> +		if (ol_flags & PKT_TX_IPV4) {
> +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> +					inner_l3_offset);
> +
> +			if (ol_flags & PKT_TX_IP_CKSUM)
> +				ipv4_hdr->hdr_checksum = 0;
> +
> +			udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr + m->l3_len);
> +			udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> +		} else {
> +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> +					inner_l3_offset);
> +			/* non-TSO udp */
> +			udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
> +					inner_l3_offset + m->l3_len);
> +			udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> +		}
> +	} else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
> +			(ol_flags & PKT_TX_TCP_SEG)) {
> +		if (ol_flags & PKT_TX_IPV4) {
> +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> +					inner_l3_offset);
> +
> +			if (ol_flags & PKT_TX_IP_CKSUM)
> +				ipv4_hdr->hdr_checksum = 0;
> +
> +			/* non-TSO tcp or TSO */
> +			tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + m->l3_len);
> +			tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> +		} else {
> +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> +					inner_l3_offset);
> +			/* non-TSO tcp or TSO */
> +			tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
> +					inner_l3_offset + m->l3_len);
> +			tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> +		}
> +	}
> +	return 0;
> +}
> +
> +#endif /* _RTE_PKT_H_ */
> 

The function expects that all the network headers are in the first
segment, and that each of them is contiguous.

Also, I had an interesting remark from Stephen [1] on similar code.
If the mbuf is a clone, it will modify the data of the direct mbuf,
which should be read-only. Note that it is likely to happen in
a TCP stack, because the packet is kept locally in case it has to
be retransmitted. Cloning a mbuf is more efficient than duplicating
it.

I plan to fix it in virtio code by "uncloning" the headers.

[1] http://dpdk.org/ml/archives/dev/2016-October/048873.html



Regards,
Olivier
  
Tomasz Kulasek Oct. 19, 2016, 3:42 p.m. UTC | #2
Hi Olivier,

> -----Original Message-----
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Tuesday, October 18, 2016 16:57
> To: Kulasek, TomaszX <tomaszx.kulasek@intel.com>; dev@dpdk.org
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> thomas.monjalon@6wind.com
> Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation
> 
> Hi Tomasz,
> 
> I think the principle of tx_prep() is good; it may for instance help to
> remove the function virtio_tso_fix_cksum() from the virtio driver, and
> maybe even change the mbuf TSO/cksum API.
> 
> I have some questions/comments below, I'm sorry it comes very late.
> 
> On 10/14/2016 05:05 PM, Tomasz Kulasek wrote:
> > Added API for `rte_eth_tx_prep`
> >
> > uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
> > 	struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> >
> > Added fields to the `struct rte_eth_desc_lim`:
> >
> > 	uint16_t nb_seg_max;
> > 		/**< Max number of segments per whole packet. */
> >
> > 	uint16_t nb_mtu_seg_max;
> > 		/**< Max number of segments per one MTU */
> 
> Not sure I understand the second one. Is this for the TSO case?
> 
> Is it a usual limitation in different network hardware?
> Can this info be retrieved/used by the application?
> 

Yes, the limitation on the number of segments may differ depending on
TSO/non-TSO, e.g. for the Fortville NIC. This information is available
to the application.
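
E.g. (a sketch; port_id assumed valid):

	struct rte_eth_dev_info dev_info;

	rte_eth_dev_info_get(port_id, &dev_info);
	/* per-packet and per-MTU segment limits reported by the PMD */
	uint16_t nb_seg_max = dev_info.tx_desc_lim.nb_seg_max;
	uint16_t nb_mtu_seg_max = dev_info.tx_desc_lim.nb_mtu_seg_max;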

> >
> > Created `rte_pkt.h` header with commonly used functions:
> >
> > int rte_validate_tx_offload(struct rte_mbuf *m)
> > 	to validate general requirements for Tx offload set in a packet,
> > 	such as flag completeness. In the current implementation this
> > 	function is called optionally, when RTE_LIBRTE_ETHDEV_DEBUG is
> > 	enabled.
> >
> > int rte_phdr_cksum_fix(struct rte_mbuf *m)
> > 	to fix the pseudo-header checksum for TSO and non-TSO TCP/UDP
> > 	packets before hardware Tx checksum offload:
> > 	 - for non-TSO TCP/UDP packets the full pseudo-header checksum is
> > 	   computed and set;
> > 	 - for TSO the IP payload length is not included.
> 
> Why not in rte_net.h?
> 
> 
> > [...]
> >
> > @@ -2816,6 +2825,82 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t
> queue_id,
> >  	return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts,
> > nb_pkts);  }
> >
> > +/**
> > + * Process a burst of output packets on a transmit queue of an Ethernet
> device.
> > + *
> > + * The rte_eth_tx_prep() function is invoked to prepare output
> > +packets to be
> > + * transmitted on the output queue *queue_id* of the Ethernet device
> > +designated
> > + * by its *port_id*.
> > + * The *nb_pkts* parameter is the number of packets to be prepared
> > +which are
> > + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of
> > +them
> > + * allocated from a pool created with rte_pktmbuf_pool_create().
> > + * For each packet to send, the rte_eth_tx_prep() function performs
> > + * the following operations:
> > + *
> > + * - Check if packet meets devices requirements for tx offloads.
> 
> Do you mean hardware requirements?
> Can the application be aware of these requirements? I mean capability
> flags, or something in dev_infos?

Yes. If some offloads cannot be handled by the hardware, it fails; likewise if, e.g., the number of segments is invalid, and so on.

> 
> Maybe the comment could be more precise?
> 
> > + * - Check limitations about number of segments.
> > + *
> > + * - Check additional requirements when debug is enabled.
> 
> What kind of additional requirements?
> 

We may assume that the application sets things correctly, e.g. the right IP header version is required for most checksum offloads. To avoid additional performance overhead, these checks are done only when debug is enabled.

> > + *
> > + * - Update and/or reset required checksums when tx offload is set for
> packet.
> > + *
> 
> By reading this, I think it may not be clear to the user what should
> be set in the mbuf. In the mbuf API, it is said:
> 
>  * TCP segmentation offload. To enable this offload feature for a
>  * packet to be transmitted on hardware supporting TSO:
>  *  - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
>  *    PKT_TX_TCP_CKSUM)
>  *  - set the flag PKT_TX_IPV4 or PKT_TX_IPV6
>  *  - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP checksum
>  *    to 0 in the packet
>  *  - fill the mbuf offload information: l2_len, l3_len, l4_len, tso_segsz
>  *  - calculate the pseudo header checksum without taking ip_len in
> account,
>  *    and set it in the TCP header. Refer to rte_ipv4_phdr_cksum() and
>  *    rte_ipv6_phdr_cksum() that can be used as helpers.
> 
> 
> If I understand well, using tx_prep(), the user will have to do the same
> except writing the IP checksum to 0, and without setting the TCP pseudo
> header checksum, right?
> 

Right, but this header information is still valid for a tx_burst operation done without the preparation stage.

> 
> > + * The rte_eth_tx_prep() function returns the number of packets ready
> > + to be
> > + * sent. A return value equal to *nb_pkts* means that all packets are
> > + valid and
> > + * ready to be sent.
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The index of the transmit queue through which output packets must
> be
> > + *   sent.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param tx_pkts
> > + *   The address of an array of *nb_pkts* pointers to *rte_mbuf*
> structures
> > + *   which contain the output packets.
> > + * @param nb_pkts
> > + *   The maximum number of packets to process.
> > + * @return
> > + *   The number of packets correct and ready to be sent. The return
> value can be
> > + *   less than the value of the *tx_pkts* parameter when some packet
> doesn't
> > + *   meet devices requirements with rte_errno set appropriately.
> > + */
> 
> Can we add the constraint that invalid packets are left untouched?
> 
> I think most of the time there will be a software fallback in that case,
> so it would be good to ensure that this function does not change the flags
> or the packet data.

In the current implementation, if a packet is invalid, its data is never modified; only checks are done. The only exception is when a checksum needs to be updated or initialized, but that happens after packet validation.
If we want to use/restore the packet in the application, it should not be changed in any way for invalid packets.

> 
> Another thing that could be interesting for the caller is to know the
> reason of the failure. Maybe the different errno types could be detailed
> here. For instance:
> - EINVAL: offload flags are not correctly set (i.e. would fail whatever
>   the hardware)
> - ENOTSUP: the offload feature is not supported by the hardware
> - ...
> 

Ok.

> > +
> > +#ifdef RTE_ETHDEV_TX_PREP
> > +
> > +static inline uint16_t
> > +rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id, struct rte_mbuf
> **tx_pkts,
> > +		uint16_t nb_pkts)
> > +{
> > +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +	if (!dev->tx_pkt_prep)
> > +		return nb_pkts;
> > +
> > +#ifdef RTE_LIBRTE_ETHDEV_DEBUG
> > +	if (queue_id >= dev->data->nb_tx_queues) {
> > +		RTE_PMD_DEBUG_TRACE("Invalid TX queue_id=%d\n", queue_id);
> > +		rte_errno = -EINVAL;
> > +		return 0;
> > +	}
> > +#endif
> 
> Why check the queue_id but not the port_id?
> 
> Maybe the API comment should also be modified to say that the port_id has
> to be valid, because most ethdev functions do the check.
> 

I can add this check when debug is on to make it more complete, and update the API comment for the default case.

> > +
> > +	return (*dev->tx_pkt_prep)(dev->data->tx_queues[queue_id],
> > +			tx_pkts, nb_pkts);
> > +}
> > +
> > +#else
> > +
> > +static inline uint16_t
> > +rte_eth_tx_prep(__rte_unused uint8_t port_id, __rte_unused uint16_t
> queue_id,
> > +		__rte_unused struct rte_mbuf **tx_pkts, uint16_t nb_pkts) {
> > +	return nb_pkts;
> > +}
> > +
> > +#endif
> > +
> 
> nit: I wonder if the #else part should be inside the function instead
> (with the proper RTE_SET_USED()), it would avoid defining the
> prototype twice.
> 
> >  typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t
> count,
> >  		void *userdata);
> >
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 109e666..cfd6284 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -276,6 +276,15 @@ extern "C" {
> >   */
> >  #define PKT_TX_OUTER_IPV4   (1ULL << 59)
> >
> > +#define PKT_TX_OFFLOAD_MASK (    \
> > +		PKT_TX_IP_CKSUM |        \
> > +		PKT_TX_L4_MASK |         \
> > +		PKT_TX_OUTER_IP_CKSUM |  \
> > +		PKT_TX_TCP_SEG |         \
> > +		PKT_TX_QINQ_PKT |        \
> > +		PKT_TX_VLAN_PKT |        \
> > +		PKT_TX_TUNNEL_MASK)
> > +
> 
> Could you add an API comment?
> 
> > --- /dev/null
> > +++ b/lib/librte_net/rte_pkt.h
> >
> > [...]
> > +/**
> > + * Validate general requirements for tx offload in packet.
> > + */
> 
> The API comment does not have the usual format.
> 
> > +static inline int
> > +rte_validate_tx_offload(struct rte_mbuf *m)
> 
> should be const struct rte_mbuf *m
> 
> > +{
> > +	uint64_t ol_flags = m->ol_flags;
> > +
> > +	/* Does packet set any of available offloads? */
> > +	if (!(ol_flags & PKT_TX_OFFLOAD_MASK))
> > +		return 0;
> > +
> > +	/* IP checksum can be counted only for IPv4 packet */
> > +	if ((ol_flags & PKT_TX_IP_CKSUM) && (ol_flags & PKT_TX_IPV6))
> > +		return -EINVAL;
> > +
> > +	if (ol_flags & (PKT_TX_L4_MASK | PKT_TX_TCP_SEG))
> > +		/* IP type not set */
> > +		if (!(ol_flags & (PKT_TX_IPV4 | PKT_TX_IPV6)))
> > +			return -EINVAL;
> > +
> > +	if (ol_flags & PKT_TX_TCP_SEG)
> > +		/* PKT_TX_IP_CKSUM offload not set for IPv4 TSO packet */
> > +		if ((m->tso_segsz == 0) ||
> > +				((ol_flags & PKT_TX_IPV4) && !(ol_flags &
> PKT_TX_IP_CKSUM)))
> > +			return -EINVAL;
> > +
> > +	/* PKT_TX_OUTER_IP_CKSUM set for non outer IPv4 packet. */
> > +	if ((ol_flags & PKT_TX_OUTER_IP_CKSUM) && !(ol_flags &
> PKT_TX_OUTER_IPV4))
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
> 
> It looks like this function is only used when RTE_LIBRTE_ETHDEV_DEBUG is set.
> 

Yes, for performance reasons we expect that the application sets these values correctly. These are most of the "additional checks" mentioned before.

> I'd say this function should go in rte_mbuf.h, because it's purely related
> to mbuf flags, and does not rely on packet data or network headers.
> 
> 
> > +
> > +/**
> > + * Fix pseudo header checksum for TSO and non-TSO tcp/udp packets
> > +before
> > + * hardware tx checksum.
> > + * For non-TSO tcp/udp packets full pseudo-header checksum is counted
> and set.
> > + * For TSO the IP payload length is not included.
> > + */
> 
> The API comment should be fixed.

Ok.

> 
> > +static inline int
> > +rte_phdr_cksum_fix(struct rte_mbuf *m) {
> > +	struct ipv4_hdr *ipv4_hdr;
> > +	struct ipv6_hdr *ipv6_hdr;
> > +	struct tcp_hdr *tcp_hdr;
> > +	struct udp_hdr *udp_hdr;
> > +	uint64_t ol_flags = m->ol_flags;
> > +	uint64_t inner_l3_offset = m->l2_len;
> > +
> > +	if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> > +		inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> > +
> > +	if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
> > +		if (ol_flags & PKT_TX_IPV4) {
> > +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > +					inner_l3_offset);
> > +
> > +			if (ol_flags & PKT_TX_IP_CKSUM)
> > +				ipv4_hdr->hdr_checksum = 0;
> > +
> > +			udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr + m-
> >l3_len);
> > +			udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr,
> ol_flags);
> > +		} else {
> > +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > +					inner_l3_offset);
> > +			/* non-TSO udp */
> > +			udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
> > +					inner_l3_offset + m->l3_len);
> > +			udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr,
> ol_flags);
> > +		}
> > +	} else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
> > +			(ol_flags & PKT_TX_TCP_SEG)) {
> > +		if (ol_flags & PKT_TX_IPV4) {
> > +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > +					inner_l3_offset);
> > +
> > +			if (ol_flags & PKT_TX_IP_CKSUM)
> > +				ipv4_hdr->hdr_checksum = 0;
> > +
> > +			/* non-TSO tcp or TSO */
> > +			tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + m-
> >l3_len);
> > +			tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> > +		} else {
> > +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > +					inner_l3_offset);
> > +			/* non-TSO tcp or TSO */
> > +			tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
> > +					inner_l3_offset + m->l3_len);
> > +			tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> > +#endif /* _RTE_PKT_H_ */
> >
> 
> The function expects that all the network headers are in the first
> segment, and that each of them is contiguous.
> 

Yes, I see...

> Also, I had an interesting remark from Stephen [1] on similar code.
> If the mbuf is a clone, it will modify the data of the direct mbuf, which
> should be read-only. Note that it is likely to happen in a TCP stack,
> because the packet is kept locally in case it has to be retransmitted.
> Cloning a mbuf is more efficient than duplicating it.
> 
> I plan to fix it in virtio code by "uncloning" the headers.
> 
> [1] http://dpdk.org/ml/archives/dev/2016-October/048873.html
> 
> 
> 
> Regards,
> Olivier

Thanks,
Tomasz
  
Ananyev, Konstantin Oct. 19, 2016, 10:07 p.m. UTC | #3
Hi guys,

> >
> > > +static inline int
> > > +rte_phdr_cksum_fix(struct rte_mbuf *m) {
> > > +	struct ipv4_hdr *ipv4_hdr;
> > > +	struct ipv6_hdr *ipv6_hdr;
> > > +	struct tcp_hdr *tcp_hdr;
> > > +	struct udp_hdr *udp_hdr;
> > > +	uint64_t ol_flags = m->ol_flags;
> > > +	uint64_t inner_l3_offset = m->l2_len;
> > > +
> > > +	if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> > > +		inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> > > +
> > > +	if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
> > > +		if (ol_flags & PKT_TX_IPV4) {
> > > +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > > +					inner_l3_offset);
> > > +
> > > +			if (ol_flags & PKT_TX_IP_CKSUM)
> > > +				ipv4_hdr->hdr_checksum = 0;
> > > +
> > > +			udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr + m-
> > >l3_len);
> > > +			udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr,
> > ol_flags);
> > > +		} else {
> > > +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > > +					inner_l3_offset);
> > > +			/* non-TSO udp */
> > > +			udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
> > > +					inner_l3_offset + m->l3_len);
> > > +			udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr,
> > ol_flags);
> > > +		}
> > > +	} else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
> > > +			(ol_flags & PKT_TX_TCP_SEG)) {
> > > +		if (ol_flags & PKT_TX_IPV4) {
> > > +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > > +					inner_l3_offset);
> > > +
> > > +			if (ol_flags & PKT_TX_IP_CKSUM)
> > > +				ipv4_hdr->hdr_checksum = 0;
> > > +
> > > +			/* non-TSO tcp or TSO */
> > > +			tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + m-
> > >l3_len);
> > > +			tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> > > +		} else {
> > > +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > > +					inner_l3_offset);
> > > +			/* non-TSO tcp or TSO */
> > > +			tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
> > > +					inner_l3_offset + m->l3_len);
> > > +			tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> > > +		}
> > > +	}
> > > +	return 0;
> > > +}
> > > +
> > > +#endif /* _RTE_PKT_H_ */
> > >
> >
> > The function expects that all the network headers are in the first
> > segment, and that each of them is contiguous.
> >
> 
> Yes, I see...

Yes it does.
I suppose it is a legitimate restriction (assumption) for those who'd
like to use that function.
But that's a good point, and I suppose we need to state it explicitly
in the API comments.

> 
> > Also, I had an interesting remark from Stephen [1] on similar code.
> > If the mbuf is a clone, it will modify the data of the direct mbuf, which
> > should be read-only. Note that it is likely to happen in a TCP stack,
> > because the packet is kept locally in case it has to be retransmitted.
> > Cloning a mbuf is more efficient than duplicating it.
> >
> > I plan to fix it in virtio code by "uncloning" the headers.
> >
> > [1] http://dpdk.org/ml/archives/dev/2016-October/048873.html

This subject is probably a bit off-topic...
My position on it: it shouldn't be the PMD's responsibility to make
these kinds of checks.
I think it should be the upper layer's responsibility to provide the
PMD with an mbuf that can be safely used (and, in that case, modified)
by the driver.
Let's say the upper layer would like to use packet clones for TCP
retransmissions: it can easily overcome that problem by cloning only
the data part of the packet, putting the L2/L3/L4 headers in a separate
(head) mbuf, and chaining it with the cloned data mbuf (see the sketch
below).
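
A rough sketch of that scheme (hdr_pool, clone_pool, hdr_len and
build_headers() are hypothetical; payload is the mbuf the stack keeps
for retransmission):

	/* private, writable header mbuf + shared, read-only payload */
	struct rte_mbuf *hdr = rte_pktmbuf_alloc(hdr_pool);
	struct rte_mbuf *clone = rte_pktmbuf_clone(payload, clone_pool);

	/* write the L2/L3/L4 headers directly into the head mbuf */
	build_headers(rte_pktmbuf_append(hdr, hdr_len));

	/* chain them: the PMD may modify hdr freely */
	hdr->next = clone;
	hdr->nb_segs = clone->nb_segs + 1;
	hdr->pkt_len = hdr->data_len + clone->pkt_len;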

Konstantin
  

Patch

diff --git a/config/common_base b/config/common_base
index c7fd3db..619284b 100644
--- a/config/common_base
+++ b/config/common_base
@@ -120,6 +120,7 @@  CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
 CONFIG_RTE_LIBRTE_IEEE1588=n
 CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
 CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
+CONFIG_RTE_ETHDEV_TX_PREP=y
 
 #
 # Support NIC bypass logic
diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 38641e8..a10ed9c 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -182,6 +182,7 @@  extern "C" {
 #include <rte_pci.h>
 #include <rte_dev.h>
 #include <rte_devargs.h>
+#include <rte_errno.h>
 #include "rte_ether.h"
 #include "rte_eth_ctrl.h"
 #include "rte_dev_info.h"
@@ -699,6 +700,8 @@  struct rte_eth_desc_lim {
 	uint16_t nb_max;   /**< Max allowed number of descriptors. */
 	uint16_t nb_min;   /**< Min allowed number of descriptors. */
 	uint16_t nb_align; /**< Number of descriptors should be aligned to. */
+	uint16_t nb_seg_max;     /**< Max number of segments per whole packet. */
+	uint16_t nb_mtu_seg_max; /**< Max number of segments per one MTU */
 };
 
 /**
@@ -1188,6 +1191,11 @@  typedef uint16_t (*eth_tx_burst_t)(void *txq,
 				   uint16_t nb_pkts);
 /**< @internal Send output packets on a transmit queue of an Ethernet device. */
 
+typedef uint16_t (*eth_tx_prep_t)(void *txq,
+				   struct rte_mbuf **tx_pkts,
+				   uint16_t nb_pkts);
+/**< @internal Prepare output packets on a transmit queue of an Ethernet device. */
+
 typedef int (*flow_ctrl_get_t)(struct rte_eth_dev *dev,
 			       struct rte_eth_fc_conf *fc_conf);
 /**< @internal Get current flow control parameter on an Ethernet device */
@@ -1622,6 +1630,7 @@  struct rte_eth_rxtx_callback {
 struct rte_eth_dev {
 	eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function. */
 	eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function. */
+	eth_tx_prep_t tx_pkt_prep; /**< Pointer to PMD transmit prepare function. */
 	struct rte_eth_dev_data *data;  /**< Pointer to device data */
 	const struct eth_driver *driver;/**< Driver for this device */
 	const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD */
@@ -2816,6 +2825,82 @@  rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
 	return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts);
 }
 
+/**
+ * Process a burst of output packets on a transmit queue of an Ethernet device.
+ *
+ * The rte_eth_tx_prep() function is invoked to prepare output packets to be
+ * transmitted on the output queue *queue_id* of the Ethernet device designated
+ * by its *port_id*.
+ * The *nb_pkts* parameter is the number of packets to be prepared which are
+ * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
+ * allocated from a pool created with rte_pktmbuf_pool_create().
+ * For each packet to send, the rte_eth_tx_prep() function performs
+ * the following operations:
+ *
+ * - Check if packet meets devices requirements for tx offloads.
+ *
+ * - Check limitations about number of segments.
+ *
+ * - Check additional requirements when debug is enabled.
+ *
+ * - Update and/or reset required checksums when tx offload is set for packet.
+ *
+ * The rte_eth_tx_prep() function returns the number of packets ready to be
+ * sent. A return value equal to *nb_pkts* means that all packets are valid and
+ * ready to be sent.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The index of the transmit queue through which output packets must be
+ *   sent.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param tx_pkts
+ *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
+ *   which contain the output packets.
+ * @param nb_pkts
+ *   The maximum number of packets to process.
+ * @return
+ *   The number of packets correct and ready to be sent. The return value can be
+ *   less than the value of the *tx_pkts* parameter when some packet doesn't
+ *   meet devices requirements with rte_errno set appropriately.
+ */
+
+#ifdef RTE_ETHDEV_TX_PREP
+
+static inline uint16_t
+rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (!dev->tx_pkt_prep)
+		return nb_pkts;
+
+#ifdef RTE_LIBRTE_ETHDEV_DEBUG
+	if (queue_id >= dev->data->nb_tx_queues) {
+		RTE_PMD_DEBUG_TRACE("Invalid TX queue_id=%d\n", queue_id);
+		rte_errno = -EINVAL;
+		return 0;
+	}
+#endif
+
+	return (*dev->tx_pkt_prep)(dev->data->tx_queues[queue_id],
+			tx_pkts, nb_pkts);
+}
+
+#else
+
+static inline uint16_t
+rte_eth_tx_prep(__rte_unused uint8_t port_id, __rte_unused uint16_t queue_id,
+		__rte_unused struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	return nb_pkts;
+}
+
+#endif
+
 typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
 		void *userdata);
 
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 109e666..cfd6284 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -276,6 +276,15 @@  extern "C" {
  */
 #define PKT_TX_OUTER_IPV4   (1ULL << 59)
 
+#define PKT_TX_OFFLOAD_MASK (    \
+		PKT_TX_IP_CKSUM |        \
+		PKT_TX_L4_MASK |         \
+		PKT_TX_OUTER_IP_CKSUM |  \
+		PKT_TX_TCP_SEG |         \
+		PKT_TX_QINQ_PKT |        \
+		PKT_TX_VLAN_PKT |        \
+		PKT_TX_TUNNEL_MASK)
+
 /**
  * Packet outer header is IPv6. This flag must be set when using any
  * outer offload feature (L4 checksum) to tell the NIC that the outer
diff --git a/lib/librte_net/Makefile b/lib/librte_net/Makefile
index e5758ce..cc69bc0 100644
--- a/lib/librte_net/Makefile
+++ b/lib/librte_net/Makefile
@@ -1,6 +1,6 @@ 
 #   BSD LICENSE
 #
-#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+#   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
 #   All rights reserved.
 #
 #   Redistribution and use in source and binary forms, with or without
@@ -44,6 +44,7 @@  SRCS-$(CONFIG_RTE_LIBRTE_NET) := rte_net.c
 SYMLINK-$(CONFIG_RTE_LIBRTE_NET)-include := rte_ip.h rte_tcp.h rte_udp.h
 SYMLINK-$(CONFIG_RTE_LIBRTE_NET)-include += rte_sctp.h rte_icmp.h rte_arp.h
 SYMLINK-$(CONFIG_RTE_LIBRTE_NET)-include += rte_ether.h rte_gre.h rte_net.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_NET)-include += rte_pkt.h
 
 DEPDIRS-$(CONFIG_RTE_LIBRTE_NET) += lib/librte_eal lib/librte_mempool
 DEPDIRS-$(CONFIG_RTE_LIBRTE_NET) += lib/librte_mbuf
diff --git a/lib/librte_net/rte_pkt.h b/lib/librte_net/rte_pkt.h
new file mode 100644
index 0000000..c4bd7b2
--- /dev/null
+++ b/lib/librte_net/rte_pkt.h
@@ -0,0 +1,137 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_PKT_H_
+#define _RTE_PKT_H_
+
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_tcp.h>
+#include <rte_sctp.h>
+
+/**
+ * Validate general requirements for tx offload in packet.
+ */
+static inline int
+rte_validate_tx_offload(struct rte_mbuf *m)
+{
+	uint64_t ol_flags = m->ol_flags;
+
+	/* Does packet set any of available offloads? */
+	if (!(ol_flags & PKT_TX_OFFLOAD_MASK))
+		return 0;
+
+	/* IP checksum can be counted only for IPv4 packet */
+	if ((ol_flags & PKT_TX_IP_CKSUM) && (ol_flags & PKT_TX_IPV6))
+		return -EINVAL;
+
+	if (ol_flags & (PKT_TX_L4_MASK | PKT_TX_TCP_SEG))
+		/* IP type not set */
+		if (!(ol_flags & (PKT_TX_IPV4 | PKT_TX_IPV6)))
+			return -EINVAL;
+
+	if (ol_flags & PKT_TX_TCP_SEG)
+		/* PKT_TX_IP_CKSUM offload not set for IPv4 TSO packet */
+		if ((m->tso_segsz == 0) ||
+				((ol_flags & PKT_TX_IPV4) && !(ol_flags & PKT_TX_IP_CKSUM)))
+			return -EINVAL;
+
+	/* PKT_TX_OUTER_IP_CKSUM set for non outer IPv4 packet. */
+	if ((ol_flags & PKT_TX_OUTER_IP_CKSUM) && !(ol_flags & PKT_TX_OUTER_IPV4))
+		return -EINVAL;
+
+	return 0;
+}
+
+/**
+ * Fix pseudo header checksum for TSO and non-TSO tcp/udp packets before
+ * hardware tx checksum.
+ * For non-TSO tcp/udp packets full pseudo-header checksum is counted and set.
+ * For TSO the IP payload length is not included.
+ */
+static inline int
+rte_phdr_cksum_fix(struct rte_mbuf *m)
+{
+	struct ipv4_hdr *ipv4_hdr;
+	struct ipv6_hdr *ipv6_hdr;
+	struct tcp_hdr *tcp_hdr;
+	struct udp_hdr *udp_hdr;
+	uint64_t ol_flags = m->ol_flags;
+	uint64_t inner_l3_offset = m->l2_len;
+
+	if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
+		inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
+
+	if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
+		if (ol_flags & PKT_TX_IPV4) {
+			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
+					inner_l3_offset);
+
+			if (ol_flags & PKT_TX_IP_CKSUM)
+				ipv4_hdr->hdr_checksum = 0;
+
+			udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr + m->l3_len);
+			udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
+		} else {
+			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
+					inner_l3_offset);
+			/* non-TSO udp */
+			udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
+					inner_l3_offset + m->l3_len);
+			udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
+		}
+	} else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
+			(ol_flags & PKT_TX_TCP_SEG)) {
+		if (ol_flags & PKT_TX_IPV4) {
+			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
+					inner_l3_offset);
+
+			if (ol_flags & PKT_TX_IP_CKSUM)
+				ipv4_hdr->hdr_checksum = 0;
+
+			/* non-TSO tcp or TSO */
+			tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + m->l3_len);
+			tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
+		} else {
+			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
+					inner_l3_offset);
+			/* non-TSO tcp or TSO */
+			tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
+					inner_l3_offset + m->l3_len);
+			tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
+		}
+	}
+	return 0;
+}
+
+#endif /* _RTE_PKT_H_ */