[dpdk-dev,v12,1/6] ethdev: add Tx preparation

Message ID 1479922585-8640-2-git-send-email-tomaszx.kulasek@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Checks

Context Check Description
checkpatch/checkpatch success coding style OK

Commit Message

Tomasz Kulasek Nov. 23, 2016, 5:36 p.m. UTC
  Added API for `rte_eth_tx_prepare`

uint16_t rte_eth_tx_prepare(uint8_t port_id, uint16_t queue_id,
	struct rte_mbuf **tx_pkts, uint16_t nb_pkts)

Added fields to the `struct rte_eth_desc_lim`:

	uint16_t nb_seg_max;
		/**< Max number of segments per whole packet. */

	uint16_t nb_mtu_seg_max;
		/**< Max number of segments per one MTU */

Added functions:

int rte_validate_tx_offload(struct rte_mbuf *m)
	to validate general requirements for the Tx offload settings in the
	mbuf of a packet, such as flag completeness. In the current
	implementation this function is called optionally, when
	RTE_LIBRTE_ETHDEV_DEBUG is enabled.

int rte_net_intel_cksum_prepare(struct rte_mbuf *m)
	to fix the pseudo-header checksum for TSO and non-TSO tcp/udp packets
	before the hardware Tx checksum offload.
	 - for non-TSO tcp/udp packets the full pseudo-header checksum is
	   computed and set.
	 - for TSO the IP payload length is not included.
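A minimal usage sketch (not part of the patch; port_id, queue_id, pkts and
nb_pkts are assumed to be set up by the application, and the error handling
shown is just one possible policy) of how the new call pairs with
rte_eth_tx_burst():

	uint16_t i, nb_prep, nb_tx;

	/* check/fix the burst against the device Tx requirements */
	nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);
	if (unlikely(nb_prep < nb_pkts)) {
		/* pkts[nb_prep] failed preparation; here we simply drop it
		 * and everything after it (application's choice) */
		for (i = nb_prep; i < nb_pkts; i++)
			rte_pktmbuf_free(pkts[i]);
	}
	/* transmit only the packets that passed preparation */
	nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);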

PERFORMANCE TESTS
-----------------

This feature was tested with a modified csum engine from testpmd.

The packet checksum preparation was moved from the application to the Tx
preparation step placed before the burst.

We may expect some overhead caused by:
1) using an additional callback before the burst,
2) rescanning the burst,
3) additional condition checks (packet validation),
4) worse optimization (e.g. packet data access, etc.)

We tested it using the ixgbe Tx preparation implementation with some
parts disabled, to get comparable information about the impact of the
different parts of the implementation.

IMPACT:

1) For an unimplemented Tx preparation callback the performance impact is
   negligible,
2) For packet condition checks without checksum modifications (nb_segs,
   available offloads, etc.) it is 14626628 vs. 14252168 (~2.62% drop),
3) For full support in the ixgbe driver (point 2 + packet checksum
   initialization) it is 14060924 vs. 13588094 (~3.48% drop).

Signed-off-by: Tomasz Kulasek <tomaszx.kulasek@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
---
 config/common_base            |    1 +
 lib/librte_ether/rte_ethdev.h |  106 +++++++++++++++++++++++++++++++++++++++++
 lib/librte_mbuf/rte_mbuf.h    |   64 +++++++++++++++++++++++++
 lib/librte_net/rte_net.h      |   85 +++++++++++++++++++++++++++++++++
 4 files changed, 256 insertions(+)
  

Comments

Thomas Monjalon Nov. 28, 2016, 10:54 a.m. UTC | #1
Hi,

2016-11-23 18:36, Tomasz Kulasek:
> --- a/config/common_base
> +++ b/config/common_base
> @@ -120,6 +120,7 @@ CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
>  CONFIG_RTE_LIBRTE_IEEE1588=n
>  CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
>  CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
> +CONFIG_RTE_ETHDEV_TX_PREPARE=y

Please remind me why there is a configuration option here.
It should be the responsibility of the application to call tx_prepare
or not. If the application chooses to use this new API but it is
disabled, then the packets won't be prepared and there is no error code:

> +#else
> +
> +static inline uint16_t
> +rte_eth_tx_prepare(__rte_unused uint8_t port_id, __rte_unused uint16_t queue_id,
> +               __rte_unused struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> +{
> +       return nb_pkts;
> +}
> +
> +#endif

So the application is not aware of the issue and it will not use
any fallback.
  
Thomas Monjalon Dec. 1, 2016, 4:24 p.m. UTC | #2
Please, a reply to this question would be greatly appreciated.

2016-11-28 11:54, Thomas Monjalon:
> Hi,
> 
> 2016-11-23 18:36, Tomasz Kulasek:
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -120,6 +120,7 @@ CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
> >  CONFIG_RTE_LIBRTE_IEEE1588=n
> >  CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
> >  CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
> > +CONFIG_RTE_ETHDEV_TX_PREPARE=y
> 
> Please, remind me why is there a configuration here.
> It should be the responsibility of the application to call tx_prepare
> or not. If the application choose to use this new API but it is
> disabled, then the packets won't be prepared and there is no error code:
> 
> > +#else
> > +
> > +static inline uint16_t
> > +rte_eth_tx_prepare(__rte_unused uint8_t port_id, __rte_unused uint16_t queue_id,
> > +               __rte_unused struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> > +{
> > +       return nb_pkts;
> > +}
> > +
> > +#endif
> 
> So the application is not aware of the issue and it will not use
> any fallback.
  
Thomas Monjalon Dec. 1, 2016, 4:26 p.m. UTC | #3
2016-11-23 18:36, Tomasz Kulasek:
> Added fields to the `struct rte_eth_desc_lim`:
> 
> 	uint16_t nb_seg_max;
> 		/**< Max number of segments per whole packet. */
> 
> 	uint16_t nb_mtu_seg_max;
> 		/**< Max number of segments per one MTU */

How (and when) is an application supposed to use these fields?
Is it useful to expose them if we make tx_prepare() mandatory?
  
Thomas Monjalon Dec. 1, 2016, 4:28 p.m. UTC | #4
2016-11-23 18:36, Tomasz Kulasek:
> +/**
> + * Process a burst of output packets on a transmit queue of an Ethernet device.
> + *
> + * The rte_eth_tx_prepare() function is invoked to prepare output packets to be
> + * transmitted on the output queue *queue_id* of the Ethernet device designated
> + * by its *port_id*.
> + * The *nb_pkts* parameter is the number of packets to be prepared which are
> + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
> + * allocated from a pool created with rte_pktmbuf_pool_create().
> + * For each packet to send, the rte_eth_tx_prepare() function performs
> + * the following operations:
> + *
> + * - Check if packet meets devices requirements for tx offloads.
> + *
> + * - Check limitations about number of segments.
> + *
> + * - Check additional requirements when debug is enabled.
> + *
> + * - Update and/or reset required checksums when tx offload is set for packet.
> + *
> + * Since this function can modify packet data, provided mbufs must be safely
> + * writable (e.g. modified data cannot be in shared segment).

I think we will have to remove this limitation in future releases.
As we don't know how it could affect the API, I suggest declaring this
API EXPERIMENTAL.
  
Tomasz Kulasek Dec. 1, 2016, 7:20 p.m. UTC | #5
Hi Thomas,

Sorry, I answered this question in another thread and missed this one. A detailed answer is below.

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Thursday, December 1, 2016 17:24
> To: Kulasek, TomaszX <tomaszx.kulasek@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> olivier.matz@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
> 
> Please, a reply to this question would be greatly appreciated.
> 
> 2016-11-28 11:54, Thomas Monjalon:
> > Hi,
> >
> > 2016-11-23 18:36, Tomasz Kulasek:
> > > --- a/config/common_base
> > > +++ b/config/common_base
> > > @@ -120,6 +120,7 @@ CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
> > >  CONFIG_RTE_LIBRTE_IEEE1588=n
> > >  CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
> > >  CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
> > > +CONFIG_RTE_ETHDEV_TX_PREPARE=y
> >
> > Please, remind me why is there a configuration here.
> > It should be the responsibility of the application to call tx_prepare
> > or not. If the application choose to use this new API but it is
> > disabled, then the packets won't be prepared and there is no error code:
> >
> > > +#else
> > > +
> > > +static inline uint16_t
> > > +rte_eth_tx_prepare(__rte_unused uint8_t port_id, __rte_unused
> uint16_t queue_id,
> > > +               __rte_unused struct rte_mbuf **tx_pkts, uint16_t
> > > +nb_pkts) {
> > > +       return nb_pkts;
> > > +}
> > > +
> > > +#endif
> >
> > So the application is not aware of the issue and it will not use any
> > fallback.

The tx_prepare mechanism can be turned off by a compilation flag (as discussed with Jerin in http://dpdk.org/dev/patchwork/patch/15770/) to provide real NOOP functionality (e.g. for low-end CPUs, where even an unnecessary memory dereference and check can have a significant impact on performance).

Jerin observed that on some architectures (e.g. low-end ARM with embedded NIC), just reading and comparing 'dev->tx_pkt_prepare' may cause a significant performance drop, so he proposed introducing this configuration flag to provide a real NOOP when the tx_prepare functionality is not required; it can be turned on based on the _target_ configuration.

For other cases, when this flag is turned on (the default) and tx_prepare is not implemented, a functional NOOP is used based on a comparison (dev->tx_pkt_prepare == NULL).
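For illustration, a simplified sketch of the two NOOP levels described above (illustrative only, not the exact patch code):

	#ifdef RTE_ETHDEV_TX_PREPARE
	static inline uint16_t
	rte_eth_tx_prepare(uint8_t port_id, uint16_t queue_id,
			struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
	{
		struct rte_eth_dev *dev = &rte_eth_devices[port_id];

		/* functional NOOP: the PMD did not provide a prepare callback */
		if (dev->tx_pkt_prepare == NULL)
			return nb_pkts;

		return (*dev->tx_pkt_prepare)(dev->data->tx_queues[queue_id],
				tx_pkts, nb_pkts);
	}
	#else
	/* real NOOP: the whole call compiles down to "return nb_pkts",
	 * with no device dereference at all */
	#endif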

Tomasz
  
Thomas Monjalon Dec. 1, 2016, 7:52 p.m. UTC | #6
2016-12-01 19:20, Kulasek, TomaszX:
> Hi Thomas,
> 
> Sorry, I have answered for this question in another thread and I missed about this one. Detailed answer is below.

Yes you already gave this answer.
And I will continue asking the question until you understand it.

> > 2016-11-28 11:54, Thomas Monjalon:
> > > Hi,
> > >
> > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > --- a/config/common_base
> > > > +++ b/config/common_base
> > > > @@ -120,6 +120,7 @@ CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
> > > >  CONFIG_RTE_LIBRTE_IEEE1588=n
> > > >  CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
> > > >  CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
> > > > +CONFIG_RTE_ETHDEV_TX_PREPARE=y
> > >
> > > Please, remind me why is there a configuration here.
> > > It should be the responsibility of the application to call tx_prepare
> > > or not. If the application choose to use this new API but it is
> > > disabled, then the packets won't be prepared and there is no error code:
> > >
> > > > +#else
> > > > +
> > > > +static inline uint16_t
> > > > +rte_eth_tx_prepare(__rte_unused uint8_t port_id, __rte_unused
> > uint16_t queue_id,
> > > > +               __rte_unused struct rte_mbuf **tx_pkts, uint16_t
> > > > +nb_pkts) {
> > > > +       return nb_pkts;
> > > > +}
> > > > +
> > > > +#endif
> > >
> > > So the application is not aware of the issue and it will not use any
> > > fallback.
> 
> tx_prepare mechanism can be turned off by compilation flag (as discussed with Jerin in http://dpdk.org/dev/patchwork/patch/15770/) to provide real NOOP functionality (e.g. for low-end CPUs, where even unnecessary memory dereference and check can have significant impact on performance).
> 
> Jerin observed that on some architectures (e.g. low-end ARM with embedded NIC), just reading and comparing 'dev->tx_pkt_prepare' may cause significant performance drop, so he proposed to introduce this configuration flag to provide real NOOP when tx_prepare functionality is not required, and can be turned on based on the _target_ configuration.
> 
> For other cases, when this flag is turned on (by default), and tx_prepare is not implemented, functional NOOP is used based on comparison (dev->tx_pkt_prepare == NULL).

So if the application calls this function and it is disabled, it simply
won't work. Packets won't be prepared, checksums won't be computed.

I give up, I just NACK.
  
Jerin Jacob Dec. 1, 2016, 9:56 p.m. UTC | #7
On Thu, Dec 01, 2016 at 08:52:22PM +0100, Thomas Monjalon wrote:
> 2016-12-01 19:20, Kulasek, TomaszX:
> > Hi Thomas,
> > 
> > Sorry, I have answered for this question in another thread and I missed about this one. Detailed answer is below.
> 
> Yes you already gave this answer.
> And I will continue asking the question until you understand it.
> 
> > > 2016-11-28 11:54, Thomas Monjalon:
> > > > Hi,
> > > >
> > > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > > --- a/config/common_base
> > > > > +++ b/config/common_base
> > > > > @@ -120,6 +120,7 @@ CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
> > > > >  CONFIG_RTE_LIBRTE_IEEE1588=n
> > > > >  CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
> > > > >  CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
> > > > > +CONFIG_RTE_ETHDEV_TX_PREPARE=y
> > > >
> > > > Please, remind me why is there a configuration here.
> > > > It should be the responsibility of the application to call tx_prepare
> > > > or not. If the application choose to use this new API but it is
> > > > disabled, then the packets won't be prepared and there is no error code:
> > > >
> > > > > +#else
> > > > > +
> > > > > +static inline uint16_t
> > > > > +rte_eth_tx_prepare(__rte_unused uint8_t port_id, __rte_unused
> > > uint16_t queue_id,
> > > > > +               __rte_unused struct rte_mbuf **tx_pkts, uint16_t
> > > > > +nb_pkts) {
> > > > > +       return nb_pkts;
> > > > > +}
> > > > > +
> > > > > +#endif
> > > >
> > > > So the application is not aware of the issue and it will not use any
> > > > fallback.
> > 
> > tx_prepare mechanism can be turned off by compilation flag (as discussed with Jerin in http://dpdk.org/dev/patchwork/patch/15770/) to provide real NOOP functionality (e.g. for low-end CPUs, where even unnecessary memory dereference and check can have significant impact on performance).
> > 
> > Jerin observed that on some architectures (e.g. low-end ARM with embedded NIC), just reading and comparing 'dev->tx_pkt_prepare' may cause significant performance drop, so he proposed to introduce this configuration flag to provide real NOOP when tx_prepare functionality is not required, and can be turned on based on the _target_ configuration.
> > 
> > For other cases, when this flag is turned on (by default), and tx_prepare is not implemented, functional NOOP is used based on comparison (dev->tx_pkt_prepare == NULL).
> 
> So if the application call this function and if it is disabled, it simply
> won't work. Packets won't be prepared, checksum won't be computed.
The use case I was referring to was the "integrated NIC" case where
- the DPDK target has no external NW PCI card support
AND
- the "integrated NIC" does not need tx_prepare
  
Tomasz Kulasek Dec. 1, 2016, 10:31 p.m. UTC | #8
Hi Thomas,

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Thursday, December 1, 2016 20:52
> To: Kulasek, TomaszX <tomaszx.kulasek@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> olivier.matz@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
> 
> 2016-12-01 19:20, Kulasek, TomaszX:
> > Hi Thomas,
> >
> > Sorry, I have answered for this question in another thread and I missed
> about this one. Detailed answer is below.
> 
> Yes you already gave this answer.
> And I will continue asking the question until you understand it.
> 
> > > 2016-11-28 11:54, Thomas Monjalon:
> > > > Hi,
> > > >
> > > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > > --- a/config/common_base
> > > > > +++ b/config/common_base
> > > > > @@ -120,6 +120,7 @@ CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
> > > > >  CONFIG_RTE_LIBRTE_IEEE1588=n
> > > > >  CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
> > > > >  CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
> > > > > +CONFIG_RTE_ETHDEV_TX_PREPARE=y
> > > >
> > > > Please, remind me why is there a configuration here.
> > > > It should be the responsibility of the application to call
> > > > tx_prepare or not. If the application choose to use this new API
> > > > but it is disabled, then the packets won't be prepared and there is
> no error code:
> > > >
> > > > > +#else
> > > > > +
> > > > > +static inline uint16_t
> > > > > +rte_eth_tx_prepare(__rte_unused uint8_t port_id, __rte_unused
> > > uint16_t queue_id,
> > > > > +               __rte_unused struct rte_mbuf **tx_pkts, uint16_t
> > > > > +nb_pkts) {
> > > > > +       return nb_pkts;
> > > > > +}
> > > > > +
> > > > > +#endif
> > > >
> > > > So the application is not aware of the issue and it will not use
> > > > any fallback.
> >
> > tx_prepare mechanism can be turned off by compilation flag (as discussed
> with Jerin in http://dpdk.org/dev/patchwork/patch/15770/) to provide real
> NOOP functionality (e.g. for low-end CPUs, where even unnecessary memory
> dereference and check can have significant impact on performance).
> >
> > Jerin observed that on some architectures (e.g. low-end ARM with
> embedded NIC), just reading and comparing 'dev->tx_pkt_prepare' may cause
> significant performance drop, so he proposed to introduce this
> configuration flag to provide real NOOP when tx_prepare functionality is
> not required, and can be turned on based on the _target_ configuration.
> >
> > For other cases, when this flag is turned on (by default), and
> tx_prepare is not implemented, functional NOOP is used based on comparison
> (dev->tx_pkt_prepare == NULL).
> 
> So if the application call this function and if it is disabled, it simply
> won't work. Packets won't be prepared, checksum won't be computed.
> 
> I give up, I just NACK.

It is not meant to be turned on/off at will, but only for the case when the platform developer knows that the platform doesn't need this callback, so it can be turned off to save some performance (this option is per target).

In this case, the behavior of tx_prepare will be exactly the same whether it is turned on or off; if it were not the same, there would be no sense in turning it off. There was a long topic where we tried to convince you that it should be turned on for all devices.

Tomasz
  
Thomas Monjalon Dec. 1, 2016, 11:50 p.m. UTC | #9
2016-12-01 22:31, Kulasek, TomaszX:
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > 2016-12-01 19:20, Kulasek, TomaszX:
> > > Hi Thomas,
> > >
> > > Sorry, I have answered for this question in another thread and I missed
> > about this one. Detailed answer is below.
> > 
> > Yes you already gave this answer.
> > And I will continue asking the question until you understand it.
> > 
> > > > 2016-11-28 11:54, Thomas Monjalon:
> > > > > Hi,
> > > > >
> > > > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > > > --- a/config/common_base
> > > > > > +++ b/config/common_base
> > > > > > @@ -120,6 +120,7 @@ CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
> > > > > >  CONFIG_RTE_LIBRTE_IEEE1588=n
> > > > > >  CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
> > > > > >  CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
> > > > > > +CONFIG_RTE_ETHDEV_TX_PREPARE=y
> > > > >
> > > > > Please, remind me why is there a configuration here.
> > > > > It should be the responsibility of the application to call
> > > > > tx_prepare or not. If the application choose to use this new API
> > > > > but it is disabled, then the packets won't be prepared and there is
> > no error code:
> > > > >
> > > > > > +#else
> > > > > > +
> > > > > > +static inline uint16_t
> > > > > > +rte_eth_tx_prepare(__rte_unused uint8_t port_id, __rte_unused
> > > > uint16_t queue_id,
> > > > > > +               __rte_unused struct rte_mbuf **tx_pkts, uint16_t
> > > > > > +nb_pkts) {
> > > > > > +       return nb_pkts;
> > > > > > +}
> > > > > > +
> > > > > > +#endif
> > > > >
> > > > > So the application is not aware of the issue and it will not use
> > > > > any fallback.
> > >
> > > tx_prepare mechanism can be turned off by compilation flag (as discussed
> > with Jerin in http://dpdk.org/dev/patchwork/patch/15770/) to provide real
> > NOOP functionality (e.g. for low-end CPUs, where even unnecessary memory
> > dereference and check can have significant impact on performance).
> > >
> > > Jerin observed that on some architectures (e.g. low-end ARM with
> > embedded NIC), just reading and comparing 'dev->tx_pkt_prepare' may cause
> > significant performance drop, so he proposed to introduce this
> > configuration flag to provide real NOOP when tx_prepare functionality is
> > not required, and can be turned on based on the _target_ configuration.
> > >
> > > For other cases, when this flag is turned on (by default), and
> > tx_prepare is not implemented, functional NOOP is used based on comparison
> > (dev->tx_pkt_prepare == NULL).
> > 
> > So if the application call this function and if it is disabled, it simply
> > won't work. Packets won't be prepared, checksum won't be computed.
> > 
> > I give up, I just NACK.
> 
> It is not to be turned on/off whatever someone wants, but only and only for the case, when platform developer knows, that his platform doesn't need this callback, so, he may turn off it and then save some performance (this option is per target).

How may he know? There is no comment in the config file, no documentation.

> For this case, the behavior of tx_prepare will be exactly the same when it is turned on or off. If is not the same, there's no sense to turn it off. There were long topic, where we've tried to convince you, that it should be turned on for all devices.

Really? You tried to convince me to turn it on?
No, you were trying to convince Jerin.
I think it is a wrong idea to allow disabling this function.
I didn't comment in the first discussion because Jerin said it was really
important for small hardware with a fixed NIC, and I thought it would
be implemented in a way that the application cannot be misled.

The only solution I see here is to add some comments in the configuration
file, below the #else and in the doc.
Have you checked doc/guides/prog_guide/poll_mode_drv.rst?
  
Ananyev, Konstantin Dec. 2, 2016, 12:10 a.m. UTC | #10
> 
> Hi Thomas,
> 
> > -----Original Message-----
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > Sent: Thursday, December 1, 2016 20:52
> > To: Kulasek, TomaszX <tomaszx.kulasek@intel.com>
> > Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > olivier.matz@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
> >
> > 2016-12-01 19:20, Kulasek, TomaszX:
> > > Hi Thomas,
> > >
> > > Sorry, I have answered for this question in another thread and I missed
> > about this one. Detailed answer is below.
> >
> > Yes you already gave this answer.
> > And I will continue asking the question until you understand it.
> >
> > > > 2016-11-28 11:54, Thomas Monjalon:
> > > > > Hi,
> > > > >
> > > > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > > > --- a/config/common_base
> > > > > > +++ b/config/common_base
> > > > > > @@ -120,6 +120,7 @@ CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
> > > > > >  CONFIG_RTE_LIBRTE_IEEE1588=n
> > > > > >  CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
> > > > > >  CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
> > > > > > +CONFIG_RTE_ETHDEV_TX_PREPARE=y
> > > > >
> > > > > Please, remind me why is there a configuration here.
> > > > > It should be the responsibility of the application to call
> > > > > tx_prepare or not. If the application choose to use this new API
> > > > > but it is disabled, then the packets won't be prepared and there is
> > no error code:
> > > > >
> > > > > > +#else
> > > > > > +
> > > > > > +static inline uint16_t
> > > > > > +rte_eth_tx_prepare(__rte_unused uint8_t port_id, __rte_unused
> > > > uint16_t queue_id,
> > > > > > +               __rte_unused struct rte_mbuf **tx_pkts, uint16_t
> > > > > > +nb_pkts) {
> > > > > > +       return nb_pkts;
> > > > > > +}
> > > > > > +
> > > > > > +#endif
> > > > >
> > > > > So the application is not aware of the issue and it will not use
> > > > > any fallback.
> > >
> > > tx_prepare mechanism can be turned off by compilation flag (as discussed
> > with Jerin in http://dpdk.org/dev/patchwork/patch/15770/) to provide real
> > NOOP functionality (e.g. for low-end CPUs, where even unnecessary memory
> > dereference and check can have significant impact on performance).
> > >
> > > Jerin observed that on some architectures (e.g. low-end ARM with
> > embedded NIC), just reading and comparing 'dev->tx_pkt_prepare' may cause
> > significant performance drop, so he proposed to introduce this
> > configuration flag to provide real NOOP when tx_prepare functionality is
> > not required, and can be turned on based on the _target_ configuration.
> > >
> > > For other cases, when this flag is turned on (by default), and
> > tx_prepare is not implemented, functional NOOP is used based on comparison
> > (dev->tx_pkt_prepare == NULL).
> >
> > So if the application call this function and if it is disabled, it simply
> > won't work. Packets won't be prepared, checksum won't be computed.
> >
> > I give up, I just NACK.
> 
> It is not to be turned on/off whatever someone wants, but only and only for the case, when platform developer knows, that his platform
> doesn't need this callback, so, he may turn off it and then save some performance (this option is per target).
> 
> For this case, the behavior of tx_prepare will be exactly the same when it is turned on or off. If is not the same, there's no sense to turn it
> off. There were long topic, where we've tried to convince you, that it should be turned on for all devices.

As Tomasz pointed out, RTE_ETHDEV_TX_PREPARE was introduced to fulfill Jerin's request.
From here:
"Low-end ARMv7,ARMv8 targets may not have PCIE-RC support and it may have
only integrated NIC controller. On those targets/configs, where integrated NIC
controller does not use tx_prep service it can made it as NOOP to save
cycles on following "rte_eth_tx_prep" and associated "if (unlikely(nb_prep
< nb_rx))" checks in the application."
According to the measurements he did, it can save ~7% on some low-end ARM machines.
You can read the whole story here:
http://dpdk.org/dev/patchwork/patch/15770/
Though, if you now believe that this is not a good enough reason,
I have absolutely no problem removing RTE_ETHDEV_TX_PREPARE and the associated logic.
I personally don't use ARM boxes and don't plan to,
and in theory users can still do conditional compilation at the upper layer, if they want to. 
Konstantin
  
Ananyev, Konstantin Dec. 2, 2016, 1:06 a.m. UTC | #11
> 
> 2016-11-23 18:36, Tomasz Kulasek:
> > +/**
> > + * Process a burst of output packets on a transmit queue of an Ethernet device.
> > + *
> > + * The rte_eth_tx_prepare() function is invoked to prepare output packets to be
> > + * transmitted on the output queue *queue_id* of the Ethernet device designated
> > + * by its *port_id*.
> > + * The *nb_pkts* parameter is the number of packets to be prepared which are
> > + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
> > + * allocated from a pool created with rte_pktmbuf_pool_create().
> > + * For each packet to send, the rte_eth_tx_prepare() function performs
> > + * the following operations:
> > + *
> > + * - Check if packet meets devices requirements for tx offloads.
> > + *
> > + * - Check limitations about number of segments.
> > + *
> > + * - Check additional requirements when debug is enabled.
> > + *
> > + * - Update and/or reset required checksums when tx offload is set for packet.
> > + *
> > + * Since this function can modify packet data, provided mbufs must be safely
> > + * writable (e.g. modified data cannot be in shared segment).
> 
> I think we will have to remove this limitation in next releases.
> As we don't know how it could affect the API, I suggest to declare this
> API EXPERIMENTAL.

While I don't really mind marking it as experimental, I don't really understand the reasoning:
why does "this function can modify packet data, provided mbufs must be safely writable" suddenly become a problem?
That seems like an obvious limitation to me, and let's say tx_burst() has the same one.
Second, I don't see how you are going to remove it without introducing a heavy performance impact.
Konstantin
  
Olivier Matz Dec. 2, 2016, 8:24 a.m. UTC | #12
Hi Konstantin,

On Fri, 2 Dec 2016 01:06:30 +0000, "Ananyev, Konstantin"
<konstantin.ananyev@intel.com> wrote:
> > 
> > 2016-11-23 18:36, Tomasz Kulasek:  
> > > +/**
> > > + * Process a burst of output packets on a transmit queue of an
> > > Ethernet device.
> > > + *
> > > + * The rte_eth_tx_prepare() function is invoked to prepare
> > > output packets to be
> > > + * transmitted on the output queue *queue_id* of the Ethernet
> > > device designated
> > > + * by its *port_id*.
> > > + * The *nb_pkts* parameter is the number of packets to be
> > > prepared which are
> > > + * supplied in the *tx_pkts* array of *rte_mbuf* structures,
> > > each of them
> > > + * allocated from a pool created with rte_pktmbuf_pool_create().
> > > + * For each packet to send, the rte_eth_tx_prepare() function
> > > performs
> > > + * the following operations:
> > > + *
> > > + * - Check if packet meets devices requirements for tx offloads.
> > > + *
> > > + * - Check limitations about number of segments.
> > > + *
> > > + * - Check additional requirements when debug is enabled.
> > > + *
> > > + * - Update and/or reset required checksums when tx offload is
> > > set for packet.
> > > + *
> > > + * Since this function can modify packet data, provided mbufs
> > > must be safely
> > > + * writable (e.g. modified data cannot be in shared segment).  
> > 
> > I think we will have to remove this limitation in next releases.
> > As we don't know how it could affect the API, I suggest to declare
> > this API EXPERIMENTAL.  
> 
> While I don't really mind to mart it as experimental, I don't really
> understand the reasoning: Why " this function can modify packet data,
> provided mbufs must be safely writable" suddenly becomes a problem?
> That seems like and obvious limitation to me and let say tx_burst()
> has the same one. Second, I don't see how you are going to remove it
> without introducing a heavy performance impact. Konstantin 
> 

About tx_burst(), I don't think we should force the user to provide a
writable mbuf. There are many use cases where passing a clone
already works as of today and it avoids duplicating the mbuf data. For
instance: traffic generator, multicast, bridging/tap, etc...

Moreover, this requirement would be inconsistent with the model you are
proposing in the case of a pipeline:
 - tx_prepare() on core X, may update the data
 - tx_burst() on core Y, should not touch the data to avoid cache misses


Regards,
Olivier
  
Ananyev, Konstantin Dec. 2, 2016, 4:17 p.m. UTC | #13
Hi Olivier,

> -----Original Message-----
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Friday, December 2, 2016 8:24 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Kulasek, TomaszX <tomaszx.kulasek@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
> 
> Hi Konstantin,
> 
> On Fri, 2 Dec 2016 01:06:30 +0000, "Ananyev, Konstantin"
> <konstantin.ananyev@intel.com> wrote:
> > >
> > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > +/**
> > > > + * Process a burst of output packets on a transmit queue of an
> > > > Ethernet device.
> > > > + *
> > > > + * The rte_eth_tx_prepare() function is invoked to prepare
> > > > output packets to be
> > > > + * transmitted on the output queue *queue_id* of the Ethernet
> > > > device designated
> > > > + * by its *port_id*.
> > > > + * The *nb_pkts* parameter is the number of packets to be
> > > > prepared which are
> > > > + * supplied in the *tx_pkts* array of *rte_mbuf* structures,
> > > > each of them
> > > > + * allocated from a pool created with rte_pktmbuf_pool_create().
> > > > + * For each packet to send, the rte_eth_tx_prepare() function
> > > > performs
> > > > + * the following operations:
> > > > + *
> > > > + * - Check if packet meets devices requirements for tx offloads.
> > > > + *
> > > > + * - Check limitations about number of segments.
> > > > + *
> > > > + * - Check additional requirements when debug is enabled.
> > > > + *
> > > > + * - Update and/or reset required checksums when tx offload is
> > > > set for packet.
> > > > + *
> > > > + * Since this function can modify packet data, provided mbufs
> > > > must be safely
> > > > + * writable (e.g. modified data cannot be in shared segment).
> > >
> > > I think we will have to remove this limitation in next releases.
> > > As we don't know how it could affect the API, I suggest to declare
> > > this API EXPERIMENTAL.
> >
> > While I don't really mind to mart it as experimental, I don't really
> > understand the reasoning: Why " this function can modify packet data,
> > provided mbufs must be safely writable" suddenly becomes a problem?
> > That seems like and obvious limitation to me and let say tx_burst()
> > has the same one. Second, I don't see how you are going to remove it
> > without introducing a heavy performance impact. Konstantin
> >
> 
> About tx_burst(), I don't think we should force the user to provide a
> writable mbuf. There are many use cases where passing a clone
> already works as of today and it avoids duplicating the mbuf data. For
> instance: traffic generator, multicast, bridging/tap, etc...
> 
> Moreover, this requirement would be inconsistent with the model you are
> proposing in case of pipeline:
>  - tx_prepare() on core X, may update the data
>  - tx_burst() on core Y, should not touch the data to avoid cache misses
> 

Probably I wasn't very clear in my previous mail.
I am not saying that we should force the user to pass a writable mbuf.
What I am saying is that for tx_burst() the current expectation is that
after an mbuf is handed to tx_burst() the user shouldn't try to modify its buffer contents
until the TX engine is done with the buffer (mbuf_free() is called by the TX function for it).
For tx_prep(), I think, it is the same, though the restrictions are a bit stricter:
the user should not try to read/write the mbuf while tx_prep() is not finished with it.
What puzzles me is why that should be the reason to mark tx_prep() as experimental.
Konstantin
  
Olivier Matz Dec. 8, 2016, 5:24 p.m. UTC | #14
Hi Konstantin,

On Fri, 2 Dec 2016 16:17:51 +0000, "Ananyev, Konstantin"
<konstantin.ananyev@intel.com> wrote:
> Hi Olivier,
> 
> > -----Original Message-----
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Friday, December 2, 2016 8:24 AM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Kulasek, TomaszX
> > <tomaszx.kulasek@intel.com>; dev@dpdk.org Subject: Re: [dpdk-dev]
> > [PATCH v12 1/6] ethdev: add Tx preparation
> > 
> > Hi Konstantin,
> > 
> > On Fri, 2 Dec 2016 01:06:30 +0000, "Ananyev, Konstantin"
> > <konstantin.ananyev@intel.com> wrote:  
> > > >
> > > > 2016-11-23 18:36, Tomasz Kulasek:  
> > > > > +/**
> > > > > + * Process a burst of output packets on a transmit queue of
> > > > > an Ethernet device.
> > > > > + *
> > > > > + * The rte_eth_tx_prepare() function is invoked to prepare
> > > > > output packets to be
> > > > > + * transmitted on the output queue *queue_id* of the Ethernet
> > > > > device designated
> > > > > + * by its *port_id*.
> > > > > + * The *nb_pkts* parameter is the number of packets to be
> > > > > prepared which are
> > > > > + * supplied in the *tx_pkts* array of *rte_mbuf* structures,
> > > > > each of them
> > > > > + * allocated from a pool created with
> > > > > rte_pktmbuf_pool_create().
> > > > > + * For each packet to send, the rte_eth_tx_prepare() function
> > > > > performs
> > > > > + * the following operations:
> > > > > + *
> > > > > + * - Check if packet meets devices requirements for tx
> > > > > offloads.
> > > > > + *
> > > > > + * - Check limitations about number of segments.
> > > > > + *
> > > > > + * - Check additional requirements when debug is enabled.
> > > > > + *
> > > > > + * - Update and/or reset required checksums when tx offload
> > > > > is set for packet.
> > > > > + *
> > > > > + * Since this function can modify packet data, provided mbufs
> > > > > must be safely
> > > > > + * writable (e.g. modified data cannot be in shared
> > > > > segment).  
> > > >
> > > > I think we will have to remove this limitation in next releases.
> > > > As we don't know how it could affect the API, I suggest to
> > > > declare this API EXPERIMENTAL.  
> > >
> > > While I don't really mind to mart it as experimental, I don't
> > > really understand the reasoning: Why " this function can modify
> > > packet data, provided mbufs must be safely writable" suddenly
> > > becomes a problem? That seems like and obvious limitation to me
> > > and let say tx_burst() has the same one. Second, I don't see how
> > > you are going to remove it without introducing a heavy
> > > performance impact. Konstantin 
> > 
> > About tx_burst(), I don't think we should force the user to provide
> > a writable mbuf. There are many use cases where passing a clone
> > already works as of today and it avoids duplicating the mbuf data.
> > For instance: traffic generator, multicast, bridging/tap, etc...
> > 
> > Moreover, this requirement would be inconsistent with the model you
> > are proposing in case of pipeline:
> >  - tx_prepare() on core X, may update the data
> >  - tx_burst() on core Y, should not touch the data to avoid cache
> > misses 
> 
> Probably I wasn't very clear in my previous mail.
> I am not saying that we should force the user to pass a writable mbuf.
> What I am saying that for tx_burst() current expectation is that
> after mbuf is handled to tx_burst() user shouldn't try to modify its
> buffer contents till TX engine is done with the buffer (mbuf_free()
> is called by TX func for it). For tx_prep(), I think, it is the same
> though restrictions are a bit more strict: user should not try to
> read/write to the mbuf while tx_prep() is not finished with it. What
> puzzles me is that why that should be the reason to mark tx_prep() as
> experimental. Konstantin

To be sure we're on the same page, let me reword:

- mbufs passed to tx_prepare() by the application must have their
  headers (l2_len + l3_len + l4_len) writable, because the phdr checksum
  can be replaced. This could be stated explicitly in the API comment
  (see the sketch below).

- mbufs passed to tx_burst() must not be modified by the driver/hw, nor
  by the application.
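An illustrative sketch of the first point (the flag and field names below come from the mbuf API as I understand it, shown only to make the requirement concrete):

	/* the headers covered by l2_len/l3_len/l4_len must be writable,
	 * because tx_prepare() may overwrite the L4 checksum field with the
	 * pseudo-header checksum expected by the hardware */
	m->l2_len = sizeof(struct ether_hdr);
	m->l3_len = sizeof(struct ipv4_hdr);
	m->l4_len = sizeof(struct tcp_hdr);
	m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
	/* from here on, the TCP header of m must not live in a data area
	 * shared with other mbufs (e.g. an indirect/cloned segment) */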


About the API itself, I have one more question. I know you've
already discussed this a bit with Adrien, I don't want to spawn a new
big thread from here ;)

The API provides tx_prepare() to check the packets have the proper
format, and possibly modify them (ex: csum offload / tso) to match hw
requirements. So it does both checks (number of segments) and fixes
(csum/tso). What determines things that should be checked and things
that should be fixed?

The application gets little information from tx_prepare() about what should
be done to make the packet accepted by the hw, and the actions will
probably be different depending on hardware. So could we imagine that
in the future the function also tries to fix the packet? I've seen your
comment saying that it has to be an application decision, so what about
having a parameter saying "fix the packet" or "don't fix it"?

About rte_eth_desc_lim->nb_seg_max and
rte_eth_desc_lim->nb_mtu_seg_max, I'm still quite reserved, especially
for the 2nd one, because I don't see how it can be used by the
application. Well, it does not hurt to have them, but for me it looks a
bit useless.

Last thing, I think this API should become the default in the future.
For instance, it would prevent the application from calculating a phdr csum
that will not be used by the hw. Not calling tx_prepare() would require
the user/application to know exactly the underlying hw and the kind of
packets that are generated. So for me it means we'll need to also update
other examples (other testpmd engines, l2fwd, ...). Do you agree?


Regards,
Olivier
  
Tomasz Kulasek Dec. 9, 2016, 1:25 p.m. UTC | #15
Hi Thomas,

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Friday, December 2, 2016 00:51
> To: Kulasek, TomaszX <tomaszx.kulasek@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> olivier.matz@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
> 
> 2016-12-01 22:31, Kulasek, TomaszX:
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > 2016-12-01 19:20, Kulasek, TomaszX:
> > > > Hi Thomas,
> > > >
> > > > Sorry, I have answered for this question in another thread and I
> > > > missed
> > > about this one. Detailed answer is below.
> > >
> > > Yes you already gave this answer.
> > > And I will continue asking the question until you understand it.
> > >
> > > > > 2016-11-28 11:54, Thomas Monjalon:
> > > > > > Hi,
> > > > > >
> > > > > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > > > > --- a/config/common_base
> > > > > > > +++ b/config/common_base
> > > > > > > @@ -120,6 +120,7 @@ CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
> > > > > > >  CONFIG_RTE_LIBRTE_IEEE1588=n
> > > > > > >  CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
> > > > > > >  CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
> > > > > > > +CONFIG_RTE_ETHDEV_TX_PREPARE=y
> > > > > >
> > > > > > Please, remind me why is there a configuration here.
> > > > > > It should be the responsibility of the application to call
> > > > > > tx_prepare or not. If the application choose to use this new
> > > > > > API but it is disabled, then the packets won't be prepared and
> > > > > > there is
> > > no error code:
> > > > > >
> > > > > > > +#else
> > > > > > > +
> > > > > > > +static inline uint16_t
> > > > > > > +rte_eth_tx_prepare(__rte_unused uint8_t port_id,
> > > > > > > +__rte_unused
> > > > > uint16_t queue_id,
> > > > > > > +               __rte_unused struct rte_mbuf **tx_pkts,
> > > > > > > +uint16_t
> > > > > > > +nb_pkts) {
> > > > > > > +       return nb_pkts;
> > > > > > > +}
> > > > > > > +
> > > > > > > +#endif
> > > > > >
> > > > > > So the application is not aware of the issue and it will not
> > > > > > use any fallback.
> > > >
> > > > tx_prepare mechanism can be turned off by compilation flag (as
> > > > discussed
> > > with Jerin in http://dpdk.org/dev/patchwork/patch/15770/) to provide
> > > real NOOP functionality (e.g. for low-end CPUs, where even
> > > unnecessary memory dereference and check can have significant impact
> on performance).
> > > >
> > > > Jerin observed that on some architectures (e.g. low-end ARM with
> > > embedded NIC), just reading and comparing 'dev->tx_pkt_prepare' may
> > > cause significant performance drop, so he proposed to introduce this
> > > configuration flag to provide real NOOP when tx_prepare
> > > functionality is not required, and can be turned on based on the
> _target_ configuration.
> > > >
> > > > For other cases, when this flag is turned on (by default), and
> > > tx_prepare is not implemented, functional NOOP is used based on
> > > comparison (dev->tx_pkt_prepare == NULL).
> > >
> > > So if the application call this function and if it is disabled, it
> > > simply won't work. Packets won't be prepared, checksum won't be
> computed.
> > >
> > > I give up, I just NACK.
> >
> > It is not to be turned on/off whatever someone wants, but only and only
> for the case, when platform developer knows, that his platform doesn't
> need this callback, so, he may turn off it and then save some performance
> (this option is per target).
> 
> How may he know? There is no comment in the config file, no documentation.
> 
> > For this case, the behavior of tx_prepare will be exactly the same when
> it is turned on or off. If is not the same, there's no sense to turn it
> off. There were long topic, where we've tried to convince you, that it
> should be turned on for all devices.
> 
> Really? You tried to convince me to turn it on?
> No you were trying to convince Jerin.
> I think it is a wrong idea to allow disabling this function.
> I didn't comment in first discussion because Jerin told it was really
> important for small hardware with fixed NIC, and I thought it would be
> implemented in a way the application cannot be misleaded.
> 
> The only solution I see here is to add some comments in the configuration
> file, below the #else and in the doc.
> Have you checked doc/guides/prog_guide/poll_mode_drv.rst?

I can change the name of CONFIG_RTE_ETHDEV_TX_PREPARE=y to something like CONFIG_RTE_ETHDEV_TX_PREPARE_NOOP=n to make it less confusing, and add comments describing why it was introduced and how to use it safely.

I can also remove it altogether if you don't like it.

As for doc/guides/prog_guide/poll_mode_drv.rst, do you mean to add a new section describing this feature?

Tomasz
  
Tomasz Kulasek Dec. 9, 2016, 5:19 p.m. UTC | #16
Hi Olivier,

My 5 cents below:

> -----Original Message-----
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Thursday, December 8, 2016 18:24
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Kulasek, TomaszX
> <tomaszx.kulasek@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
> 
> Hi Konstantin,
> 
> On Fri, 2 Dec 2016 16:17:51 +0000, "Ananyev, Konstantin"
> <konstantin.ananyev@intel.com> wrote:
> > Hi Olivier,
> >
> > > -----Original Message-----
> > > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > > Sent: Friday, December 2, 2016 8:24 AM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Kulasek, TomaszX
> > > <tomaszx.kulasek@intel.com>; dev@dpdk.org Subject: Re: [dpdk-dev]
> > > [PATCH v12 1/6] ethdev: add Tx preparation
> > >
> > > Hi Konstantin,
> > >
> > > On Fri, 2 Dec 2016 01:06:30 +0000, "Ananyev, Konstantin"
> > > <konstantin.ananyev@intel.com> wrote:
> > > > >
> > > > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > > > +/**
> > > > > > + * Process a burst of output packets on a transmit queue of
> > > > > > an Ethernet device.
> > > > > > + *
> > > > > > + * The rte_eth_tx_prepare() function is invoked to prepare
> > > > > > output packets to be
> > > > > > + * transmitted on the output queue *queue_id* of the Ethernet
> > > > > > device designated
> > > > > > + * by its *port_id*.
> > > > > > + * The *nb_pkts* parameter is the number of packets to be
> > > > > > prepared which are
> > > > > > + * supplied in the *tx_pkts* array of *rte_mbuf* structures,
> > > > > > each of them
> > > > > > + * allocated from a pool created with
> > > > > > rte_pktmbuf_pool_create().
> > > > > > + * For each packet to send, the rte_eth_tx_prepare() function
> > > > > > performs
> > > > > > + * the following operations:
> > > > > > + *
> > > > > > + * - Check if packet meets devices requirements for tx
> > > > > > offloads.
> > > > > > + *
> > > > > > + * - Check limitations about number of segments.
> > > > > > + *
> > > > > > + * - Check additional requirements when debug is enabled.
> > > > > > + *
> > > > > > + * - Update and/or reset required checksums when tx offload
> > > > > > is set for packet.
> > > > > > + *
> > > > > > + * Since this function can modify packet data, provided mbufs
> > > > > > must be safely
> > > > > > + * writable (e.g. modified data cannot be in shared
> > > > > > segment).
> > > > >
> > > > > I think we will have to remove this limitation in next releases.
> > > > > As we don't know how it could affect the API, I suggest to
> > > > > declare this API EXPERIMENTAL.
> > > >
> > > > While I don't really mind to mart it as experimental, I don't
> > > > really understand the reasoning: Why " this function can modify
> > > > packet data, provided mbufs must be safely writable" suddenly
> > > > becomes a problem? That seems like and obvious limitation to me
> > > > and let say tx_burst() has the same one. Second, I don't see how
> > > > you are going to remove it without introducing a heavy performance
> > > > impact. Konstantin
> > >
> > > About tx_burst(), I don't think we should force the user to provide
> > > a writable mbuf. There are many use cases where passing a clone
> > > already works as of today and it avoids duplicating the mbuf data.
> > > For instance: traffic generator, multicast, bridging/tap, etc...
> > >
> > > Moreover, this requirement would be inconsistent with the model you
> > > are proposing in case of pipeline:
> > >  - tx_prepare() on core X, may update the data
> > >  - tx_burst() on core Y, should not touch the data to avoid cache
> > > misses
> >
> > Probably I wasn't very clear in my previous mail.
> > I am not saying that we should force the user to pass a writable mbuf.
> > What I am saying that for tx_burst() current expectation is that after
> > mbuf is handled to tx_burst() user shouldn't try to modify its buffer
> > contents till TX engine is done with the buffer (mbuf_free() is called
> > by TX func for it). For tx_prep(), I think, it is the same though
> > restrictions are a bit more strict: user should not try to read/write
> > to the mbuf while tx_prep() is not finished with it. What puzzles me
> > is that why that should be the reason to mark tx_prep() as
> > experimental. Konstantin
> 
> To be sure we're on the same page, let me reword:
> 
> - mbufs passed to tx_prepare() by the application must have their
>   headers (l2_len + l3_len + l4_len) writable because the phdr checksum
>   can be replaced. It could be precised in the api comment.
> 
> - mbufs passed to tx_burst() must not be modified by the driver/hw, nor
>   by the application.
> 
> 
> About the API itself, I have one more question. I know you've already
> discussed this a bit with Adrien, I don't want to spawn a new big thread
> from here ;)
> 
> The API provides tx_prepare() to check the packets have the proper format,
> and possibly modify them (ex: csum offload / tso) to match hw
> requirements. So it does both checks (number of segments) and fixes
> (csum/tso). What determines things that should be checked and things that
> should be fixed?
> 

1) Performance -- we may assume that packets are created according to the common rules (e.g. the application doesn't try to compute an IP checksum for an IPv6 packet, sets all required fields, etc.). For now, additional checks are done only in DEBUG mode.

2) Uncommon requirements, such as the number of segments for TSO/non-TSO, where invalid values can cause unexpected results (or even hang the device on burst); this is critical (at least for ixgbe and i40e) and must be checked.

3) The checksum field must be initialized in a hardware-specific way for it to be computed properly.

4) If a packet is invalid, its content is neither modified nor restored.

> The application gets few information from tx_prepare() about what should
> be done to make the packet accepted by the hw, and the actions will
> probably be different depending on hardware. So could we imagine that in
> the future the function also tries to fix the packet? I've seen your
> comment saying that it has to be an application decision, so what about
> having a parameter saying "fix the packet" or "don't fix it"?
> 

The main question here is how invasive tx_prepare should be with respect to packet data.

If I understand you correctly, you're talking about the situation where tx_prepare will try to fix the packet internally, e.g. linearize data for i40e to meet the number-of-segments requirements, split mbufs when they are too large, maybe provide a full software checksum computation fallback for devices which don't support it, and so on? And then this parameter would indicate "do the required work for me", or "just check, initialize what should be initialized, and if it fails, let me decide"?

The implemented model is quite simple and clear:

1) Validate uncommon requirements which may cause a problem and let the application deal with it.
2) If the packet is valid, initialize the required fields according to the hardware requirements.

In fact, it doesn't fix anything for now, but initializes checksums, and the main reason why it is done here is that it cannot be done in a trivial way in the application itself.

IMHO, we should keep this API as simple as possible and not let it grow in unexpected ways (also for performance reasons).

> About rte_eth_desc_lim->nb_seg_max and
> rte_eth_desc_lim->nb_mtu_seg_max, I'm still quite reserved, especially for
> the 2nd one, because I don't see how it can be used by the application.
> Well, it does not hurt to have them, but for me it looks a bit useless.
> 

* "nb_seg_max": Maximum number of segments per whole packet.
* "nb_mtu_seg_max": Maximum number of segments per one MTU.

The application can use the provided values in the following way (see the sketch below):

* For a non-TSO packet, a single transmit packet may span up to "nb_mtu_seg_max" buffers.
* For a TSO packet, the total number of data descriptors is limited to "nb_seg_max", and each segment within the TSO may span up to "nb_mtu_seg_max" buffers.
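An illustrative check based on those limits (usage is my interpretation of the fields added by this patch; handle_oversized() is a placeholder for the application's own drop/linearize policy):

	struct rte_eth_dev_info dev_info;

	rte_eth_dev_info_get(port_id, &dev_info);

	if (m->ol_flags & PKT_TX_TCP_SEG) {
		/* TSO: total segment count is bounded by nb_seg_max */
		if (m->nb_segs > dev_info.tx_desc_lim.nb_seg_max)
			handle_oversized(m);
	} else {
		/* non-TSO: one packet may span up to nb_mtu_seg_max buffers */
		if (m->nb_segs > dev_info.tx_desc_lim.nb_mtu_seg_max)
			handle_oversized(m);
	}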

> Last thing, I think this API should become the default in the future.
> For instance, it would prevent the application to calculate a phdr csum
> that will not be used by the hw. Not calling tx_prepare() would require
> the user/application to exactly knows the underlying hw and the kind of
> packet that are generated. So for me it means we'll need to also update
> other examples (other testpmd engines, l2fwd, ...). Do you agree?
> 

Most of the sample applications don't use TX offloads at all and don't compute checksums. So the question is: should tx_prepare be used even when it is not required?

> 
> Regards,
> Olivier
> 

Tomasz
  
Ananyev, Konstantin Dec. 12, 2016, 11:51 a.m. UTC | #17
Hi Olivier and Tomasz,

> -----Original Message-----
> From: Kulasek, TomaszX
> Sent: Friday, December 9, 2016 5:19 PM
> To: Olivier Matz <olivier.matz@6wind.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
> 
> Hi Oliver,
> 
> My 5 cents below:
> 
> > -----Original Message-----
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Thursday, December 8, 2016 18:24
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Kulasek, TomaszX
> > <tomaszx.kulasek@intel.com>; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
> >
> > Hi Konstantin,
> >
> > On Fri, 2 Dec 2016 16:17:51 +0000, "Ananyev, Konstantin"
> > <konstantin.ananyev@intel.com> wrote:
> > > Hi Olivier,
> > >
> > > > -----Original Message-----
> > > > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > > > Sent: Friday, December 2, 2016 8:24 AM
> > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > > Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Kulasek, TomaszX
> > > > <tomaszx.kulasek@intel.com>; dev@dpdk.org Subject: Re: [dpdk-dev]
> > > > [PATCH v12 1/6] ethdev: add Tx preparation
> > > >
> > > > Hi Konstantin,
> > > >
> > > > On Fri, 2 Dec 2016 01:06:30 +0000, "Ananyev, Konstantin"
> > > > <konstantin.ananyev@intel.com> wrote:
> > > > > >
> > > > > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > > > > +/**
> > > > > > > + * Process a burst of output packets on a transmit queue of
> > > > > > > an Ethernet device.
> > > > > > > + *
> > > > > > > + * The rte_eth_tx_prepare() function is invoked to prepare
> > > > > > > output packets to be
> > > > > > > + * transmitted on the output queue *queue_id* of the Ethernet
> > > > > > > device designated
> > > > > > > + * by its *port_id*.
> > > > > > > + * The *nb_pkts* parameter is the number of packets to be
> > > > > > > prepared which are
> > > > > > > + * supplied in the *tx_pkts* array of *rte_mbuf* structures,
> > > > > > > each of them
> > > > > > > + * allocated from a pool created with
> > > > > > > rte_pktmbuf_pool_create().
> > > > > > > + * For each packet to send, the rte_eth_tx_prepare() function
> > > > > > > performs
> > > > > > > + * the following operations:
> > > > > > > + *
> > > > > > > + * - Check if packet meets devices requirements for tx
> > > > > > > offloads.
> > > > > > > + *
> > > > > > > + * - Check limitations about number of segments.
> > > > > > > + *
> > > > > > > + * - Check additional requirements when debug is enabled.
> > > > > > > + *
> > > > > > > + * - Update and/or reset required checksums when tx offload
> > > > > > > is set for packet.
> > > > > > > + *
> > > > > > > + * Since this function can modify packet data, provided mbufs
> > > > > > > must be safely
> > > > > > > + * writable (e.g. modified data cannot be in shared
> > > > > > > segment).
> > > > > >
> > > > > > I think we will have to remove this limitation in next releases.
> > > > > > As we don't know how it could affect the API, I suggest to
> > > > > > declare this API EXPERIMENTAL.
> > > > >
> > > > > While I don't really mind to mart it as experimental, I don't
> > > > > really understand the reasoning: Why " this function can modify
> > > > > packet data, provided mbufs must be safely writable" suddenly
> > > > > becomes a problem? That seems like and obvious limitation to me
> > > > > and let say tx_burst() has the same one. Second, I don't see how
> > > > > you are going to remove it without introducing a heavy performance
> > > > > impact. Konstantin
> > > >
> > > > About tx_burst(), I don't think we should force the user to provide
> > > > a writable mbuf. There are many use cases where passing a clone
> > > > already works as of today and it avoids duplicating the mbuf data.
> > > > For instance: traffic generator, multicast, bridging/tap, etc...
> > > >
> > > > Moreover, this requirement would be inconsistent with the model you
> > > > are proposing in case of pipeline:
> > > >  - tx_prepare() on core X, may update the data
> > > >  - tx_burst() on core Y, should not touch the data to avoid cache
> > > > misses
> > >
> > > Probably I wasn't very clear in my previous mail.
> > > I am not saying that we should force the user to pass a writable mbuf.
> > > What I am saying is that for tx_burst() the current expectation is that after
> > > an mbuf is handed to tx_burst() the user shouldn't try to modify its buffer
> > > contents till the TX engine is done with the buffer (mbuf_free() is called
> > > by the TX func for it). For tx_prep(), I think, it is the same though the
> > > restrictions are a bit stricter: the user should not try to read/write
> > > the mbuf while tx_prep() is not finished with it. What puzzles me
> > > is why that should be the reason to mark tx_prep() as
> > > experimental. Konstantin
> >
> > To be sure we're on the same page, let me reword:
> >
> > - mbufs passed to tx_prepare() by the application must have their
> >   headers (l2_len + l3_len + l4_len) writable because the phdr checksum
> >   can be replaced. This could be clarified in the API comment.
> >
> > - mbufs passed to tx_burst() must not be modified by the driver/hw, nor
> >   by the application.

Yes, agree with both.
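
For illustration, here is a minimal sketch of the Tx offload metadata an application
would set before handing the mbuf to tx_prepare(), assuming an IPv4/TCP packet whose
headers sit in the first, writable segment (header sizes and flag choices are just
examples, not a requirement of the API):

#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_tcp.h>
#include <rte_mbuf.h>

static void
app_set_tx_offload_meta(struct rte_mbuf *m)
{
        m->l2_len = sizeof(struct ether_hdr);  /* writable L2 header */
        m->l3_len = sizeof(struct ipv4_hdr);   /* writable L3 header */
        m->l4_len = sizeof(struct tcp_hdr);    /* writable L4 header */
        m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
        /* tx_prepare() may overwrite the TCP checksum field with the
         * pseudo-header checksum, hence the writability requirement. */
}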

> >
> >
> > About the API itself, I have one more question. I know you've already
> > discussed this a bit with Adrien, I don't want to spawn a new big thread
> > from here ;)
> >
> > The API provides tx_prepare() to check the packets have the proper format,
> > and possibly modify them (ex: csum offload / tso) to match hw
> > requirements. So it does both checks (number of segments) and fixes
> > (csum/tso). What determines things that should be checked and things that
> > should be fixed?
> >

Just to echo Tomasz's detailed reply below: I think right now tx_prepare()
doesn't do any fixing.
It just checks that the packet satisfies PMD/HW requirements; if not, it returns
an error without trying to fix anything.
If the packet seems valid, it calculates and fills the L4 cksum to fulfil HW reqs.
About the question of what should be checked - as usual there is a tradeoff between
additional checks and performance.
As Tomasz stated below, right now in non-debug mode we do checks only for things
that can have a critical impact on the HW itself (tx hang, etc.).
All extra checks are implemented only in DEBUG mode.
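
As a usage sketch (the port/queue ids and the drop-on-failure policy below are purely
illustrative), an application following this contract would do something like:

#include <rte_ethdev.h>
#include <rte_errno.h>
#include <rte_mbuf.h>

/* Prepare, then send: tx_prepare() stops at the first packet that does not
 * satisfy PMD/HW requirements; what to do with it is the application's call.
 * Here the offending mbuf is simply dropped. */
static uint16_t
app_send_burst(uint8_t port_id, uint16_t queue_id,
               struct rte_mbuf **pkts, uint16_t nb_pkts)
{
        uint16_t nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);

        if (nb_prep < nb_pkts)
                rte_pktmbuf_free(pkts[nb_prep]);  /* rte_errno tells why */

        return rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);
}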

> 
> 1) Performance -- we may assume that packets are created following the common rules (e.g. the application doesn't try to compute an IP checksum for an IPv6
> packet, sets all required fields, etc.). For now, additional checks are done only in DEBUG mode.
> 
> 2) Uncommon requirements, such as the number of segments for TSO/non-TSO, where invalid values can cause unexpected results (or
> even hang the device on burst), so they are critical (at least for ixgbe and i40e) and must be checked.
> 
> 3) The checksum field must be initialized in a hardware-specific way for it to be computed properly.
> 
> 4) If a packet is invalid, its content is neither modified nor restored.

> 
> > The application gets little information from tx_prepare() about what should
> > be done to make the packet accepted by the hw, and the actions will
> > probably be different depending on the hardware.

That's true.
I am open to suggestions how in future to provide extra information to the upper layer.
Set rte_errno to different values depending on type of error,
OR extra parameter in tx_prepare() that will provide more detailed error information,
OR something else?
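
For the rte_errno option, which already exists, here is a sketch of how the upper layer
could differentiate today (the error values are the ones documented for
rte_eth_tx_prepare() in this patch; the two recovery helpers are hypothetical, named
only for illustration):

#include <errno.h>
#include <rte_ethdev.h>
#include <rte_errno.h>

/* Hypothetical application helpers -- not part of DPDK. */
extern void app_fix_offload_flags(struct rte_mbuf *m);
extern void app_sw_cksum_fallback(struct rte_mbuf *m);

static uint16_t
app_prepare_with_recovery(uint8_t port_id, uint16_t queue_id,
                          struct rte_mbuf **pkts, uint16_t nb_pkts)
{
        uint16_t nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);

        if (nb_prep < nb_pkts) {
                if (rte_errno == -EINVAL)        /* offload flags incorrectly set */
                        app_fix_offload_flags(pkts[nb_prep]);
                else if (rte_errno == -ENOTSUP)  /* offload unsupported by this HW */
                        app_sw_cksum_fallback(pkts[nb_prep]);
        }
        return nb_prep;
}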

> So could we imagine that in
> > the future the function also tries to fix the packet? I've seen your
> > comment saying that it has to be an application decision, so what about
> > having a parameter saying "fix the packet" or "don't fix it"?
> >

While I am not directly opposed to that idea, I am a bit skeptical
that it could be implemented in a useful and effective way.
As I said before, a packet may be fixed in different ways:
coalesce mbufs, split the mbuf chain into multiple ones,
or maybe some combination of these 2 approaches.
Again, it seems a bit ineffective from a performance point of view:
form the packet first, then check and re-adjust it again.
It seems more plausible to have it formed the right way straight away.
Also, it would mean that we'd need to pass to tx_prepare() information
about which mempool to use for new mbufs, and probably some extra information.
That's why I think it might be better to do a proper (re)formatting of the packet
before passing it down to tx_prepare().
It might be some sort of generic helper function in rte_net that would use information
from rte_eth_desc_lim and could be called by the upper layer before tx_prepare();
a possible shape of such a check is sketched below.
But if you think it could be merged into tx_prepare() in some smart and effective way -
sure thing, let's look at the particular implementation ideas you have.
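
A minimal sketch of what such a check could look like, assuming it only reports whether
the caller must reformat the chain (the helper itself is hypothetical and does not
exist in this patch):

#include <errno.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Check one packet against the descriptor limits so the caller knows whether
 * it has to coalesce or split the mbuf chain before calling tx_prepare(). */
static int
app_check_tx_seg_limits(const struct rte_mbuf *m,
                        const struct rte_eth_desc_lim *lim)
{
        if (m->ol_flags & PKT_TX_TCP_SEG)
                /* TSO: the whole chain is limited to nb_seg_max; a stricter
                 * per-MTU check would also walk the chain against tso_segsz. */
                return m->nb_segs <= lim->nb_seg_max ? 0 : -EINVAL;

        /* non-TSO: a single packet may span up to nb_mtu_seg_max buffers. */
        return m->nb_segs <= lim->nb_mtu_seg_max ? 0 : -EINVAL;
}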

> 
> The main question here is how invasive tx_prepare should be with the packet data.
> 
> If I understand you correctly, you're talking about the situation where tx_prepare will try to restore the packet internally, e.g. linearize data
> for i40e to meet the number-of-segments requirements, split mbufs when they are too large, maybe someone will want to have a full
> software checksum computation fallback for devices which don't support it, and so on? And then this parameter will indicate "do
> the required work for me", or "just check, initialize what should be initialized, and if it fails, let me decide"?
> 
> The implemented model is quite simple and clear:
> 
> 1) Validate uncommon requirements which may cause a problem and let the application deal with them.
> 2) If the packet is valid, initialize the required fields according to the hardware requirements.
> 
> In fact, it doesn't fix anything for now, but it initializes checksums, and the main reason this is done here is that it cannot be
> done in a trivial way in the application itself.
> 
> IMHO, we should keep this API as simple as possible and not let it grow in unexpected ways (also for performance reasons).
> 
> > About rte_eth_desc_lim->nb_seg_max and
> > rte_eth_desc_lim->nb_mtu_seg_max, I'm still quite reserved, especially for
> > the 2nd one, because I don't see how it can be used by the application.
> > Well, it does not hurt to have them, but for me it looks a bit useless.
> >
> 
> * "nb_seg_max": Maximum number of segments per whole packet.
> * "nb_mtu_seg_max": Maximum number of segments per one MTU.
> 
> The application can use the provided values in the following way:
> 
> * For a non-TSO packet, a single transmit packet may span up to "nb_mtu_seg_max" buffers.
> * For a TSO packet, the total number of data descriptors is "nb_seg_max", and each segment within the TSO may span up to
> "nb_mtu_seg_max".

I suppose these fields will be useful.
Let's say you are about to form and send packets.
Right now, how would you know what the HW limitations are on the number of segments per packet?
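
A small sketch of where an application would pick these values up, assuming it queries
them once at initialization time (dev_info.tx_desc_lim is the rte_eth_desc_lim this
patch extends):

#include <rte_ethdev.h>

static void
app_read_tx_seg_limits(uint8_t port_id,
                       uint16_t *nb_seg_max, uint16_t *nb_mtu_seg_max)
{
        struct rte_eth_dev_info dev_info;

        rte_eth_dev_info_get(port_id, &dev_info);
        *nb_seg_max = dev_info.tx_desc_lim.nb_seg_max;         /* per whole packet */
        *nb_mtu_seg_max = dev_info.tx_desc_lim.nb_mtu_seg_max; /* per one MTU */
}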

> 
> > Last thing, I think this API should become the default in the future.
> > For instance, it would prevent the application from calculating a phdr csum
> > that will not be used by the hw. Not calling tx_prepare() would require
> > the user/application to know exactly the underlying hw and the kind of
> > packets that are generated. So for me it means we'll need to also update
> > the other examples (other testpmd engines, l2fwd, ...). Do you agree?
> >
> 
> Most of the sample applications don't use TX offloads at all and don't compute checksums. So the question is "should tx_prepare be
> used even if not required?"

I suppose it wouldn't do much harm to add tx_prepare() to the sample apps when appropriate.
In most cases (when the simple TX path is used) it will be a NOP anyway.

Konstantin
  
Thomas Monjalon Dec. 22, 2016, 1:14 p.m. UTC | #18
2016-12-02 00:10, Ananyev, Konstantin:
> I have absolutely no problem to remove the RTE_ETHDEV_TX_PREPARE and associated logic.
> I personally don't use ARM boxes and don't plan to,
> and in theory users can still do conditional compilation at the upper layer, if they want to. 

Yes you're right. The application can avoid calling tx_prepare at all.
No need of an ifdef inside DPDK.
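
A sketch of that upper-layer conditional compilation (APP_USE_TX_PREPARE is an
application-defined switch, not a DPDK config option):

#include <rte_ethdev.h>

static uint16_t
app_xmit(uint8_t port_id, uint16_t queue_id,
         struct rte_mbuf **pkts, uint16_t nb_pkts)
{
        uint16_t nb_prep;

#ifdef APP_USE_TX_PREPARE
        nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);
#else
        nb_prep = nb_pkts;  /* e.g. a target whose only NIC needs no preparation */
#endif
        return rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);
}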
  
Thomas Monjalon Dec. 22, 2016, 1:30 p.m. UTC | #19
2016-12-12 11:51, Ananyev, Konstantin:
> > > The application gets little information from tx_prepare() about what should
> > > be done to make the packet accepted by the hw, and the actions will
> > > probably be different depending on the hardware.
> 
> That's true.
> I am open to suggestions how in future to provide extra information to the upper layer.
> Set rte_errno to different values depending on type of error,
> OR extra parameter in tx_prepare() that will provide more detailed error information,
> OR something else?

That's one of the reasons that gives me a feeling that it is safer
to introduce tx_prepare as an experimental API in 17.02.
So users will know that it can change in the next release.
What do you think?
  
Jerin Jacob Dec. 22, 2016, 1:37 p.m. UTC | #20
On Thu, Dec 22, 2016 at 02:14:45PM +0100, Thomas Monjalon wrote:
> 2016-12-02 00:10, Ananyev, Konstantin:
> > I have absolutely no problem to remove the RTE_ETHDEV_TX_PREPARE and associated logic.
> > I personally don't use ARM boxes and don't plan to,
> > and in theory users can still do conditional compilation at the upper layer, if they want to. 
> 
> Yes you're right. The application can avoid calling tx_prepare at all.

There are applications inside the dpdk repo which will be using tx_prep, so
in that case, IMHO, leave the ifdef inside the DPDK library and disable it by
default, so that, if required, we can disable it in one shot on integrated
controller targets, where the system has only one integrated controller and
that integrated controller does not need tx_prep.


> No need of an ifdef inside DPDK.
  
Ananyev, Konstantin Dec. 22, 2016, 2:11 p.m. UTC | #21
> 
> 2016-12-12 11:51, Ananyev, Konstantin:
> > > > The application gets little information from tx_prepare() about what should
> > > > be done to make the packet accepted by the hw, and the actions will
> > > > probably be different depending on the hardware.
> >
> > That's true.
> > I am open to suggestions how in future to provide extra information to the upper layer.
> > Set rte_errno to different values depending on type of error,
> > OR extra parameter in tx_prepare() that will provide more detailed error information,
> > OR something else?
> 
> That's one of the reasons that gives me a feeling that it is safer
> to introduce tx_prepare as an experimental API in 17.02.
> So users will know that it can change in the next release.
> What do you think?

I think that's a good reason and I am ok with it.
Konstantin
  

Patch

diff --git a/config/common_base b/config/common_base
index 4bff83a..d609a88 100644
--- a/config/common_base
+++ b/config/common_base
@@ -120,6 +120,7 @@  CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
 CONFIG_RTE_LIBRTE_IEEE1588=n
 CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
 CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
+CONFIG_RTE_ETHDEV_TX_PREPARE=y
 
 #
 # Support NIC bypass logic
diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 9678179..4ffc1b3 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -182,6 +182,7 @@ 
 #include <rte_pci.h>
 #include <rte_dev.h>
 #include <rte_devargs.h>
+#include <rte_errno.h>
 #include "rte_ether.h"
 #include "rte_eth_ctrl.h"
 #include "rte_dev_info.h"
@@ -702,6 +703,8 @@  struct rte_eth_desc_lim {
 	uint16_t nb_max;   /**< Max allowed number of descriptors. */
 	uint16_t nb_min;   /**< Min allowed number of descriptors. */
 	uint16_t nb_align; /**< Number of descriptors should be aligned to. */
+	uint16_t nb_seg_max;     /**< Max number of segments per whole packet. */
+	uint16_t nb_mtu_seg_max; /**< Max number of segments per one MTU */
 };
 
 /**
@@ -1191,6 +1194,11 @@  typedef uint16_t (*eth_tx_burst_t)(void *txq,
 				   uint16_t nb_pkts);
 /**< @internal Send output packets on a transmit queue of an Ethernet device. */
 
+typedef uint16_t (*eth_tx_prep_t)(void *txq,
+				   struct rte_mbuf **tx_pkts,
+				   uint16_t nb_pkts);
+/**< @internal Prepare output packets on a transmit queue of an Ethernet device. */
+
 typedef int (*flow_ctrl_get_t)(struct rte_eth_dev *dev,
 			       struct rte_eth_fc_conf *fc_conf);
 /**< @internal Get current flow control parameter on an Ethernet device */
@@ -1625,6 +1633,7 @@  struct rte_eth_rxtx_callback {
 struct rte_eth_dev {
 	eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function. */
 	eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function. */
+	eth_tx_prep_t tx_pkt_prepare; /**< Pointer to PMD transmit prepare function. */
 	struct rte_eth_dev_data *data;  /**< Pointer to device data */
 	const struct eth_driver *driver;/**< Driver for this device */
 	const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD */
@@ -2819,6 +2828,103 @@  int rte_eth_dev_set_vlan_ether_type(uint8_t port_id,
 	return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts);
 }
 
+/**
+ * Process a burst of output packets on a transmit queue of an Ethernet device.
+ *
+ * The rte_eth_tx_prepare() function is invoked to prepare output packets to be
+ * transmitted on the output queue *queue_id* of the Ethernet device designated
+ * by its *port_id*.
+ * The *nb_pkts* parameter is the number of packets to be prepared which are
+ * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
+ * allocated from a pool created with rte_pktmbuf_pool_create().
+ * For each packet to send, the rte_eth_tx_prepare() function performs
+ * the following operations:
+ *
+ * - Check if the packet meets the device's requirements for tx offloads.
+ *
+ * - Check limitations about number of segments.
+ *
+ * - Check additional requirements when debug is enabled.
+ *
+ * - Update and/or reset required checksums when tx offload is set for packet.
+ *
+ * Since this function can modify packet data, provided mbufs must be safely
+ * writable (e.g. modified data cannot be in shared segment).
+ *
+ * The rte_eth_tx_prepare() function returns the number of packets ready to be
+ * sent. A return value equal to *nb_pkts* means that all packets are valid and
+ * ready to be sent, otherwise it stops processing at the first invalid packet and
+ * leaves the rest of the packets untouched.
+ *
+ * When this functionality is not implemented in the driver, all packets are
+ * returned untouched.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *   The value must be a valid port id.
+ * @param queue_id
+ *   The index of the transmit queue through which output packets must be
+ *   sent.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param tx_pkts
+ *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
+ *   which contain the output packets.
+ * @param nb_pkts
+ *   The maximum number of packets to process.
+ * @return
+ *   The number of packets correct and ready to be sent. The return value can be
+ *   less than the value of the *nb_pkts* parameter when some packet doesn't
+ *   meet the device's requirements, with rte_errno set appropriately:
+ *   - -EINVAL: offload flags are not correctly set
+ *   - -ENOTSUP: the offload feature is not supported by the hardware
+ *
+ */
+
+#ifdef RTE_ETHDEV_TX_PREPARE
+
+static inline uint16_t
+rte_eth_tx_prepare(uint8_t port_id, uint16_t queue_id,
+		struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	struct rte_eth_dev *dev;
+
+#ifdef RTE_LIBRTE_ETHDEV_DEBUG
+	if (!rte_eth_dev_is_valid_port(port_id)) {
+		RTE_PMD_DEBUG_TRACE("Invalid TX port_id=%d\n", port_id);
+		rte_errno = -EINVAL;
+		return 0;
+	}
+#endif
+
+	dev = &rte_eth_devices[port_id];
+
+#ifdef RTE_LIBRTE_ETHDEV_DEBUG
+	if (queue_id >= dev->data->nb_tx_queues) {
+		RTE_PMD_DEBUG_TRACE("Invalid TX queue_id=%d\n", queue_id);
+		rte_errno = -EINVAL;
+		return 0;
+	}
+#endif
+
+	if (!dev->tx_pkt_prepare)
+		return nb_pkts;
+
+	return (*dev->tx_pkt_prepare)(dev->data->tx_queues[queue_id],
+			tx_pkts, nb_pkts);
+}
+
+#else
+
+static inline uint16_t
+rte_eth_tx_prepare(__rte_unused uint8_t port_id, __rte_unused uint16_t queue_id,
+		__rte_unused struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	return nb_pkts;
+}
+
+#endif
+
 typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
 		void *userdata);
 
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ead7c6e..39ee5ed 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -283,6 +283,19 @@ 
  */
 #define PKT_TX_OUTER_IPV6    (1ULL << 60)
 
+/**
+ * Bit Mask of all supported packet Tx offload features flags, which can be set
+ * for packet.
+ */
+#define PKT_TX_OFFLOAD_MASK (    \
+		PKT_TX_IP_CKSUM |        \
+		PKT_TX_L4_MASK |         \
+		PKT_TX_OUTER_IP_CKSUM |  \
+		PKT_TX_TCP_SEG |         \
+		PKT_TX_QINQ_PKT |        \
+		PKT_TX_VLAN_PKT |        \
+		PKT_TX_TUNNEL_MASK)
+
 #define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
 
 #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
@@ -1647,6 +1660,57 @@  static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail
 }
 
 /**
+ * Validate general requirements for tx offload in mbuf.
+ *
+ * This function checks correctness and completeness of Tx offload settings.
+ *
+ * @param m
+ *   The packet mbuf to be validated.
+ * @return
+ *   0 if packet is valid
+ */
+static inline int
+rte_validate_tx_offload(const struct rte_mbuf *m)
+{
+	uint64_t ol_flags = m->ol_flags;
+	uint64_t inner_l3_offset = m->l2_len;
+
+	/* Does packet set any of available offloads? */
+	if (!(ol_flags & PKT_TX_OFFLOAD_MASK))
+		return 0;
+
+	if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
+		inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
+
+	/* Headers are fragmented */
+	if (rte_pktmbuf_data_len(m) < inner_l3_offset + m->l3_len + m->l4_len)
+		return -ENOTSUP;
+
+	/* IP checksum can be counted only for IPv4 packet */
+	if ((ol_flags & PKT_TX_IP_CKSUM) && (ol_flags & PKT_TX_IPV6))
+		return -EINVAL;
+
+	/* IP type not set when required */
+	if (ol_flags & (PKT_TX_L4_MASK | PKT_TX_TCP_SEG))
+		if (!(ol_flags & (PKT_TX_IPV4 | PKT_TX_IPV6)))
+			return -EINVAL;
+
+	/* Check requirements for TSO packet */
+	if (ol_flags & PKT_TX_TCP_SEG)
+		if ((m->tso_segsz == 0) ||
+				((ol_flags & PKT_TX_IPV4) &&
+				!(ol_flags & PKT_TX_IP_CKSUM)))
+			return -EINVAL;
+
+	/* PKT_TX_OUTER_IP_CKSUM set for non outer IPv4 packet. */
+	if ((ol_flags & PKT_TX_OUTER_IP_CKSUM) &&
+			!(ol_flags & PKT_TX_OUTER_IPV4))
+		return -EINVAL;
+
+	return 0;
+}
+
+/**
  * Dump an mbuf structure to a file.
  *
  * Dump all fields for the given packet mbuf and all its associated
diff --git a/lib/librte_net/rte_net.h b/lib/librte_net/rte_net.h
index d4156ae..85f356d 100644
--- a/lib/librte_net/rte_net.h
+++ b/lib/librte_net/rte_net.h
@@ -38,6 +38,11 @@ 
 extern "C" {
 #endif
 
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_tcp.h>
+#include <rte_sctp.h>
+
 /**
  * Structure containing header lengths associated to a packet, filled
  * by rte_net_get_ptype().
@@ -86,6 +91,86 @@  struct rte_net_hdr_lens {
 uint32_t rte_net_get_ptype(const struct rte_mbuf *m,
 	struct rte_net_hdr_lens *hdr_lens, uint32_t layers);
 
+/**
+ * Prepare pseudo header checksum
+ *
+ * This function prepares pseudo header checksum for TSO and non-TSO tcp/udp in
+ * provided mbufs packet data.
+ *
+ * - for non-TSO tcp/udp packets full pseudo-header checksum is counted and set
+ *   in packet data,
+ * - for TSO the IP payload length is not included in pseudo header.
+ *
+ * This function expects that used headers are in the first data segment of
+ * mbuf, are not fragmented and can be safely modified.
+ *
+ * @param m
+ *   The packet mbuf to be fixed.
+ * @return
+ *   0 if checksum is initialized properly
+ */
+static inline int
+rte_net_intel_cksum_prepare(struct rte_mbuf *m)
+{
+	struct ipv4_hdr *ipv4_hdr;
+	struct ipv6_hdr *ipv6_hdr;
+	struct tcp_hdr *tcp_hdr;
+	struct udp_hdr *udp_hdr;
+	uint64_t ol_flags = m->ol_flags;
+	uint64_t inner_l3_offset = m->l2_len;
+
+	if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
+		inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
+
+	if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
+		if (ol_flags & PKT_TX_IPV4) {
+			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
+					inner_l3_offset);
+
+			if (ol_flags & PKT_TX_IP_CKSUM)
+				ipv4_hdr->hdr_checksum = 0;
+
+			udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr +
+					m->l3_len);
+			udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr,
+					ol_flags);
+		} else {
+			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
+					inner_l3_offset);
+			/* non-TSO udp */
+			udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
+					inner_l3_offset + m->l3_len);
+			udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr,
+					ol_flags);
+		}
+	} else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
+			(ol_flags & PKT_TX_TCP_SEG)) {
+		if (ol_flags & PKT_TX_IPV4) {
+			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
+					inner_l3_offset);
+
+			if (ol_flags & PKT_TX_IP_CKSUM)
+				ipv4_hdr->hdr_checksum = 0;
+
+			/* non-TSO tcp or TSO */
+			tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr +
+					m->l3_len);
+			tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr,
+					ol_flags);
+		} else {
+			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
+					inner_l3_offset);
+			/* non-TSO tcp or TSO */
+			tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
+					inner_l3_offset + m->l3_len);
+			tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr,
+					ol_flags);
+		}
+	}
+
+	return 0;
+}
+
 #ifdef __cplusplus
 }
 #endif