[dpdk-dev,v2,3/4] examples: example showing use of callbacks.

Message ID 1423841989-9090-4-git-send-email-john.mcnamara@intel.com (mailing list archive)
State Superseded, archived

Commit Message

John McNamara Feb. 13, 2015, 3:39 p.m. UTC
  From: Richardson, Bruce <bruce.richardson@intel.com>

Example showing how callbacks can be used to insert a timestamp
into each packet on RX. On TX the timestamp is used to calculate
the packet latency through the app, in cycles.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 examples/rxtx_callbacks/Makefile   |   57 +++++++++
 examples/rxtx_callbacks/basicfwd.c |  222 ++++++++++++++++++++++++++++++++++++
 examples/rxtx_callbacks/basicfwd.h |   46 ++++++++
 3 files changed, 325 insertions(+), 0 deletions(-)
 create mode 100644 examples/rxtx_callbacks/Makefile
 create mode 100644 examples/rxtx_callbacks/basicfwd.c
 create mode 100644 examples/rxtx_callbacks/basicfwd.h
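
The idea in the commit message can be sketched in plain C without the patch body. This is an illustration only, not the code from basicfwd.c: `struct pkt`, `fake_cycles`, `now_cycles` and both callback names are hypothetical stand-ins for the mbuf, the CPU cycle counter and the registered RX/TX callbacks.

```c
#include <stdint.h>

/* Hypothetical stand-in for an mbuf: only the field this sketch needs. */
struct pkt {
	uint64_t timestamp;	/* cycle count recorded at RX */
};

/* Running totals kept by the TX-side callback. */
struct latency_stats {
	uint64_t total_cycles;
	uint64_t total_pkts;
};

/* Fake cycle counter so the sketch is deterministic; a real application
 * would read the CPU timestamp counter instead (assumption). */
static uint64_t fake_cycles;

static uint64_t now_cycles(void)
{
	return fake_cycles;
}

/* RX callback: stamp every packet in the burst with the current cycle count. */
static uint16_t add_timestamps(struct pkt *pkts[], uint16_t nb, void *arg)
{
	const uint64_t now = now_cycles();
	uint16_t i;

	(void)arg;	/* no per-callback state needed on RX */
	for (i = 0; i < nb; i++)
		pkts[i]->timestamp = now;
	return nb;
}

/* TX callback: accumulate the cycles each packet spent inside the app. */
static uint16_t calc_latency(struct pkt *pkts[], uint16_t nb, void *arg)
{
	struct latency_stats *stats = arg;
	const uint64_t now = now_cycles();
	uint16_t i;

	for (i = 0; i < nb; i++)
		stats->total_cycles += now - pkts[i]->timestamp;
	stats->total_pkts += nb;
	return nb;
}
```

Dividing `total_cycles` by `total_pkts` gives the average latency in cycles, which is the figure the commit message refers to.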
  

Comments

Thomas Monjalon Feb. 13, 2015, 4:02 p.m. UTC | #1
It appears you copy-pasted from an old example.
Please send something up to date.

> +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.

Old

> +#ifdef RTE_EXEC_ENV_BAREMETAL
> +#define MAIN _main
> +#else
> +#define MAIN main
> +#endif

There is no bare metal anymore.
  
Olivier Matz Feb. 16, 2015, 2:33 p.m. UTC | #2
Hi John,

On 02/13/2015 04:39 PM, John McNamara wrote:
> From: Richardson, Bruce <bruce.richardson@intel.com>
> 
> Example showing how callbacks can be used to insert a timestamp
> into each packet on RX. On TX the timestamp is used to calculate
> the packet latency through the app, in cycles.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>


I'm looking at the example and I don't understand what the advantage is
of having callbacks in the ethdev layer, given that the application can
do the same job with a standard function call.

What is the advantage of having callbacks compared to:


for (port = 0; port < nb_ports; port++) {
	struct rte_mbuf *bufs[BURST_SIZE];
	const uint16_t nb_rx = rte_eth_rx_burst(port, 0,
			bufs, BURST_SIZE);
	if (unlikely(nb_rx == 0))
		continue;
	add_timestamp(bufs, nb_rx);

	const uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, 0,
			bufs, nb_rx);
	calc_latency(bufs, nb_tx);

	if (unlikely(nb_tx < nb_rx)) {
		uint16_t buf;
		for (buf = nb_tx; buf < nb_rx; buf++)
			rte_pktmbuf_free(bufs[buf]);
	}
}


To me, doing it like the code above has several advantages:

- code is more readable: the callback is explicitly invoked, so there is
  no risk of forgetting it
- code is faster: the function calls can be inlined by the compiler
- error cases in the callback function are easier to handle, as the error
  code is accessible to the application
- there is no need to add code to the ethdev API to do this
- if the application does not want to use callbacks (most applications,
  I suppose), there is no performance impact

Regards,
Olivier
  
Bruce Richardson Feb. 16, 2015, 3:16 p.m. UTC | #3
On Mon, Feb 16, 2015 at 03:33:40PM +0100, Olivier MATZ wrote:
> Hi John,
> 
> On 02/13/2015 04:39 PM, John McNamara wrote:
> > From: Richardson, Bruce <bruce.richardson@intel.com>
> > 
> > Example showing how callbacks can be used to insert a timestamp
> > into each packet on RX. On TX the timestamp is used to calculate
> > the packet latency through the app, in cycles.
> > 
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> 
> I'm looking at the example and I don't understand what is the advantage
> of having callbacks in ethdev layer, knowing that the application can
> do the same job by a standard function call.
> 
> What is the advantage of having callbacks compared to:
> 
> 
> for (port = 0; port < nb_ports; port++) {
> 	struct rte_mbuf *bufs[BURST_SIZE];
> 	const uint16_t nb_rx = rte_eth_rx_burst(port, 0,
> 			bufs, BURST_SIZE);
> 	if (unlikely(nb_rx == 0))
> 		continue;
> 	add_timestamp(bufs, nb_rx);
> 
> 	const uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, 0,
> 			bufs, nb_rx);
> 	calc_latency(bufs, nb_tx);
> 
> 	if (unlikely(nb_tx < nb_rx)) {
> 		uint16_t buf;
> 		for (buf = nb_tx; buf < nb_rx; buf++)
> 			rte_pktmbuf_free(bufs[buf]);
> 	}
> }
> 
> 
> To me, doing like the code above has several advantages:
> 
> - code is more readable: the callback is explicitly invoked, so there is
>   no risk to forget it
> - code is faster: the functions calls can be inlined by the compiler
> - easier to handle error cases in the callback function as the error
>   code is accessible to the application
> - there is no need to add code in ethdev api to do this
> - if the application does not want to use callbacks (I suppose most
>   applications), it won't have any performance impact
> 
> Regards,
> Olivier

In this specific instance, given that the application does little else, there
is no real advantage to using the callbacks - it's just a simple example
of how they can be used.

Where callbacks are really designed to be useful is in extending or augmenting
hardware capabilities. Take sequence numbers, to use the most trivial
example: an application could be written to take advantage of sequence
numbers written to packets by the hardware which received them. However, if such
an application were to be used with a NIC which does not provide sequence numbering
capability (for example, anything using the ixgbe driver), the application writer has
two choices. One is to modify the application code to check each packet for
a sequence number in the data path, and add it there post-rx. The other is
to check the NIC capabilities at initialization time, and add a callback then
if the hardware does not support sequence numbering. In the latter case,
the main packet processing body of the application can be written as though
hardware always has sequence numbering capability, safe in the knowledge that
for any hardware not supporting it, a software fallback registered at
initialization time will fill the gap.
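
A minimal sketch of that second choice, in self-contained C. Everything here is a hypothetical model rather than the ethdev API: `struct port`, `CAP_SEQ_NUM`, `port_init` and `port_rx_burst` are assumed names, and the "hardware" sequence counter is simulated in software.

```c
#include <stddef.h>
#include <stdint.h>

#define CAP_SEQ_NUM (1u << 0)	/* hypothetical capability bit */

struct pkt {
	uint32_t seq;
};

/* Assumed callback shape: process a burst, return the packet count. */
typedef uint16_t (*rx_cb)(struct pkt *pkts[], uint16_t nb, void *arg);

struct port {
	uint32_t hw_caps;	/* capabilities reported by the NIC */
	uint32_t hw_next_seq;	/* models the counter inside a capable NIC */
	rx_cb cb;		/* optional post-RX callback */
	void *cb_arg;
};

/* Software fallback: number packets when the NIC cannot. */
static uint16_t sw_seq_cb(struct pkt *pkts[], uint16_t nb, void *arg)
{
	uint32_t *next_seq = arg;
	uint16_t i;

	for (i = 0; i < nb; i++)
		pkts[i]->seq = (*next_seq)++;
	return nb;
}

/* Initialization: register the fallback only if the hardware lacks it. */
static void port_init(struct port *p, uint32_t *sw_counter)
{
	p->cb = NULL;
	p->cb_arg = NULL;
	if ((p->hw_caps & CAP_SEQ_NUM) == 0) {
		p->cb = sw_seq_cb;
		p->cb_arg = sw_counter;
	}
}

/* Receive path: the caller never checks capabilities itself. */
static uint16_t port_rx_burst(struct port *p, struct pkt *pkts[], uint16_t nb)
{
	uint16_t i;

	if (p->hw_caps & CAP_SEQ_NUM)	/* simulated hardware numbering */
		for (i = 0; i < nb; i++)
			pkts[i]->seq = p->hw_next_seq++;
	if (p->cb != NULL)
		nb = p->cb(pkts, nb, p->cb_arg);
	return nb;
}
```

After `port_init()`, code calling `port_rx_burst()` can read `seq` on every packet regardless of which kind of port delivered it.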

By the same token, we could also look to extend hardware capabilities. For
different filtering or hashing capabilities, the limits in hardware can be
far lower than what we need to use in software. Again, callbacks will
allow the data path to be written in a way that is oblivious to the underlying
hardware limits, because software will transparently fill in the gaps.

Hope this makes the use case clear.

Regards,
/Bruce
  
Thomas Monjalon Feb. 16, 2015, 5:34 p.m. UTC | #4
2015-02-16 15:16, Bruce Richardson:
> On Mon, Feb 16, 2015 at 03:33:40PM +0100, Olivier MATZ wrote:
> > Hi John,
> > 
> > On 02/13/2015 04:39 PM, John McNamara wrote:
> > > From: Richardson, Bruce <bruce.richardson@intel.com>
> > > 
> > > Example showing how callbacks can be used to insert a timestamp
> > > into each packet on RX. On TX the timestamp is used to calculate
> > > the packet latency through the app, in cycles.
> > > 
> > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > 
> > 
> > I'm looking at the example and I don't understand what is the advantage
> > of having callbacks in ethdev layer, knowing that the application can
> > do the same job by a standard function call.
> > 
> > What is the advantage of having callbacks compared to:
> > 
> > 
> > for (port = 0; port < nb_ports; port++) {
> > 	struct rte_mbuf *bufs[BURST_SIZE];
> > 	const uint16_t nb_rx = rte_eth_rx_burst(port, 0,
> > 			bufs, BURST_SIZE);
> > 	if (unlikely(nb_rx == 0))
> > 		continue;
> > 	add_timestamp(bufs, nb_rx);
> > 
> > 	const uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, 0,
> > 			bufs, nb_rx);
> > 	calc_latency(bufs, nb_tx);
> > 
> > 	if (unlikely(nb_tx < nb_rx)) {
> > 		uint16_t buf;
> > 		for (buf = nb_tx; buf < nb_rx; buf++)
> > 			rte_pktmbuf_free(bufs[buf]);
> > 	}
> > }
> > 
> > 
> > To me, doing like the code above has several advantages:
> > 
> > - code is more readable: the callback is explicitly invoked, so there is
> >   no risk to forget it
> > - code is faster: the functions calls can be inlined by the compiler
> > - easier to handle error cases in the callback function as the error
> >   code is accessible to the application
> > - there is no need to add code in ethdev api to do this
> > - if the application does not want to use callbacks (I suppose most
> >   applications), it won't have any performance impact
> > 
> > Regards,
> > Olivier
> 
> In this specific instance, given that the application does little else, there
> is no real advantage to using the callbacks - it's just to have a simple example
> of how they can be used.
> 
> Where callbacks are really designed to be useful, is for extending or augmenting
> hardware capabilities. Taking the example of sequence numbers - to use the most
> trivial example - an application could be written to take advantage of sequence
> numbers written to packets by the hardware which received them. However, if such
> an application was to be used with a NIC which does not provide sequence numbering
> capability, for example, anything using ixgbe driver, the application writer has
> two choices - either modify his application code to check each packet for
> a sequence number in the data path, and add it there post-rx, or alternatively,
> to check the NIC capabilities at initialization time, and add a callback there
> at initialization, if the hardware does not support it. In the latter case,
> the main packet processing body of the application can be written as though
> hardware always has sequence numbering capability, safe in the knowledge that
> any hardware not supporting it will be back-filled by a software fallback at 
> initialization-time.
> 
> By the same token, we could also look to extend hardware capabilities. For
> different filtering or hashing capabilities, there can be limits in hardware
> which are far less than what we need to use in software. Again, callbacks will
> allow the data path to be written in a way that is oblivious to the underlying
> hardware limits, because software will transparently fill in the gaps.
> 
> Hope this makes the use case clear.

After thinking more about these callbacks, I realize they won't
help, as Olivier said.

With callback,
1/ application checks device capability
2/ application provides hardware emulation as DPDK callback
3/ application forgets previous steps
4/ application calls DPDK Rx
5/ DPDK calls callback (without call optimization)

Without callback,
1/ application checks device capability
2/ application provides hardware emulation as internal function
3/ application sets an internal device-flag to enable this function
4/ application calls DPDK Rx
5/ application calls the hardware emulation if flag is set

So the only difference is that the device information is kept in
the application instead of being stored as a function pointer in the
DPDK struct.
You can also be faster with this approach: at initialization time,
you can check whether your NIC supports the feature and use a specific
mainloop that either adds the sequence number or does not, without any
runtime test.
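
The "specific mainloop" variant might look roughly like the sketch below. All names (`CAP_SEQ_NUM`, `struct app_ctx`, `select_burst_fn`, the two handlers) are hypothetical; each handler stands in for the body of one whole main loop, and the selection happens once at initialization.

```c
#include <stdint.h>

#define CAP_SEQ_NUM (1u << 0)	/* hypothetical capability bit */

struct pkt {
	uint32_t seq;
};

struct app_ctx {
	uint32_t next_seq;	/* software sequence counter */
	uint32_t handled;	/* packets seen by the loop body */
};

typedef void (*burst_fn)(struct app_ctx *ctx, struct pkt *pkts[], uint16_t nb);

/* Loop body for NICs that already wrote the sequence number. */
static void handle_burst_hw_seq(struct app_ctx *ctx, struct pkt *pkts[],
		uint16_t nb)
{
	(void)pkts;	/* nothing to patch up */
	ctx->handled += nb;
}

/* Loop body for NICs that did not: add the number in software first. */
static void handle_burst_sw_seq(struct app_ctx *ctx, struct pkt *pkts[],
		uint16_t nb)
{
	uint16_t i;

	for (i = 0; i < nb; i++)
		pkts[i]->seq = ctx->next_seq++;
	ctx->handled += nb;
}

/* Called once at initialization; no capability test remains per burst. */
static burst_fn select_burst_fn(uint32_t hw_caps)
{
	return (hw_caps & CAP_SEQ_NUM) ? handle_burst_hw_seq
				       : handle_burst_sw_seq;
}
```

The cost of this design is the one raised later in the thread: every capability combination needs its own loop body.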

A callback could be justified for asynchronous events, or when
doing specific processing in the middle of the driver, for instance
when freeing an mbuf. But in this case it is exactly equivalent to doing
the processing in the application after Rx (or before Tx).
  
Doherty, Declan Feb. 17, 2015, 12:17 p.m. UTC | #5
On 16/02/15 17:34, Thomas Monjalon wrote:
> 2015-02-16 15:16, Bruce Richardson:
>> On Mon, Feb 16, 2015 at 03:33:40PM +0100, Olivier MATZ wrote:
>>> Hi John,
>>>
>>> On 02/13/2015 04:39 PM, John McNamara wrote:
>>>> From: Richardson, Bruce <bruce.richardson@intel.com>
>>>>
>>>> Example showing how callbacks can be used to insert a timestamp
>>>> into each packet on RX. On TX the timestamp is used to calculate
>>>> the packet latency through the app, in cycles.
>>>>
>>>> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>>>
>>>
>>> I'm looking at the example and I don't understand what is the advantage
>>> of having callbacks in ethdev layer, knowing that the application can
>>> do the same job by a standard function call.
>>>
>>> What is the advantage of having callbacks compared to:
>>>
>>>
>>> for (port = 0; port < nb_ports; port++) {
>>> 	struct rte_mbuf *bufs[BURST_SIZE];
>>> 	const uint16_t nb_rx = rte_eth_rx_burst(port, 0,
>>> 			bufs, BURST_SIZE);
>>> 	if (unlikely(nb_rx == 0))
>>> 		continue;
>>> 	add_timestamp(bufs, nb_rx);
>>>
>>> 	const uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, 0,
>>> 			bufs, nb_rx);
>>> 	calc_latency(bufs, nb_tx);
>>>
>>> 	if (unlikely(nb_tx < nb_rx)) {
>>> 		uint16_t buf;
>>> 		for (buf = nb_tx; buf < nb_rx; buf++)
>>> 			rte_pktmbuf_free(bufs[buf]);
>>> 	}
>>> }
>>>
>>>
>>> To me, doing like the code above has several advantages:
>>>
>>> - code is more readable: the callback is explicitly invoked, so there is
>>>    no risk to forget it
>>> - code is faster: the functions calls can be inlined by the compiler
>>> - easier to handle error cases in the callback function as the error
>>>    code is accessible to the application
>>> - there is no need to add code in ethdev api to do this
>>> - if the application does not want to use callbacks (I suppose most
>>>    applications), it won't have any performance impact
>>>
>>> Regards,
>>> Olivier
>>
>> In this specific instance, given that the application does little else, there
>> is no real advantage to using the callbacks - it's just to have a simple example
>> of how they can be used.
>>
>> Where callbacks are really designed to be useful, is for extending or augmenting
>> hardware capabilities. Taking the example of sequence numbers - to use the most
>> trivial example - an application could be written to take advantage of sequence
>> numbers written to packets by the hardware which received them. However, if such
>> an application was to be used with a NIC which does not provide sequence numbering
>> capability, for example, anything using ixgbe driver, the application writer has
>> two choices - either modify his application code to check each packet for
>> a sequence number in the data path, and add it there post-rx, or alternatively,
>> to check the NIC capabilities at initialization time, and add a callback there
>> at initialization, if the hardware does not support it. In the latter case,
>> the main packet processing body of the application can be written as though
>> hardware always has sequence numbering capability, safe in the knowledge that
>> any hardware not supporting it will be back-filled by a software fallback at
>> initialization-time.
>>
>> By the same token, we could also look to extend hardware capabilities. For
>> different filtering or hashing capabilities, there can be limits in hardware
>> which are far less than what we need to use in software. Again, callbacks will
>> allow the data path to be written in a way that is oblivious to the underlying
>> hardware limits, because software will transparently fill in the gaps.
>>
>> Hope this makes the use case clear.
>
> After thinking more about these callbacks, I realize these callbacks won't
> help, as Olivier said.
>
> With callback,
> 1/ application checks device capability
> 2/ application provides hardware emulation as DPDK callback
> 3/ application forgets previous steps
> 4/ application calls DPDK Rx
> 5/ DPDK calls callback (without calling optimization)
>
> Without callback,
> 1/ application checks device capability
> 2/ application provides hardware emulation as internal function
> 3/ application set an internal device-flag to enable this function
> 4/ application calls DPDK Rx
> 5/ application calls the hardware emulation if flag is set
>
> So the only difference is to keep persistent the device information in
> the application instead of storing it as a function pointer in the
> DPDK struct.
> You can also be faster with this approach: at initialization time,
> you can check that your NIC supports the feature and use a specific
> mainloop that adds or not the sequence number without any runtime
> test.
>
> A callback could be justified for asynchronous events, or when
> doing specific processing in the middle of the driver, for instance
> when freeing a mbuf. But in this case it's exactly similar to do
> the processing in the application after Rx (or before Tx).
>

I believe that the introduction of callbacks to the ethdev layer will be
required for live migration.

For example, take the scenario where we have two ports bonded together in
active backup mode, the primary slave being a hw port and the other
slave a virtio port. In normal operation it would be desirable to
leverage the available hw offload capabilities for maximum performance,
but for these two devices to be bonded together, both slave devices
must support the same set of offload features. On a planned or
unplanned failover, the backup slave must provide the same offloads as
the primary device. Currently the offloads supported are the lowest
common denominator of the offload features of all slave devices, but
obviously this isn't desirable.


I think that we could extend the bonding library API to take a set of
desired offloads as input parameters. Then, during the addition of slaves,
we would interrogate the supported hw offloads available, enable the
desired ones, and register callbacks to implement the offloads which
the slave device does not support in hw. This would remove the need for
the user application to have any knowledge of the underlying slave
offload configuration. It would guarantee that the offloads requested
happen irrespective of which slave is in use, and allow migration of a
VM transparently to what is happening in the ethdev layer.
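
The split between hardware offloads and callback-implemented offloads proposed here can be sketched as a pair of bit masks. This is a hypothetical model, not the bonding library API: the `OFFLOAD_*` bits, `struct slave`, `slave_add` and `slave_provides` are all assumed names.

```c
#include <stdint.h>

/* Hypothetical offload bits, not the real ethdev flag names. */
#define OFFLOAD_IP_CKSUM   (1u << 0)
#define OFFLOAD_VLAN_STRIP (1u << 1)
#define OFFLOAD_TSO        (1u << 2)

struct slave {
	uint32_t hw_offloads;	/* what the device can do in hardware */
	uint32_t hw_enabled;	/* desired offloads switched on in hardware */
	uint32_t sw_fallback;	/* desired offloads left to software callbacks */
};

/* On slave addition: enable what the hardware has, and record what a
 * software callback would have to provide for the rest. */
static void slave_add(struct slave *s, uint32_t desired)
{
	s->hw_enabled = desired & s->hw_offloads;
	s->sw_fallback = desired & ~s->hw_offloads;
}

/* Nonzero if hardware plus callbacks together cover the desired set. */
static int slave_provides(const struct slave *s, uint32_t desired)
{
	return ((s->hw_enabled | s->sw_fallback) & desired) == desired;
}
```

With this split every slave exposes the same desired set to the application, instead of the lowest common denominator of the hardware.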

Declan
  
Bruce Richardson Feb. 17, 2015, 12:25 p.m. UTC | #6
On Mon, Feb 16, 2015 at 06:34:37PM +0100, Thomas Monjalon wrote:
> 2015-02-16 15:16, Bruce Richardson:
> > On Mon, Feb 16, 2015 at 03:33:40PM +0100, Olivier MATZ wrote:
> > > Hi John,
> > > 
> > > On 02/13/2015 04:39 PM, John McNamara wrote:
> > > > From: Richardson, Bruce <bruce.richardson@intel.com>
> > > > 
> > > > Example showing how callbacks can be used to insert a timestamp
> > > > into each packet on RX. On TX the timestamp is used to calculate
> > > > the packet latency through the app, in cycles.
> > > > 
> > > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > 
> > > 
> > > I'm looking at the example and I don't understand what is the advantage
> > > of having callbacks in ethdev layer, knowing that the application can
> > > do the same job by a standard function call.
> > > 
> > > What is the advantage of having callbacks compared to:
> > > 
> > > 
> > > for (port = 0; port < nb_ports; port++) {
> > > 	struct rte_mbuf *bufs[BURST_SIZE];
> > > 	const uint16_t nb_rx = rte_eth_rx_burst(port, 0,
> > > 			bufs, BURST_SIZE);
> > > 	if (unlikely(nb_rx == 0))
> > > 		continue;
> > > 	add_timestamp(bufs, nb_rx);
> > > 
> > > 	const uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, 0,
> > > 			bufs, nb_rx);
> > > 	calc_latency(bufs, nb_tx);
> > > 
> > > 	if (unlikely(nb_tx < nb_rx)) {
> > > 		uint16_t buf;
> > > 		for (buf = nb_tx; buf < nb_rx; buf++)
> > > 			rte_pktmbuf_free(bufs[buf]);
> > > 	}
> > > }
> > > 
> > > 
> > > To me, doing like the code above has several advantages:
> > > 
> > > - code is more readable: the callback is explicitly invoked, so there is
> > >   no risk to forget it
> > > - code is faster: the functions calls can be inlined by the compiler
> > > - easier to handle error cases in the callback function as the error
> > >   code is accessible to the application
> > > - there is no need to add code in ethdev api to do this
> > > - if the application does not want to use callbacks (I suppose most
> > >   applications), it won't have any performance impact
> > > 
> > > Regards,
> > > Olivier
> > 
> > In this specific instance, given that the application does little else, there
> > is no real advantage to using the callbacks - it's just to have a simple example
> > of how they can be used.
> > 
> > Where callbacks are really designed to be useful, is for extending or augmenting
> > hardware capabilities. Taking the example of sequence numbers - to use the most
> > trivial example - an application could be written to take advantage of sequence
> > numbers written to packets by the hardware which received them. However, if such
> > an application was to be used with a NIC which does not provide sequence numbering
> > capability, for example, anything using ixgbe driver, the application writer has
> > two choices - either modify his application code to check each packet for
> > a sequence number in the data path, and add it there post-rx, or alternatively,
> > to check the NIC capabilities at initialization time, and add a callback there
> > at initialization, if the hardware does not support it. In the latter case,
> > the main packet processing body of the application can be written as though
> > hardware always has sequence numbering capability, safe in the knowledge that
> > any hardware not supporting it will be back-filled by a software fallback at 
> > initialization-time.
> > 
> > By the same token, we could also look to extend hardware capabilities. For
> > different filtering or hashing capabilities, there can be limits in hardware
> > which are far less than what we need to use in software. Again, callbacks will
> > allow the data path to be written in a way that is oblivious to the underlying
> > hardware limits, because software will transparently fill in the gaps.
> > 
> > Hope this makes the use case clear.
> 
> After thinking more about these callbacks, I realize these callbacks won't
> help, as Olivier said.
> 
> With callback,
> 1/ application checks device capability
> 2/ application provides hardware emulation as DPDK callback
> 3/ application forgets previous steps
> 4/ application calls DPDK Rx
> 5/ DPDK calls callback (without calling optimization)
> 
> Without callback,
> 1/ application checks device capability
> 2/ application provides hardware emulation as internal function
> 3/ application set an internal device-flag to enable this function
> 4/ application calls DPDK Rx
> 5/ application calls the hardware emulation if flag is set
> 
> So the only difference is to keep persistent the device information in
> the application instead of storing it as a function pointer in the
> DPDK struct.
> You can also be faster with this approach: at initialization time,
> you can check that your NIC supports the feature and use a specific
> mainloop that adds or not the sequence number without any runtime
> test.

That assumes that all NICs on your system are equal. It also assumes
that you only have a single point in your application where you call RX or
TX burst. In the case where you have a couple of different NICs on the system,
or where you want to write an application to take advantage of capabilities of
different NICs, the ability to resolve all these differences at initialization
time is useful. The main packet handling code can be written with just the
processing of packets in mind, rather than having to have a set of branches
after each RX burst call, or before each TX burst call, to "smooth out" the
different NIC capabilities.

As for the option of maintaining different main loops for different NICs with
different capabilities - that sounds like a maintenance nightmare to
me, due to duplicated code! Callbacks are a far cleaner solution than that IMHO.

/Bruce

> 
> A callback could be justified for asynchronous events, or when
> doing specific processing in the middle of the driver, for instance
> when freeing a mbuf. But in this case it's exactly similar to do
> the processing in the application after Rx (or before Tx).
>
  
Olivier Matz Feb. 17, 2015, 1:28 p.m. UTC | #7
Hi Bruce,

On 02/17/2015 01:25 PM, Bruce Richardson wrote:
> On Mon, Feb 16, 2015 at 06:34:37PM +0100, Thomas Monjalon wrote:
>> 2015-02-16 15:16, Bruce Richardson:
>>> In this specific instance, given that the application does little else, there
>>> is no real advantage to using the callbacks - it's just to have a simple example
>>> of how they can be used.
>>>
>>> Where callbacks are really designed to be useful, is for extending or augmenting
>>> hardware capabilities. Taking the example of sequence numbers - to use the most
>>> trivial example - an application could be written to take advantage of sequence
>>> numbers written to packets by the hardware which received them. However, if such
>>> an application was to be used with a NIC which does not provide sequence numbering
>>> capability, for example, anything using ixgbe driver, the application writer has
>>> two choices - either modify his application code to check each packet for
>>> a sequence number in the data path, and add it there post-rx, or alternatively,
>>> to check the NIC capabilities at initialization time, and add a callback there
>>> at initialization, if the hardware does not support it. In the latter case,
>>> the main packet processing body of the application can be written as though
>>> hardware always has sequence numbering capability, safe in the knowledge that
>>> any hardware not supporting it will be back-filled by a software fallback at
>>> initialization-time.
>>>
>>> By the same token, we could also look to extend hardware capabilities. For
>>> different filtering or hashing capabilities, there can be limits in hardware
>>> which are far less than what we need to use in software. Again, callbacks will
>>> allow the data path to be written in a way that is oblivious to the underlying
>>> hardware limits, because software will transparently fill in the gaps.
>>>
>>> Hope this makes the use case clear.
>>
>> After thinking more about these callbacks, I realize these callbacks won't
>> help, as Olivier said.
>>
>> With callback,
>> 1/ application checks device capability
>> 2/ application provides hardware emulation as DPDK callback
>> 3/ application forgets previous steps
>> 4/ application calls DPDK Rx
>> 5/ DPDK calls callback (without calling optimization)
>>
>> Without callback,
>> 1/ application checks device capability
>> 2/ application provides hardware emulation as internal function
>> 3/ application set an internal device-flag to enable this function
>> 4/ application calls DPDK Rx
>> 5/ application calls the hardware emulation if flag is set
>>
>> So the only difference is to keep persistent the device information in
>> the application instead of storing it as a function pointer in the
>> DPDK struct.
>> You can also be faster with this approach: at initialization time,
>> you can check that your NIC supports the feature and use a specific
>> mainloop that adds or not the sequence number without any runtime
>> test.
>
> That is assuming that all NICs are equal on your system. It's also assuming
> that you only have a single point in your application where you call RX or
> TX burst. In the case where you have a couple of different NICs on the system,
> or where you want to write an application to take advantage of capabilities of
> different NICs, the ability to resolve all these difference at initialization
> time is useful. The main packet handling code can be written with just the
> processing of packets in mind, rather than having to have a set of branches
> after each RX burst call, or before each TX burst call, to "smooth out" the
> different NIC capabilities.
>
> As for the option of maintaining different main loops for different NICs with
> different capabilities - that sounds like a maintenance nightmare to
> me, due to duplicated code! Callbacks is a far cleaner solution than that IMHO.

Why not just provide a function like this:

   rte_do_unsupported_stuff_by_software(m[], m_count, wanted_features,
   	dev_feature_flags)

This function can be called (or not) from the application mainloop.
You don't need to maintain several mainloops (for each device) as
the specific work will be done depending on the given flags. And the
applications that do not require these features (most applications?)
are not penalized at all.

If you have several places where you call rx in your application
and you want to factor that out, you can have your own function that
calls rx and then the function that does the additional sw work.
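
A rough sketch of what such a helper could look like, assuming two made-up feature bits. The extra `struct sw_state` parameter is an addition for the sketch (the software sequence counter has to live somewhere), and `FEAT_SEQ_NUM`, `FEAT_RX_MARK` and the `mark` field are hypothetical.

```c
#include <stdint.h>

/* Hypothetical feature bits. */
#define FEAT_SEQ_NUM (1u << 0)
#define FEAT_RX_MARK (1u << 1)

struct pkt {
	uint32_t seq;
	uint32_t mark;
};

/* State needed by the software implementations (assumption). */
struct sw_state {
	uint32_t next_seq;
};

/* Do in software whatever is wanted but missing from the device. */
static void do_unsupported_by_software(struct sw_state *st,
		struct pkt *m[], uint16_t m_count,
		uint32_t wanted_features, uint32_t dev_feature_flags)
{
	const uint32_t missing = wanted_features & ~dev_feature_flags;
	uint16_t i;

	if (missing == 0)
		return;		/* device covers everything: no work */

	for (i = 0; i < m_count; i++) {
		if (missing & FEAT_SEQ_NUM)
			m[i]->seq = st->next_seq++;
		if (missing & FEAT_RX_MARK)
			m[i]->mark = 1;
	}
}
```

An application with several RX call sites could wrap its receive call and this helper in one function of its own, as suggested above.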

Regards,
Olivier
  
Bruce Richardson Feb. 17, 2015, 1:50 p.m. UTC | #8
On Tue, Feb 17, 2015 at 02:28:02PM +0100, Olivier MATZ wrote:
> Hi Bruce,
> 
> On 02/17/2015 01:25 PM, Bruce Richardson wrote:
> >On Mon, Feb 16, 2015 at 06:34:37PM +0100, Thomas Monjalon wrote:
> >>2015-02-16 15:16, Bruce Richardson:
> >>>In this specific instance, given that the application does little else, there
> >>>is no real advantage to using the callbacks - it's just to have a simple example
> >>>of how they can be used.
> >>>
> >>>Where callbacks are really designed to be useful, is for extending or augmenting
> >>>hardware capabilities. Taking the example of sequence numbers - to use the most
> >>>trivial example - an application could be written to take advantage of sequence
> >>>numbers written to packets by the hardware which received them. However, if such
> >>>an application was to be used with a NIC which does not provide sequence numbering
> >>>capability, for example, anything using ixgbe driver, the application writer has
> >>>two choices - either modify his application code to check each packet for
> >>>a sequence number in the data path, and add it there post-rx, or alternatively,
> >>>to check the NIC capabilities at initialization time, and add a callback there
> >>>at initialization, if the hardware does not support it. In the latter case,
> >>>the main packet processing body of the application can be written as though
> >>>hardware always has sequence numbering capability, safe in the knowledge that
> >>>any hardware not supporting it will be back-filled by a software fallback at
> >>>initialization-time.
> >>>
> >>>By the same token, we could also look to extend hardware capabilities. For
> >>>different filtering or hashing capabilities, there can be limits in hardware
> >>>which are far less than what we need to use in software. Again, callbacks will
> >>>allow the data path to be written in a way that is oblivious to the underlying
> >>>hardware limits, because software will transparently fill in the gaps.
> >>>
> >>>Hope this makes the use case clear.
> >>
> >>After thinking more about these callbacks, I realize these callbacks won't
> >>help, as Olivier said.
> >>
> >>With callback,
> >>1/ application checks device capability
> >>2/ application provides hardware emulation as DPDK callback
> >>3/ application forgets previous steps
> >>4/ application calls DPDK Rx
> >>5/ DPDK calls callback (without calling optimization)
> >>
> >>Without callback,
> >>1/ application checks device capability
> >>2/ application provides hardware emulation as internal function
> >>3/ application set an internal device-flag to enable this function
> >>4/ application calls DPDK Rx
> >>5/ application calls the hardware emulation if flag is set
> >>
> >>So the only difference is to keep persistent the device information in
> >>the application instead of storing it as a function pointer in the
> >>DPDK struct.
> >>You can also be faster with this approach: at initialization time,
> >>you can check that your NIC supports the feature and use a specific
> >>mainloop that adds or not the sequence number without any runtime
> >>test.
> >
> >That is assuming that all NICs are equal on your system. It's also assuming
> >that you only have a single point in your application where you call RX or
> >TX burst. In the case where you have a couple of different NICs on the system,
> >or where you want to write an application to take advantage of capabilities of
> >different NICs, the ability to resolve all these difference at initialization
> >time is useful. The main packet handling code can be written with just the
> >processing of packets in mind, rather than having to have a set of branches
> >after each RX burst call, or before each TX burst call, to "smooth out" the
> >different NIC capabilities.
> >
> >As for the option of maintaining different main loops for different NICs with
> >different capabilities - that sounds like a maintenance nightmare to
> >me, due to duplicated code! Callbacks is a far cleaner solution than that IMHO.
> 
> Why not just provide a function like this:
> 
>   rte_do_unsupported_stuff_by_software(m[], m_count, wanted_features,
>   	dev_feature_flags)
> 
> This function can be called (or not) from the application mainloop.
> You don't need to maintain several mainloops (for each device) as
> the specific work will be done depending on the given flags. And the
> applications that do not require these features (most applications?)
> are not penalized at all.

Have you measured the performance hit due to this proposed change? In my tests
it's very, very small, even for the fastest vectorized path. If performance is
a real concern, I'm happy enough to have this as a compile-time option so that
those who can't take the small performance hit can avoid it.

/Bruce

> 
> If you have several places where you call rx in your application
> and you want to factorize it, you can have your own function that
> calls rx plus the function that does the additional sw work.
> 
> Regards,
> Olivier
>
  
Thomas Monjalon Feb. 17, 2015, 3:32 p.m. UTC | #9
2015-02-17 12:25, Bruce Richardson:
> On Mon, Feb 16, 2015 at 06:34:37PM +0100, Thomas Monjalon wrote:
> > 2015-02-16 15:16, Bruce Richardson:
> > > On Mon, Feb 16, 2015 at 03:33:40PM +0100, Olivier MATZ wrote:
> > > > Hi John,
> > > > 
> > > > On 02/13/2015 04:39 PM, John McNamara wrote:
> > > > > From: Richardson, Bruce <bruce.richardson@intel.com>
> > > > > 
> > > > > Example showing how callbacks can be used to insert a timestamp
> > > > > into each packet on RX. On TX the timestamp is used to calculate
> > > > > the packet latency through the app, in cycles.
> > > > > 
> > > > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > 
> > > > 
> > > > I'm looking at the example and I don't understand what is the advantage
> > > > of having callbacks in ethdev layer, knowing that the application can
> > > > do the same job by a standard function call.
> > > > 
> > > > What is the advantage of having callbacks compared to:
> > > > 
> > > > 
> > > > for (port = 0; port < nb_ports; port++) {
> > > > 	struct rte_mbuf *bufs[BURST_SIZE];
> > > > 	const uint16_t nb_rx = rte_eth_rx_burst(port, 0,
> > > > 			bufs, BURST_SIZE);
> > > > 	if (unlikely(nb_rx == 0))
> > > > 		continue;
> > > > 	add_timestamp(bufs, nb_rx);
> > > > 
> > > > 	const uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, 0,
> > > > 			bufs, nb_rx);
> > > > 	calc_latency(bufs, nb_tx);
> > > > 
> > > > 	if (unlikely(nb_tx < nb_rx)) {
> > > > 		uint16_t buf;
> > > > 		for (buf = nb_tx; buf < nb_rx; buf++)
> > > > 			rte_pktmbuf_free(bufs[buf]);
> > > > 	}
> > > > }
> > > > 
> > > > 
> > > > To me, doing like the code above has several advantages:
> > > > 
> > > > - code is more readable: the callback is explicitly invoked, so there is
> > > >   no risk to forget it
> > > > - code is faster: the functions calls can be inlined by the compiler
> > > > - easier to handle error cases in the callback function as the error
> > > >   code is accessible to the application
> > > > - there is no need to add code in ethdev api to do this
> > > > - if the application does not want to use callbacks (I suppose most
> > > >   applications), it won't have any performance impact
> > > > 
> > > > Regards,
> > > > Olivier
> > > 
> > > In this specific instance, given that the application does little else, there
> > > is no real advantage to using the callbacks - it's just to have a simple example
> > > of how they can be used.
> > > 
> > > Where callbacks are really designed to be useful, is for extending or augmenting
> > > hardware capabilities. Taking the example of sequence numbers - to use the most
> > > trivial example - an application could be written to take advantage of sequence
> > > numbers written to packets by the hardware which received them. However, if such
> > > an application was to be used with a NIC which does not provide sequence numbering
> > > capability, for example, anything using ixgbe driver, the application writer has
> > > two choices - either modify his application code to check each packet for
> > > a sequence number in the data path, and add it there post-rx, or alternatively,
> > > to check the NIC capabilities at initialization time, and add a callback there
> > > at initialization, if the hardware does not support it. In the latter case,
> > > the main packet processing body of the application can be written as though
> > > hardware always has sequence numbering capability, safe in the knowledge that
> > > any hardware not supporting it will be back-filled by a software fallback at 
> > > initialization-time.
> > > 
> > > By the same token, we could also look to extend hardware capabilities. For
> > > different filtering or hashing capabilities, there can be limits in hardware
> > > which are far less than what we need to use in software. Again, callbacks will
> > > allow the data path to be written in a way that is oblivious to the underlying
> > > hardware limits, because software will transparently fill in the gaps.
> > > 
> > > Hope this makes the use case clear.
> > 
> > After thinking more about these callbacks, I realize these callbacks won't
> > help, as Olivier said.
> > 
> > With callback,
> > 1/ application checks device capability
> > 2/ application provides hardware emulation as DPDK callback
> > 3/ application forgets previous steps
> > 4/ application calls DPDK Rx
> > 5/ DPDK calls callback (without calling optimization)
> > 
> > Without callback,
> > 1/ application checks device capability
> > 2/ application provides hardware emulation as internal function
> > 3/ application set an internal device-flag to enable this function
> > 4/ application calls DPDK Rx
> > 5/ application calls the hardware emulation if flag is set
> > 
> > So the only difference is to keep persistent the device information in
> > the application instead of storing it as a function pointer in the
> > DPDK struct.
> > You can also be faster with this approach: at initialization time,
> > you can check that your NIC supports the feature and use a specific
> > mainloop that adds or not the sequence number without any runtime
> > test.
> 
> That is assuming that all NICs are equal on your system. It's also assuming
> that you only have a single point in your application where you call RX or
> TX burst. In the case where you have a couple of different NICs on the system,
> or where you want to write an application to take advantage of capabilities of
> different NICs, the ability to resolve all these difference at initialization
> time is useful. The main packet handling code can be written with just the
> processing of packets in mind, rather than having to have a set of branches
> after each RX burst call, or before each TX burst call, to "smooth out" the
> different NIC capabilities. 
> 
> As for the option of maintaining different main loops for different NICs with
> different capabilities - that sounds like a maintenance nightmare to
> me, due to duplicated code! Callbacks is a far cleaner solution than that IMHO.

If you really prefer using callbacks instead of direct calls, why not implement
the callback hooks in your application by wrapping the Rx and Tx burst functions?
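
For illustration, a minimal sketch of such an application-side wrapper could
look like the code below. Everything in it is an assumption made for the
example: struct pkt stands in for struct rte_mbuf, stub_rx_burst() stands in
for rte_eth_rx_burst() so the sketch builds without DPDK, and app_port,
app_rx_hook_t, app_rx_burst() and stamp_hook() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only the field used here. */
struct pkt {
	uint64_t timestamp;
};

/* Hypothetical hook type: post-process a burst, return packets kept. */
typedef uint16_t (*app_rx_hook_t)(struct pkt **pkts, uint16_t nb_pkts,
		void *arg);

/* Per-port application state holding the optional hook. */
struct app_port {
	app_rx_hook_t rx_hook;	/* NULL when no software help is needed */
	void *rx_hook_arg;
};

/* Stub standing in for rte_eth_rx_burst(): "receives" up to 4 packets. */
static struct pkt stub_pool[4];

static uint16_t
stub_rx_burst(struct pkt **pkts, uint16_t nb_pkts)
{
	uint16_t i, n = nb_pkts < 4 ? nb_pkts : 4;

	for (i = 0; i < n; i++) {
		stub_pool[i].timestamp = 0;
		pkts[i] = &stub_pool[i];
	}
	return n;
}

/* Application wrapper: receive, then run the hook if one is registered. */
static uint16_t
app_rx_burst(const struct app_port *port, struct pkt **pkts, uint16_t nb_pkts)
{
	uint16_t nb_rx;

	assert(port != NULL && pkts != NULL);
	nb_rx = stub_rx_burst(pkts, nb_pkts);
	if (nb_rx > 0 && port->rx_hook != NULL)
		nb_rx = port->rx_hook(pkts, nb_rx, port->rx_hook_arg);
	return nb_rx;
}

/* Example hook: stamp each packet with a caller-supplied value. */
static uint16_t
stamp_hook(struct pkt **pkts, uint16_t nb_pkts, void *arg)
{
	uint64_t now = *(uint64_t *)arg;
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		pkts[i]->timestamp = now;
	return nb_pkts;
}
```

The main loop would then call app_rx_burst() everywhere it calls the Rx burst
today; a port with rx_hook left NULL only pays for one pointer test per burst.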

> > A callback could be justified for asynchronous events, or when
> > doing specific processing in the middle of the driver, for instance
> > when freeing a mbuf. But in this case it's exactly similar to do
> > the processing in the application after Rx (or before Tx).
  
Neil Horman Feb. 17, 2015, 3:49 p.m. UTC | #10
On Tue, Feb 17, 2015 at 01:50:58PM +0000, Bruce Richardson wrote:
> On Tue, Feb 17, 2015 at 02:28:02PM +0100, Olivier MATZ wrote:
> > Hi Bruce,
> > 
> > On 02/17/2015 01:25 PM, Bruce Richardson wrote:
> > >On Mon, Feb 16, 2015 at 06:34:37PM +0100, Thomas Monjalon wrote:
> > >>2015-02-16 15:16, Bruce Richardson:
> > >>>In this specific instance, given that the application does little else, there
> > >>>is no real advantage to using the callbacks - it's just to have a simple example
> > >>>of how they can be used.
> > >>>
> > >>>Where callbacks are really designed to be useful, is for extending or augmenting
> > >>>hardware capabilities. Taking the example of sequence numbers - to use the most
> > >>>trivial example - an application could be written to take advantage of sequence
> > >>>numbers written to packets by the hardware which received them. However, if such
> > >>>an application was to be used with a NIC which does not provide sequence numbering
> > >>>capability, for example, anything using ixgbe driver, the application writer has
> > >>>two choices - either modify his application code to check each packet for
> > >>>a sequence number in the data path, and add it there post-rx, or alternatively,
> > >>>to check the NIC capabilities at initialization time, and add a callback there
> > >>>at initialization, if the hardware does not support it. In the latter case,
> > >>>the main packet processing body of the application can be written as though
> > >>>hardware always has sequence numbering capability, safe in the knowledge that
> > >>>any hardware not supporting it will be back-filled by a software fallback at
> > >>>initialization-time.
> > >>>
> > >>>By the same token, we could also look to extend hardware capabilities. For
> > >>>different filtering or hashing capabilities, there can be limits in hardware
> > >>>which are far less than what we need to use in software. Again, callbacks will
> > >>>allow the data path to be written in a way that is oblivious to the underlying
> > >>>hardware limits, because software will transparently fill in the gaps.
> > >>>
> > >>>Hope this makes the use case clear.
> > >>
> > >>After thinking more about these callbacks, I realize these callbacks won't
> > >>help, as Olivier said.
> > >>
> > >>With callback,
> > >>1/ application checks device capability
> > >>2/ application provides hardware emulation as DPDK callback
> > >>3/ application forgets previous steps
> > >>4/ application calls DPDK Rx
> > >>5/ DPDK calls callback (without calling optimization)
> > >>
> > >>Without callback,
> > >>1/ application checks device capability
> > >>2/ application provides hardware emulation as internal function
> > >>3/ application set an internal device-flag to enable this function
> > >>4/ application calls DPDK Rx
> > >>5/ application calls the hardware emulation if flag is set
> > >>
> > >>So the only difference is to keep persistent the device information in
> > >>the application instead of storing it as a function pointer in the
> > >>DPDK struct.
> > >>You can also be faster with this approach: at initialization time,
> > >>you can check that your NIC supports the feature and use a specific
> > >>mainloop that adds or not the sequence number without any runtime
> > >>test.
> > >
> > >That is assuming that all NICs are equal on your system. It's also assuming
> > >that you only have a single point in your application where you call RX or
> > >TX burst. In the case where you have a couple of different NICs on the system,
> > >or where you want to write an application to take advantage of capabilities of
> > >different NICs, the ability to resolve all these difference at initialization
> > >time is useful. The main packet handling code can be written with just the
> > >processing of packets in mind, rather than having to have a set of branches
> > >after each RX burst call, or before each TX burst call, to "smooth out" the
> > >different NIC capabilities.
> > >
> > >As for the option of maintaining different main loops for different NICs with
> > >different capabilities - that sounds like a maintenance nightmare to
> > >me, due to duplicated code! Callbacks is a far cleaner solution than that IMHO.
> > 
> > Why not just provide a function like this:
> > 
> >   rte_do_unsupported_stuff_by_software(m[], m_count, wanted_features,
> >   	dev_feature_flags)
> > 
> > This function can be called (or not) from the application mainloop.
> > You don't need to maintain several mainloops (for each device) as
> > the specific work will be done depending on the given flags. And the
> > applications that do not require these features (most applications?)
> > are not penalized at all.
> 
> Have you measured the performance hit due to this proposed change? In my tests
> it's very, very small, even for the fastest vectorized path. If performance is
> a real concern, I'm happy enough to have this as a compile-time option so that
> those who can't take the small performance hit can avoid it.
> 
How can you assert performance metrics on a patch like this?  The point of the
change is to allow a callback to an application defined function, the contents
of which are effectively arbitrary.  Not saying that it's the wrong thing to do,
but you can't really claim performance is not impacted, because the details of
what's executed are outside your purview.
Neil

> /Bruce
> 
> > 
> > If you have several places where you call rx in your application
> > and you want to factorize it, you can have your own function that
> > calls rx plus the function that does the additional sw work.
> > 
> > Regards,
> > Olivier
> > 
>
  
Bruce Richardson Feb. 17, 2015, 3:58 p.m. UTC | #11
On Tue, Feb 17, 2015 at 04:32:01PM +0100, Thomas Monjalon wrote:
> 2015-02-17 12:25, Bruce Richardson:
> > On Mon, Feb 16, 2015 at 06:34:37PM +0100, Thomas Monjalon wrote:
> > > 2015-02-16 15:16, Bruce Richardson:
> > > > On Mon, Feb 16, 2015 at 03:33:40PM +0100, Olivier MATZ wrote:
> > > > > Hi John,
> > > > > 
> > > > > On 02/13/2015 04:39 PM, John McNamara wrote:
> > > > > > From: Richardson, Bruce <bruce.richardson@intel.com>
> > > > > > 
> > > > > > Example showing how callbacks can be used to insert a timestamp
> > > > > > into each packet on RX. On TX the timestamp is used to calculate
> > > > > > the packet latency through the app, in cycles.
> > > > > > 
> > > > > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > > 
> > > > > 
> > > > > I'm looking at the example and I don't understand what is the advantage
> > > > > of having callbacks in ethdev layer, knowing that the application can
> > > > > do the same job by a standard function call.
> > > > > 
> > > > > What is the advantage of having callbacks compared to:
> > > > > 
> > > > > 
> > > > > for (port = 0; port < nb_ports; port++) {
> > > > > 	struct rte_mbuf *bufs[BURST_SIZE];
> > > > > 	const uint16_t nb_rx = rte_eth_rx_burst(port, 0,
> > > > > 			bufs, BURST_SIZE);
> > > > > 	if (unlikely(nb_rx == 0))
> > > > > 		continue;
> > > > > 	add_timestamp(bufs, nb_rx);
> > > > > 
> > > > > 	const uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, 0,
> > > > > 			bufs, nb_rx);
> > > > > 	calc_latency(bufs, nb_tx);
> > > > > 
> > > > > 	if (unlikely(nb_tx < nb_rx)) {
> > > > > 		uint16_t buf;
> > > > > 		for (buf = nb_tx; buf < nb_rx; buf++)
> > > > > 			rte_pktmbuf_free(bufs[buf]);
> > > > > 	}
> > > > > }
> > > > > 
> > > > > 
> > > > > To me, doing like the code above has several advantages:
> > > > > 
> > > > > - code is more readable: the callback is explicitly invoked, so there is
> > > > >   no risk to forget it
> > > > > - code is faster: the functions calls can be inlined by the compiler
> > > > > - easier to handle error cases in the callback function as the error
> > > > >   code is accessible to the application
> > > > > - there is no need to add code in ethdev api to do this
> > > > > - if the application does not want to use callbacks (I suppose most
> > > > >   applications), it won't have any performance impact
> > > > > 
> > > > > Regards,
> > > > > Olivier
> > > > 
> > > > In this specific instance, given that the application does little else, there
> > > > is no real advantage to using the callbacks - it's just to have a simple example
> > > > of how they can be used.
> > > > 
> > > > Where callbacks are really designed to be useful, is for extending or augmenting
> > > > hardware capabilities. Taking the example of sequence numbers - to use the most
> > > > trivial example - an application could be written to take advantage of sequence
> > > > numbers written to packets by the hardware which received them. However, if such
> > > > an application was to be used with a NIC which does not provide sequence numbering
> > > > capability, for example, anything using ixgbe driver, the application writer has
> > > > two choices - either modify his application code to check each packet for
> > > > a sequence number in the data path, and add it there post-rx, or alternatively,
> > > > to check the NIC capabilities at initialization time, and add a callback there
> > > > at initialization, if the hardware does not support it. In the latter case,
> > > > the main packet processing body of the application can be written as though
> > > > hardware always has sequence numbering capability, safe in the knowledge that
> > > > any hardware not supporting it will be back-filled by a software fallback at 
> > > > initialization-time.
> > > > 
> > > > By the same token, we could also look to extend hardware capabilities. For
> > > > different filtering or hashing capabilities, there can be limits in hardware
> > > > which are far less than what we need to use in software. Again, callbacks will
> > > > allow the data path to be written in a way that is oblivious to the underlying
> > > > hardware limits, because software will transparently fill in the gaps.
> > > > 
> > > > Hope this makes the use case clear.
> > > 
> > > After thinking more about these callbacks, I realize these callbacks won't
> > > help, as Olivier said.
> > > 
> > > With callback,
> > > 1/ application checks device capability
> > > 2/ application provides hardware emulation as DPDK callback
> > > 3/ application forgets previous steps
> > > 4/ application calls DPDK Rx
> > > 5/ DPDK calls callback (without calling optimization)
> > > 
> > > Without callback,
> > > 1/ application checks device capability
> > > 2/ application provides hardware emulation as internal function
> > > 3/ application set an internal device-flag to enable this function
> > > 4/ application calls DPDK Rx
> > > 5/ application calls the hardware emulation if flag is set
> > > 
> > > So the only difference is to keep persistent the device information in
> > > the application instead of storing it as a function pointer in the
> > > DPDK struct.
> > > You can also be faster with this approach: at initialization time,
> > > you can check that your NIC supports the feature and use a specific
> > > mainloop that adds or not the sequence number without any runtime
> > > test.
> > 
> > That is assuming that all NICs are equal on your system. It's also assuming
> > that you only have a single point in your application where you call RX or
> > TX burst. In the case where you have a couple of different NICs on the system,
> > or where you want to write an application to take advantage of capabilities of
> > different NICs, the ability to resolve all these difference at initialization
> > time is useful. The main packet handling code can be written with just the
> > processing of packets in mind, rather than having to have a set of branches
> > after each RX burst call, or before each TX burst call, to "smooth out" the
> > different NIC capabilities. 
> > 
> > As for the option of maintaining different main loops for different NICs with
> > different capabilities - that sounds like a maintenance nightmare to
> > me, due to duplicated code! Callbacks is a far cleaner solution than that IMHO.
> 
> If you really prefer using callbacks instead of direct calls, why not implement
> the callback hooks in your application by wrapping the Rx and Tx burst functions?
>

Because sometimes things are generally useful and are better supplied in a
standard library than forcing multiple applications to constantly re-invent the
wheel. 

Furthermore, if we enable the hooks in DPDK, it gives us a standard API
prototype to code against, allowing us to provide reference implementation 
callbacks to create smarter ethdevs that can be used as higher-level abstractions
inside applications.

We don't require applications to know what the underlying NIC driver is that
is being used to receive pkts - we take care of all that at initialization time
by using a function pointer to allow NIC specific calls to be referenced using the
rx_burst API. An application could be written to call directly into the driver
receive or transmit functions - and such an API could be made faster than the
existing indirect calls - but instead we set things up so that all NICs look
the same to the data-path, irrespective of type or speed. In the same way, this
feature allows us to set things up at initialization time so that all NICs look
the same to the datapath in terms of capabilities offered. We won't always do
so, but it is a worthwhile use case that brings the same benefits as the generic
RX and TX function pointers do - a datapath that is agnostic to underlying
hardware.
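
As a rough sketch of that initialization-time back-fill, using the sequence
number case from earlier in the thread: the code below is not the ethdev
implementation, only a mock that builds without DPDK. struct pkt stands in
for struct rte_mbuf, struct mock_dev for the per-device state, and
CAP_SEQ_NUM, dev_require_seq_num(), sw_seq_num() and mock_rx_burst() are
hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_SEQ_NUM 0x1u	/* hypothetical capability bit */

struct pkt {			/* stand-in for struct rte_mbuf */
	uint32_t seq;
};

typedef uint16_t (*rx_cb_t)(struct pkt **pkts, uint16_t nb_pkts, void *arg);

/* Stand-in for the per-device state kept below the application. */
struct mock_dev {
	uint32_t hw_caps;	/* what the "hardware" does by itself */
	uint32_t hw_next_seq;	/* counter used by capable "hardware" */
	rx_cb_t post_rx_cb;	/* software back-fill, or NULL */
	void *post_rx_arg;
};

/* Software fallback: number the packets when the NIC cannot. */
static uint16_t
sw_seq_num(struct pkt **pkts, uint16_t nb_pkts, void *arg)
{
	uint32_t *next = arg;
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		pkts[i]->seq = (*next)++;
	return nb_pkts;
}

/* Init time: install the fallback only where the capability is missing. */
static void
dev_require_seq_num(struct mock_dev *dev, uint32_t *sw_counter)
{
	assert(dev != NULL && sw_counter != NULL);
	if (!(dev->hw_caps & CAP_SEQ_NUM)) {
		dev->post_rx_cb = sw_seq_num;
		dev->post_rx_arg = sw_counter;
	}
}

/* Stand-in for the Rx burst: "hardware" part, then optional callback. */
static uint16_t
mock_rx_burst(struct mock_dev *dev, struct pkt *pool, struct pkt **pkts,
		uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++) {
		pkts[i] = &pool[i];
		pool[i].seq = (dev->hw_caps & CAP_SEQ_NUM) ?
				dev->hw_next_seq++ : 0;
	}
	if (dev->post_rx_cb != NULL)
		nb_pkts = dev->post_rx_cb(pkts, nb_pkts, dev->post_rx_arg);
	return nb_pkts;
}
```

After dev_require_seq_num() has run once per device, the data path reads
pkts[i]->seq the same way for both kinds of NIC and never tests the
capability itself.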

Going further, once NICs can be made to provide similar capabilities in terms
of offloads - at the ethdev layer - then additional libraries which use ethdevs,
such as link bonding, as Declan has highlighted, can make use of that very easily.
Having support for that only in the application won't allow such use cases in
libraries, whereas having it in the ethdev layer allows it to be conveniently
used by any libraries besides link bonding that may want this in future.

Can I actually also flip the discussion on its head a bit? We have presented
a number of use cases where we see this functionality being useful, and we
have plans to build upon this in future to enable smarter ethdevs. Given that
this is not a large amount of code by any means, what is the compelling reason
why it should not be merged in, if it would be useful to at least some users of
DPDK?

/Bruce
  
Bruce Richardson Feb. 17, 2015, 4 p.m. UTC | #12
On Tue, Feb 17, 2015 at 10:49:24AM -0500, Neil Horman wrote:
> On Tue, Feb 17, 2015 at 01:50:58PM +0000, Bruce Richardson wrote:
> > On Tue, Feb 17, 2015 at 02:28:02PM +0100, Olivier MATZ wrote:
> > > Hi Bruce,
> > > 
> > > On 02/17/2015 01:25 PM, Bruce Richardson wrote:
> > > >On Mon, Feb 16, 2015 at 06:34:37PM +0100, Thomas Monjalon wrote:
> > > >>2015-02-16 15:16, Bruce Richardson:
> > > >>>In this specific instance, given that the application does little else, there
> > > >>>is no real advantage to using the callbacks - it's just to have a simple example
> > > >>>of how they can be used.
> > > >>>
> > > >>>Where callbacks are really designed to be useful, is for extending or augmenting
> > > >>>hardware capabilities. Taking the example of sequence numbers - to use the most
> > > >>>trivial example - an application could be written to take advantage of sequence
> > > >>>numbers written to packets by the hardware which received them. However, if such
> > > >>>an application was to be used with a NIC which does not provide sequence numbering
> > > >>>capability, for example, anything using ixgbe driver, the application writer has
> > > >>>two choices - either modify his application code to check each packet for
> > > >>>a sequence number in the data path, and add it there post-rx, or alternatively,
> > > >>>to check the NIC capabilities at initialization time, and add a callback there
> > > >>>at initialization, if the hardware does not support it. In the latter case,
> > > >>>the main packet processing body of the application can be written as though
> > > >>>hardware always has sequence numbering capability, safe in the knowledge that
> > > >>>any hardware not supporting it will be back-filled by a software fallback at
> > > >>>initialization-time.
> > > >>>
> > > >>>By the same token, we could also look to extend hardware capabilities. For
> > > >>>different filtering or hashing capabilities, there can be limits in hardware
> > > >>>which are far less than what we need to use in software. Again, callbacks will
> > > >>>allow the data path to be written in a way that is oblivious to the underlying
> > > >>>hardware limits, because software will transparently fill in the gaps.
> > > >>>
> > > >>>Hope this makes the use case clear.
> > > >>
> > > >>After thinking more about these callbacks, I realize these callbacks won't
> > > >>help, as Olivier said.
> > > >>
> > > >>With callback,
> > > >>1/ application checks device capability
> > > >>2/ application provides hardware emulation as DPDK callback
> > > >>3/ application forgets previous steps
> > > >>4/ application calls DPDK Rx
> > > >>5/ DPDK calls callback (without calling optimization)
> > > >>
> > > >>Without callback,
> > > >>1/ application checks device capability
> > > >>2/ application provides hardware emulation as internal function
> > > >>3/ application set an internal device-flag to enable this function
> > > >>4/ application calls DPDK Rx
> > > >>5/ application calls the hardware emulation if flag is set
> > > >>
> > > >>So the only difference is to keep persistent the device information in
> > > >>the application instead of storing it as a function pointer in the
> > > >>DPDK struct.
> > > >>You can also be faster with this approach: at initialization time,
> > > >>you can check that your NIC supports the feature and use a specific
> > > >>mainloop that adds or not the sequence number without any runtime
> > > >>test.
> > > >
> > > >That is assuming that all NICs are equal on your system. It's also assuming
> > > >that you only have a single point in your application where you call RX or
> > > >TX burst. In the case where you have a couple of different NICs on the system,
> > > >or where you want to write an application to take advantage of capabilities of
> > > >different NICs, the ability to resolve all these difference at initialization
> > > >time is useful. The main packet handling code can be written with just the
> > > >processing of packets in mind, rather than having to have a set of branches
> > > >after each RX burst call, or before each TX burst call, to "smooth out" the
> > > >different NIC capabilities.
> > > >
> > > >As for the option of maintaining different main loops for different NICs with
> > > >different capabilities - that sounds like a maintenance nightmare to
> > > >me, due to duplicated code! Callbacks is a far cleaner solution than that IMHO.
> > > 
> > > Why not just provide a function like this:
> > > 
> > >   rte_do_unsupported_stuff_by_software(m[], m_count, wanted_features,
> > >   	dev_feature_flags)
> > > 
> > > This function can be called (or not) from the application mainloop.
> > > You don't need to maintain several mainloops (for each device) as
> > > the specific work will be done depending on the given flags. And the
> > > applications that do not require these features (most applications?)
> > > are not penalized at all.
> > 
> > Have you measured the performance hit due to this proposed change? In my tests
> > it's very, very small, even for the fastest vectorized path. If performance is
> > a real concern, I'm happy enough to have this as a compile-time option so that
> > those who can't take the small performance hit can avoid it.
> > 
> How can you assert performance metrics on a patch like this?  The point of the
> change is to allow a callback to an application defined function, the contents
> of which are effectively arbitrary.  Not saying that it's the wrong thing to do,
> but you can't really claim performance is not impacted, because the details of
> what's executed are outside your purview.
> Neil
>
I think the performance hit being referenced is a hit due to the patch itself
without any callbacks being in use. (That was certainly my assumption in replying)

/Bruce
  
Neil Horman Feb. 17, 2015, 4:08 p.m. UTC | #13
On Tue, Feb 17, 2015 at 04:00:56PM +0000, Bruce Richardson wrote:
> On Tue, Feb 17, 2015 at 10:49:24AM -0500, Neil Horman wrote:
> > On Tue, Feb 17, 2015 at 01:50:58PM +0000, Bruce Richardson wrote:
> > > On Tue, Feb 17, 2015 at 02:28:02PM +0100, Olivier MATZ wrote:
> > > > Hi Bruce,
> > > > 
> > > > On 02/17/2015 01:25 PM, Bruce Richardson wrote:
> > > > >On Mon, Feb 16, 2015 at 06:34:37PM +0100, Thomas Monjalon wrote:
> > > > >>2015-02-16 15:16, Bruce Richardson:
> > > > >>>In this specific instance, given that the application does little else, there
> > > > >>>is no real advantage to using the callbacks - it's just to have a simple example
> > > > >>>of how they can be used.
> > > > >>>
> > > > >>>Where callbacks are really designed to be useful, is for extending or augmenting
> > > > >>>hardware capabilities. Taking the example of sequence numbers - to use the most
> > > > >>>trivial example - an application could be written to take advantage of sequence
> > > > >>>numbers written to packets by the hardware which received them. However, if such
> > > > >>>an application was to be used with a NIC which does not provide sequence numbering
> > > > >>>capability, for example, anything using ixgbe driver, the application writer has
> > > > >>>two choices - either modify his application code to check each packet for
> > > > >>>a sequence number in the data path, and add it there post-rx, or alternatively,
> > > > >>>to check the NIC capabilities at initialization time, and add a callback there
> > > > >>>at initialization, if the hardware does not support it. In the latter case,
> > > > >>>the main packet processing body of the application can be written as though
> > > > >>>hardware always has sequence numbering capability, safe in the knowledge that
> > > > >>>any hardware not supporting it will be back-filled by a software fallback at
> > > > >>>initialization-time.
> > > > >>>
> > > > >>>By the same token, we could also look to extend hardware capabilities. For
> > > > >>>different filtering or hashing capabilities, there can be limits in hardware
> > > > >>>which are far less than what we need to use in software. Again, callbacks will
> > > > >>>allow the data path to be written in a way that is oblivious to the underlying
> > > > >>>hardware limits, because software will transparently fill in the gaps.
> > > > >>>
> > > > >>>Hope this makes the use case clear.
> > > > >>
> > > > >>After thinking more about these callbacks, I realize these callbacks won't
> > > > >>help, as Olivier said.
> > > > >>
> > > > >>With callback,
> > > > >>1/ application checks device capability
> > > > >>2/ application provides hardware emulation as DPDK callback
> > > > >>3/ application forgets previous steps
> > > > >>4/ application calls DPDK Rx
> > > > >>5/ DPDK calls callback (without calling optimization)
> > > > >>
> > > > >>Without callback,
> > > > >>1/ application checks device capability
> > > > >>2/ application provides hardware emulation as internal function
> > > > >>3/ application set an internal device-flag to enable this function
> > > > >>4/ application calls DPDK Rx
> > > > >>5/ application calls the hardware emulation if flag is set
> > > > >>
> > > > >>So the only difference is to keep persistent the device information in
> > > > >>the application instead of storing it as a function pointer in the
> > > > >>DPDK struct.
> > > > >>You can also be faster with this approach: at initialization time,
> > > > >>you can check that your NIC supports the feature and use a specific
> > > > >>mainloop that adds or not the sequence number without any runtime
> > > > >>test.
> > > > >
> > > > >That is assuming that all NICs are equal on your system. It's also assuming
> > > > >that you only have a single point in your application where you call RX or
> > > > >TX burst. In the case where you have a couple of different NICs on the system,
> > > > >or where you want to write an application to take advantage of capabilities of
> > > > >different NICs, the ability to resolve all these differences at initialization
> > > > >time is useful. The main packet handling code can be written with just the
> > > > >processing of packets in mind, rather than having to have a set of branches
> > > > >after each RX burst call, or before each TX burst call, to "smooth out" the
> > > > >different NIC capabilities.
> > > > >
> > > > >As for the option of maintaining different main loops for different NICs with
> > > > >different capabilities - that sounds like a maintenance nightmare to
> > > > >me, due to duplicated code! Callbacks is a far cleaner solution than that IMHO.
> > > > 
> > > > Why not just provide a function like this:
> > > > 
> > > >   rte_do_unsupported_stuff_by_software(m[], m_count, wanted_features,
> > > >   	dev_feature_flags)
> > > > 
> > > > This function can be called (or not) from the application mainloop.
> > > > You don't need to maintain several mainloops (for each device) as
> > > > the specific work will be done depending on the given flags. And the
> > > > applications that do not require these features (most applications?)
> > > > are not penalized at all.
> > > 
> > > Have you measured the performance hit due to this proposed change? In my tests
> > > it's very, very small, even for the fastest vectorized path. If performance is
> > > a real concern, I'm happy enough to have this as a compile-time option so that
> > > those who can't take the small performance hit can avoid it.
> > > 
> > How can you assert performance metrics on a patch like this?  The point of the
> > change is to allow a callback to an application defined function, the contents
> > of which are effectively arbitrary.  Not saying that it's the wrong thing to do,
> > but you can't really claim performance is not impacted, because the details of
> > what's executed are outside your purview.
> > Neil
> >
> I think the performance hit being referenced is a hit due to the patch itself
> without any callbacks being in use. (That was certainly my assumption in replying)
> 
I figured it was, but that's still something of a misnomer.  Of course this
change on its own is negligible in its performance impact.  By itself, the
impact is that of a branch that is unlikely to be taken, which is to say almost
zero.  But that's not an actionable number, because the only time that performance
is attainable is if the user doesn't use it.  Since you're proposing a patch that
invokes application-registered callbacks in a very fast path, I think it's
important to state very clearly that these callbacks will have a significant
performance impact that individual applications will have to measure and be
cognizant of.
Neil

> /Bruce
>
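
A minimal sketch of the "branch that is unlikely to be taken" described
above. It does not use DPDK itself: struct pkt, struct rx_queue, driver_rx()
and rx_burst() are hypothetical stand-ins, and the hook shown is only an
assumption about how such a check could look, not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang branch hint, similar in spirit to DPDK's unlikely() macro. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for struct rte_mbuf. */
struct pkt {
	uint64_t stamp;
};

typedef uint16_t (*rx_cb_fn)(struct pkt **pkts, uint16_t nb_pkts, void *arg);

/* Hypothetical per-queue state; cb stays NULL unless the app registers one. */
struct rx_queue {
	struct pkt *pool;   /* packets the fake driver hands out */
	uint16_t avail;     /* how many of them are ready */
	rx_cb_fn cb;
	void *cb_arg;
};

/* Stand-in for the driver receive function. */
static uint16_t
driver_rx(struct rx_queue *q, struct pkt **pkts, uint16_t max)
{
	uint16_t i, n = q->avail < max ? q->avail : max;

	for (i = 0; i < n; i++)
		pkts[i] = &q->pool[i];
	return n;
}

/* RX wrapper: with no callback registered, the only extra cost is one test. */
static uint16_t
rx_burst(struct rx_queue *q, struct pkt **pkts, uint16_t max)
{
	uint16_t n = driver_rx(q, pkts, max);

	if (unlikely(q->cb != NULL))
		n = q->cb(pkts, n, q->cb_arg);   /* cost now depends on the app */
	return n;
}

/* Example callback: writes the value pointed to by arg into each packet. */
static uint16_t
stamp_all(struct pkt **pkts, uint16_t nb_pkts, void *arg)
{
	uint64_t now = *(uint64_t *)arg;
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		pkts[i]->stamp = now;
	return nb_pkts;
}
```

With cb left NULL, rx_burst() only pays for the pointer test. Once stamp_all()
is registered, every burst also pays for whatever the callback does, which is
the part each application would have to measure for itself.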
  
Bruce Richardson Feb. 17, 2015, 4:15 p.m. UTC | #14
On Tue, Feb 17, 2015 at 11:08:10AM -0500, Neil Horman wrote:
> I figured it was, but that's still something of a misnomer.  Of course this
> change on its own is negligible in its performance impact.  By itself, the
> impact is that of a branch that is unlikely to be taken, which is to say almost
> zero.  But that's not an actionable number, because the only time that performance
> is attainable is if the user doesn't use it.  Since you're proposing a patch that
> invokes application-registered callbacks in a very fast path, I think it's
> important to state very clearly that these callbacks will have a significant
> performance impact that individual applications will have to measure and be
> cognizant of.
> Neil
>
Yes, agreed.
But if the app were to implement the same functionality directly rather
than via callbacks, the performance would be about the same (sometimes better,
sometimes worse, I suspect, depending on how it's done).
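
To illustrate the comparison, here is a small sketch of the two variants
discussed in this thread: a per-port flag tested by the application after RX,
and the same decision stored as a function pointer at init time. All names
(struct pkt, add_seq, port_flag, port_cb) are hypothetical and stand in for
the real mbuf and ethdev structures; this is a sketch, not DPDK code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf. */
struct pkt {
	uint32_t seq;
};

/* Software fallback shared by both variants: number packets consecutively. */
static uint16_t
add_seq(struct pkt **pkts, uint16_t nb, void *arg)
{
	uint32_t *next = arg;
	uint16_t i;

	for (i = 0; i < nb; i++)
		pkts[i]->seq = (*next)++;
	return nb;
}

/* Variant A: the application keeps a flag and tests it after each RX burst. */
struct port_flag {
	bool needs_sw_seq;
	uint32_t next_seq;
};

static uint16_t
post_rx_with_flag(struct port_flag *p, struct pkt **pkts, uint16_t nb)
{
	if (p->needs_sw_seq)
		nb = add_seq(pkts, nb, &p->next_seq);
	return nb;
}

/* Variant B: the same decision is stored as a function pointer at init. */
typedef uint16_t (*rx_cb_fn)(struct pkt **pkts, uint16_t nb, void *arg);

struct port_cb {
	rx_cb_fn cb;        /* NULL when the hardware already does the work */
	uint32_t next_seq;
};

static uint16_t
post_rx_with_cb(struct port_cb *p, struct pkt **pkts, uint16_t nb)
{
	if (p->cb != NULL)
		nb = p->cb(pkts, nb, &p->next_seq);
	return nb;
}
```

Both variants test one value per burst and then run the same add_seq() body,
so the per-packet work is identical. What differs is where the state lives
and whether the compiler can inline the call.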
  
Neil Horman Feb. 17, 2015, 7:27 p.m. UTC | #15
On Tue, Feb 17, 2015 at 04:15:09PM +0000, Bruce Richardson wrote:
> Yes, agreed.
> But if the app were to implement the same functionality directly rather
> than via callbacks, the performance would be about the same (sometimes better,
> sometimes worse, I suspect, depending on how it's done).
> 
No argument, but doing so makes it clearly apparent to the application developer
that they are adding cycles to a hot path.  That becomes much more obfuscated
when you register callbacks, and so it is imperative to not make ambiguous
claims like "the performance impact is zero".
Neil
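
One way an application could keep that cost visible is to wrap each callback
in a small accounting shim. The sketch below is only an illustration under
assumptions: struct timed_cb and timed_cb_run() are hypothetical names, and
the clock is passed in as a function pointer so that a real application could
supply a TSC read while the example uses a fake counter.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t (*cb_fn)(void **pkts, uint16_t nb, void *arg);
typedef uint64_t (*clock_fn)(void);

/* Hypothetical wrapper state: which callback to run and what it has cost. */
struct timed_cb {
	cb_fn inner;        /* the application's real callback */
	void *inner_arg;
	clock_fn now;       /* tick source, e.g. a TSC read in a real app */
	uint64_t cycles;    /* total ticks spent inside inner */
	uint64_t pkts;      /* total packets returned by inner */
};

/* Register this in place of inner, with the struct timed_cb as its arg. */
static uint16_t
timed_cb_run(void **pkts, uint16_t nb, void *arg)
{
	struct timed_cb *t = arg;
	uint64_t start = t->now();
	uint16_t ret = t->inner(pkts, nb, t->inner_arg);

	t->cycles += t->now() - start;
	t->pkts += ret;
	return ret;
}

/* Fake clock for demonstration: advances 10 ticks on every read. */
static uint64_t fake_ticks;

static uint64_t
fake_clock(void)
{
	fake_ticks += 10;
	return fake_ticks;
}

/* Trivial callback that keeps every packet. */
static uint16_t
pass_through(void **pkts, uint16_t nb, void *arg)
{
	(void)pkts;
	(void)arg;
	return nb;
}
```

Dividing cycles by pkts gives an average per-packet cost for the callback
alone, which can be reported separately from the rest of the fast path. Note
that the shim adds two clock reads per burst, so it is itself not free.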
  

Patch

diff --git a/examples/rxtx_callbacks/Makefile b/examples/rxtx_callbacks/Makefile
new file mode 100644
index 0000000..4a5d99f
--- /dev/null
+++ b/examples/rxtx_callbacks/Makefile
@@ -0,0 +1,57 @@ 
+#   BSD LICENSE
+#
+#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+ifeq ($(RTE_SDK),)
+$(error "Please define RTE_SDK environment variable")
+endif
+
+# Default target, can be overridden by command line or environment
+RTE_TARGET ?= x86_64-native-linuxapp-gcc
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# binary name
+APP = basicfwd
+
+# all source are stored in SRCS-y
+SRCS-y := basicfwd.c
+
+CFLAGS += $(WERROR_FLAGS)
+
+# workaround for a gcc bug with noreturn attribute
+# http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12603
+ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
+CFLAGS_basicfwd.o += -Wno-return-type
+endif
+
+EXTRA_CFLAGS += -O3 -g -Wfatal-errors
+
+include $(RTE_SDK)/mk/rte.extapp.mk
diff --git a/examples/rxtx_callbacks/basicfwd.c b/examples/rxtx_callbacks/basicfwd.c
new file mode 100644
index 0000000..0209bf4
--- /dev/null
+++ b/examples/rxtx_callbacks/basicfwd.c
@@ -0,0 +1,222 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <inttypes.h>
+#include <rte_eal.h>
+#include <rte_ethdev.h>
+#include <rte_cycles.h>
+#include <rte_lcore.h>
+#include <rte_mbuf.h>
+#include "basicfwd.h"
+
+#define RX_RING_SIZE 128
+#define TX_RING_SIZE 512
+
+#define NUM_MBUFS 8191
+#define MBUF_SIZE (1600 + sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)
+#define MBUF_CACHE_SIZE 250
+#define BURST_SIZE 32
+
+static const struct rte_eth_conf port_conf_default = {
+	.rxmode = { .max_rx_pkt_len = ETHER_MAX_LEN, },
+};
+
+static unsigned nb_ports;
+
+static struct {
+	uint64_t total_cycles;
+	uint64_t total_pkts;
+} latency_numbers;
+
+
+static uint16_t
+add_timestamps(uint8_t port __rte_unused, uint16_t qidx __rte_unused,
+		struct rte_mbuf **pkts, uint16_t nb_pkts, void *_ __rte_unused)
+{
+	unsigned i;
+	uint64_t now = rte_rdtsc();
+	for (i = 0; i < nb_pkts; i++)
+		pkts[i]->udata64 = now;
+	return nb_pkts;
+}
+
+static uint16_t
+calc_latency(uint8_t port __rte_unused, uint16_t qidx __rte_unused,
+		struct rte_mbuf **pkts, uint16_t nb_pkts, void *_ __rte_unused)
+{
+	uint64_t cycles = 0;
+	uint64_t now = rte_rdtsc();
+	unsigned i;
+	for (i = 0; i < nb_pkts; i++)
+		cycles += now - pkts[i]->udata64;
+	latency_numbers.total_cycles += cycles;
+	latency_numbers.total_pkts += nb_pkts;
+
+	if (latency_numbers.total_pkts > (100 * 1000 * 1000ULL)) {
+		printf("Latency = %"PRIu64" cycles\n",
+		latency_numbers.total_cycles / latency_numbers.total_pkts);
+		latency_numbers.total_cycles = latency_numbers.total_pkts = 0;
+	}
+	return nb_pkts;
+}
+
+/*
+ * Initialises a given port using global settings and with the rx buffers
+ * coming from the mbuf_pool passed as parameter
+ */
+static inline int
+port_init(uint8_t port, struct rte_mempool *mbuf_pool)
+{
+	struct rte_eth_conf port_conf = port_conf_default;
+	const uint16_t rx_rings = 1, tx_rings = 1;
+	int retval;
+	uint16_t q;
+
+	if (port >= rte_eth_dev_count())
+		return -1;
+
+	retval = rte_eth_dev_configure(port, rx_rings, tx_rings, &port_conf);
+	if (retval != 0)
+		return retval;
+
+	for (q = 0; q < rx_rings; q++) {
+		retval = rte_eth_rx_queue_setup(port, q, RX_RING_SIZE,
+				rte_eth_dev_socket_id(port), NULL, mbuf_pool);
+		if (retval < 0)
+			return retval;
+	}
+
+	for (q = 0; q < tx_rings; q++) {
+		retval = rte_eth_tx_queue_setup(port, q, TX_RING_SIZE,
+				rte_eth_dev_socket_id(port), NULL);
+		if (retval < 0)
+			return retval;
+	}
+
+	retval = rte_eth_dev_start(port);
+	if (retval < 0)
+		return retval;
+
+	struct ether_addr addr;
+	rte_eth_macaddr_get(port, &addr);
+	printf("Port %u MAC: %02"PRIx8" %02"PRIx8" %02"PRIx8
+			" %02"PRIx8" %02"PRIx8" %02"PRIx8"\n",
+			(unsigned)port,
+			addr.addr_bytes[0], addr.addr_bytes[1],
+			addr.addr_bytes[2], addr.addr_bytes[3],
+			addr.addr_bytes[4], addr.addr_bytes[5]);
+
+	rte_eth_promiscuous_enable(port);
+	rte_eth_add_rx_callback(port, 0, add_timestamps, NULL);
+	rte_eth_add_tx_callback(port, 0, calc_latency, NULL);
+
+	return 0;
+}
+
+/*
+ * Main thread that does the work, reading from each port
+ * and writing to its paired port (port ^ 1)
+ */
+static  __attribute__((noreturn)) void
+lcore_main(void)
+{
+	uint8_t port;
+	for (port = 0; port < nb_ports; port++)
+		if (rte_eth_dev_socket_id(port) >= 0 &&
+				rte_eth_dev_socket_id(port) !=
+						(int)rte_socket_id())
+			printf("WARNING, port %u is on remote NUMA node to "
+					"polling thread.\n\tPerformance will "
+					"not be optimal.\n", port);
+
+	printf("\nCore %u forwarding packets. [Ctrl+C to quit]\n",
+			rte_lcore_id());
+	for (;;) {
+		for (port = 0; port < nb_ports; port++) {
+			struct rte_mbuf *bufs[BURST_SIZE];
+			const uint16_t nb_rx = rte_eth_rx_burst(port, 0,
+					bufs, BURST_SIZE);
+			if (unlikely(nb_rx == 0))
+				continue;
+			const uint16_t nb_tx = rte_eth_tx_burst(port ^ 1, 0,
+					bufs, nb_rx);
+			if (unlikely(nb_tx < nb_rx)) {
+				uint16_t buf;
+				for (buf = nb_tx; buf < nb_rx; buf++)
+					rte_pktmbuf_free(bufs[buf]);
+			}
+		}
+	}
+}
+
+/* Main function, does initialisation and calls the per-lcore functions */
+int
+MAIN(int argc, char *argv[])
+{
+	struct rte_mempool *mbuf_pool;
+	uint8_t portid;
+
+	/* init EAL */
+	int ret = rte_eal_init(argc, argv);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
+	argc -= ret;
+	argv += ret;
+
+	nb_ports = rte_eth_dev_count();
+	if (nb_ports < 2 || (nb_ports & 1))
+		rte_exit(EXIT_FAILURE, "Error: number of ports must be even\n");
+
+	mbuf_pool = rte_mempool_create("MBUF_POOL", NUM_MBUFS * nb_ports,
+				       MBUF_SIZE, MBUF_CACHE_SIZE,
+				       sizeof(struct rte_pktmbuf_pool_private),
+				       rte_pktmbuf_pool_init, NULL,
+				       rte_pktmbuf_init, NULL,
+				       rte_socket_id(), 0);
+	if (mbuf_pool == NULL)
+		rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");
+
+	/* initialize all ports */
+	for (portid = 0; portid < nb_ports; portid++)
+		if (port_init(portid, mbuf_pool) != 0)
+			rte_exit(EXIT_FAILURE, "Cannot init port %"PRIu8"\n",
+					portid);
+
+	if (rte_lcore_count() > 1)
+		printf("\nWARNING: Too many lcores enabled - App uses only 1 lcore\n");
+
+	/* call lcore_main on master core only */
+	lcore_main();
+	return 0;
+}
diff --git a/examples/rxtx_callbacks/basicfwd.h b/examples/rxtx_callbacks/basicfwd.h
new file mode 100644
index 0000000..3797b5d
--- /dev/null
+++ b/examples/rxtx_callbacks/basicfwd.h
@@ -0,0 +1,46 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef BASICFWD_H
+#define BASICFWD_H
+
+
+#ifdef RTE_EXEC_ENV_BAREMETAL
+#define MAIN _main
+#else
+#define MAIN main
+#endif
+
+int MAIN(int argc, char *argv[]);
+
+#endif /* BASICFWD_H */