[dpdk-dev,RFC] lib/librte_ether: consistent PMD batching behavior

Message ID 1484905876-60165-1-git-send-email-zhiyong.yang@intel.com
State RFC, archived
Delegated to: Thomas Monjalon
Headers show

Checks

Context Check Description
ci/Intel compilation success Compilation OK
ci/checkpatch success coding style OK

Commit Message

Yang, Zhiyong Jan. 20, 2017, 9:51 a.m.
The rte_eth_tx_burst() function in the file Rte_ethdev.h is invoked to
transmit output packets on the output queue for DPDK applications as
follows.

static inline uint16_t
rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
                 struct rte_mbuf **tx_pkts, uint16_t nb_pkts);

Note: The fourth parameter nb_pkts: The number of packets to transmit.
The rte_eth_tx_burst() function returns the number of packets it actually
sent. The return value equal to *nb_pkts* means that all packets have been
sent, and this is likely to signify that other output packets could be
immediately transmitted again. Applications that implement a "send as many
packets to transmit as possible" policy can check this specific case and
keep invoking the rte_eth_tx_burst() function until a value less than
*nb_pkts* is returned.

When you call TX only once in rte_eth_tx_burst, you may get different
behaviors from different PMDs. One problem that every DPDK user has to
face is that they need to take the policy into consideration at the app-
lication level when using any specific PMD to send the packets whether or
not it is necessary, which brings usage complexities and makes DPDK users
easily confused since they have to learn the details on TX function limit
of specific PMDs and have to handle the different return value: the number
of packets transmitted successfully for various PMDs. Some PMDs Tx func-
tions have a limit of sending at most 32 packets for every invoking, some
PMDs have another limit of at most 64 packets once, another ones have imp-
lemented to send as many packets to transmit as possible, etc. This will
easily cause wrong usage for DPDK users.

This patch proposes to implement the above policy in DPDK lib in order to
simplify the application implementation and avoid the incorrect invoking
as well. So, DPDK Users don't need to consider the implementation policy
and to write duplicated code at the application level again when sending
packets. In addition to it, the users don't need to know the difference of
specific PMD TX and can transmit the arbitrary number of packets as they
expect when invoking TX API rte_eth_tx_burst, then check the return value
to get the number of packets actually sent.

How to implement the policy in DPDK lib? Two solutions are proposed below.

Solution 1:
Implement the wrapper functions to remove some limits for each specific
PMDs as i40e_xmit_pkts_simple and ixgbe_xmit_pkts_simple do like that.

Solution 2:
Implement the policy in the function rte_eth_tx_burst() at the ethdev lay-
er in a more consistent batching way. Make best effort to send *nb_pkts*
packets with bursts of no more than 32 by default since many DPDK TX PMDs
are using this max TX burst size(32). In addition, one data member which
defines the max TX burst size such as "uint16_t max_tx_burst_pkts;"will be
added to rte_eth_dev_data, which drivers can override if they work with
bursts of 64 or other NB(thanks for Bruce <bruce.richardson@intel.com>'s
suggestion). This can reduce the performance impacting to the lowest limit.

I prefer the latter between the 2 solutions because it makes DPDK code more
consistent and easier and avoids to write too much duplicate logic in DPDK
source code. In addition, I think no or a little performance drop is
brought by solution 2. But ABI change will be introduced.

In fact, the current rte_eth_rx_burst() function is using the similar
mechanism and faces the same problem as rte_eth_tx_burst().

static inline uint16_t
rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
                 struct rte_mbuf **rx_pkts, const uint16_t nb_pkts);

Applications are responsible of implementing the policy "retrieve as many
received packets as possible", and check this specific case and keep
invoking the rte_eth_rx_burst() function until a value less than *nb_pkts*
is returned.

The patch proposes to apply the above method to rte_eth_rx_burst() as well.

In summary, The purpose of the RFC makes the job easier and more simple for
driver writers and avoids to write too much duplicate code at the applica-
tion level.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---
 lib/librte_ether/rte_ethdev.h | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

Comments

Andrew Rybchenko Jan. 20, 2017, 10:26 a.m. | #1
On 01/20/2017 12:51 PM, Zhiyong Yang wrote:
> The rte_eth_tx_burst() function in the file Rte_ethdev.h is invoked to
> transmit output packets on the output queue for DPDK applications as
> follows.
>
> static inline uint16_t
> rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>                   struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
>
> Note: The fourth parameter nb_pkts: The number of packets to transmit.
> The rte_eth_tx_burst() function returns the number of packets it actually
> sent. The return value equal to *nb_pkts* means that all packets have been
> sent, and this is likely to signify that other output packets could be
> immediately transmitted again. Applications that implement a "send as many
> packets to transmit as possible" policy can check this specific case and
> keep invoking the rte_eth_tx_burst() function until a value less than
> *nb_pkts* is returned.
>
> When you call TX only once in rte_eth_tx_burst, you may get different
> behaviors from different PMDs. One problem that every DPDK user has to
> face is that they need to take the policy into consideration at the app-
> lication level when using any specific PMD to send the packets whether or
> not it is necessary, which brings usage complexities and makes DPDK users
> easily confused since they have to learn the details on TX function limit
> of specific PMDs and have to handle the different return value: the number
> of packets transmitted successfully for various PMDs. Some PMDs Tx func-
> tions have a limit of sending at most 32 packets for every invoking, some
> PMDs have another limit of at most 64 packets once, another ones have imp-
> lemented to send as many packets to transmit as possible, etc. This will
> easily cause wrong usage for DPDK users.
>
> This patch proposes to implement the above policy in DPDK lib in order to
> simplify the application implementation and avoid the incorrect invoking
> as well. So, DPDK Users don't need to consider the implementation policy
> and to write duplicated code at the application level again when sending
> packets. In addition to it, the users don't need to know the difference of
> specific PMD TX and can transmit the arbitrary number of packets as they
> expect when invoking TX API rte_eth_tx_burst, then check the return value
> to get the number of packets actually sent.
>
> How to implement the policy in DPDK lib? Two solutions are proposed below.
>
> Solution 1:
> Implement the wrapper functions to remove some limits for each specific
> PMDs as i40e_xmit_pkts_simple and ixgbe_xmit_pkts_simple do like that.

IMHO, the solution is a bit better since it:
  1. Does not affect other PMDs at all
  2. Could be a bit faster for the PMDs which require it since has no 
indirect
      function call on each iteration
  3. No ABI change

> Solution 2:
> Implement the policy in the function rte_eth_tx_burst() at the ethdev lay-
> er in a more consistent batching way. Make best effort to send *nb_pkts*
> packets with bursts of no more than 32 by default since many DPDK TX PMDs
> are using this max TX burst size(32). In addition, one data member which
> defines the max TX burst size such as "uint16_t max_tx_burst_pkts;"will be
> added to rte_eth_dev_data, which drivers can override if they work with
> bursts of 64 or other NB(thanks for Bruce <bruce.richardson@intel.com>'s
> suggestion). This can reduce the performance impacting to the lowest limit.

I see no noticeable difference in performance, so don't mind if this is 
finally choosen.
Just be sure that you update all PMDs to set reasonable default values, 
or may be
even better, set UINT16_MAX in generic place - 0 is a bad default here.
(Lost few seconds wondering why nothing is sent and cannot stop)

> I prefer the latter between the 2 solutions because it makes DPDK code more
> consistent and easier and avoids to write too much duplicate logic in DPDK
> source code. In addition, I think no or a little performance drop is
> brought by solution 2. But ABI change will be introduced.
>
> In fact, the current rte_eth_rx_burst() function is using the similar
> mechanism and faces the same problem as rte_eth_tx_burst().
>
> static inline uint16_t
> rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
>                   struct rte_mbuf **rx_pkts, const uint16_t nb_pkts);
>
> Applications are responsible of implementing the policy "retrieve as many
> received packets as possible", and check this specific case and keep
> invoking the rte_eth_rx_burst() function until a value less than *nb_pkts*
> is returned.
>
> The patch proposes to apply the above method to rte_eth_rx_burst() as well.
>
> In summary, The purpose of the RFC makes the job easier and more simple for
> driver writers and avoids to write too much duplicate code at the applica-
> tion level.
>
> Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
> ---
>   lib/librte_ether/rte_ethdev.h | 41 +++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 39 insertions(+), 2 deletions(-)
>
> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> index 1c356c1..6fa83cf 100644
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -1712,6 +1712,9 @@ struct rte_eth_dev_data {
>   	uint32_t min_rx_buf_size;
>   	/**< Common rx buffer size handled by all queues */
>   
> +	uint16_t max_rx_burst_pkts;
> +	uint16_t max_tx_burst_pkts;
> +
>   	uint64_t rx_mbuf_alloc_failed; /**< RX ring mbuf allocation failures. */
>   	struct ether_addr* mac_addrs;/**< Device Ethernet Link address. */
>   	uint64_t mac_pool_sel[ETH_NUM_RECEIVE_MAC_ADDR];
> @@ -2695,11 +2698,15 @@ int rte_eth_dev_set_vlan_pvid(uint8_t port_id, uint16_t pvid, int on);
>    *   of pointers to *rte_mbuf* structures effectively supplied to the
>    *   *rx_pkts* array.
>    */
> +
>   static inline uint16_t
>   rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
>   		 struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
>   {
>   	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	int16_t nb_rx = 0;
> +	uint16_t pkts = 0;
> +	uint16_t rx_nb_pkts = nb_pkts;
>   
>   #ifdef RTE_LIBRTE_ETHDEV_DEBUG
>   	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
> @@ -2710,8 +2717,20 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
>   		return 0;
>   	}
>   #endif
> -	int16_t nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
> +	if (likely(nb_pkts <= dev->data->max_rx_burst_pkts))
> +		return (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
>   			rx_pkts, nb_pkts);
> +	while (rx_nb_pkts) {
> +		uint16_t num_burst = RTE_MIN(nb_pkts,
> +					      dev->data->max_rx_burst_pkts);
> +
> +		pkts = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
> +						&rx_pkts[nb_rx], num_burst);
> +		nb_rx += pkts;
> +		rx_nb_pkts -= pkts;
> +		if (pkts < num_burst)
> +			break;
> +	}
>   
>   #ifdef RTE_ETHDEV_RXTX_CALLBACKS
>   	struct rte_eth_rxtx_callback *cb = dev->post_rx_burst_cbs[queue_id];
> @@ -2833,11 +2852,13 @@ rte_eth_rx_descriptor_done(uint8_t port_id, uint16_t queue_id, uint16_t offset)
>    *   the transmit ring. The return value can be less than the value of the
>    *   *tx_pkts* parameter when the transmit ring is full or has been filled up.
>    */
> +
>   static inline uint16_t
>   rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>   		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
>   {
>   	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	uint16_t nb_tx = 0;
>   
>   #ifdef RTE_LIBRTE_ETHDEV_DEBUG
>   	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
> @@ -2860,8 +2881,24 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>   		} while (cb != NULL);
>   	}
>   #endif
> +	if (likely(nb_pkts <= dev->data->max_tx_burst_pkts))
> +		return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id],
> +						tx_pkts, nb_pkts);
> +
> +	while (nb_pkts) {
> +		uint16_t num_burst = RTE_MIN(nb_pkts,
> +					     dev->data->max_tx_burst_pkts);
> +		uint16_t pkts;
> +
> +		pkts = (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id],
> +						&tx_pkts[nb_tx], num_burst);
> +		nb_tx += pkts;
> +		nb_pkts -= pkts;
> +		if (pkts < num_burst)
> +			break;
> +	}
>   
> -	return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts);
> +	return nb_tx;
>   }
>   
>   /**
Ananyev, Konstantin Jan. 20, 2017, 11:24 a.m. | #2
> 
> From: Andrew Rybchenko [mailto:arybchenko@solarflare.com]
> Sent: Friday, January 20, 2017 10:26 AM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org
> Cc: thomas.monjalon@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [RFC] lib/librte_ether: consistent PMD batching behavior
> 
> On 01/20/2017 12:51 PM, Zhiyong Yang wrote:
> The rte_eth_tx_burst() function in the file Rte_ethdev.h is invoked to
> transmit output packets on the output queue for DPDK applications as
> follows.
> 
> static inline uint16_t
> rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>                  struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
> 
> Note: The fourth parameter nb_pkts: The number of packets to transmit.
> The rte_eth_tx_burst() function returns the number of packets it actually
> sent. The return value equal to *nb_pkts* means that all packets have been
> sent, and this is likely to signify that other output packets could be
> immediately transmitted again. Applications that implement a "send as many
> packets to transmit as possible" policy can check this specific case and
> keep invoking the rte_eth_tx_burst() function until a value less than
> *nb_pkts* is returned.
> 
> When you call TX only once in rte_eth_tx_burst, you may get different
> behaviors from different PMDs. One problem that every DPDK user has to
> face is that they need to take the policy into consideration at the app-
> lication level when using any specific PMD to send the packets whether or
> not it is necessary, which brings usage complexities and makes DPDK users
> easily confused since they have to learn the details on TX function limit
> of specific PMDs and have to handle the different return value: the number
> of packets transmitted successfully for various PMDs. Some PMDs Tx func-
> tions have a limit of sending at most 32 packets for every invoking, some
> PMDs have another limit of at most 64 packets once, another ones have imp-
> lemented to send as many packets to transmit as possible, etc. This will
> easily cause wrong usage for DPDK users.
> 
> This patch proposes to implement the above policy in DPDK lib in order to
> simplify the application implementation and avoid the incorrect invoking
> as well. So, DPDK Users don't need to consider the implementation policy
> and to write duplicated code at the application level again when sending
> packets. In addition to it, the users don't need to know the difference of
> specific PMD TX and can transmit the arbitrary number of packets as they
> expect when invoking TX API rte_eth_tx_burst, then check the return value
> to get the number of packets actually sent.
> 
> How to implement the policy in DPDK lib? Two solutions are proposed below.
> 
> Solution 1:
> Implement the wrapper functions to remove some limits for each specific
> PMDs as i40e_xmit_pkts_simple and ixgbe_xmit_pkts_simple do like that.
> 
> > IMHO, the solution is a bit better since it:
> > 1. Does not affect other PMDs at all
> > 2. Could be a bit faster for the PMDs which require it since has no indirect
> >    function call on each iteration
> > 3. No ABI change

I also would prefer solution number 1 for the reasons outlined by Andrew above.
Also, IMO current limitation for number of packets to TX in some Intel PMD TX routines
are sort of artificial:
- they are not caused by any real HW limitations
- avoiding them at PMD level shouldn't cause any performance or functional degradation.
So I don't see any good reason why instead of fixing these limitations in
our own PMDs we are trying to push them to the upper (rte_ethdev) layer.

Konstantin

> 
> 
> Solution 2:
> Implement the policy in the function rte_eth_tx_burst() at the ethdev lay-
> er in a more consistent batching way. Make best effort to send *nb_pkts*
> packets with bursts of no more than 32 by default since many DPDK TX PMDs
> are using this max TX burst size(32). In addition, one data member which
> defines the max TX burst size such as "uint16_t max_tx_burst_pkts;"will be
> added to rte_eth_dev_data, which drivers can override if they work with
> bursts of 64 or other NB(thanks for Bruce <bruce.richardson@intel.com>'s
> suggestion). This can reduce the performance impacting to the lowest limit.
> 
> > I see no noticeable difference in performance, so don't mind if this is finally choosen.
> > Just be sure that you update all PMDs to set reasonable default values, or may be
> > even better, set UINT16_MAX in generic place - 0 is a bad default here.
> > (Lost few seconds wondering why nothing is sent and cannot stop)
> 
> 
> I prefer the latter between the 2 solutions because it makes DPDK code more
> consistent and easier and avoids to write too much duplicate logic in DPDK
> source code. In addition, I think no or a little performance drop is
> brought by solution 2. But ABI change will be introduced.
> 
> In fact, the current rte_eth_rx_burst() function is using the similar
> mechanism and faces the same problem as rte_eth_tx_burst().
> 
> static inline uint16_t
> rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts);
> 
> Applications are responsible of implementing the policy "retrieve as many
> received packets as possible", and check this specific case and keep
> invoking the rte_eth_rx_burst() function until a value less than *nb_pkts*
> is returned.
> 
> The patch proposes to apply the above method to rte_eth_rx_burst() as well.
> 
> In summary, The purpose of the RFC makes the job easier and more simple for
> driver writers and avoids to write too much duplicate code at the applica-
> tion level.
>
Bruce Richardson Jan. 20, 2017, 11:48 a.m. | #3
On Fri, Jan 20, 2017 at 11:24:40AM +0000, Ananyev, Konstantin wrote:
> > 
> > From: Andrew Rybchenko [mailto:arybchenko@solarflare.com]
> > Sent: Friday, January 20, 2017 10:26 AM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org
> > Cc: thomas.monjalon@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [RFC] lib/librte_ether: consistent PMD batching behavior
> > 
> > On 01/20/2017 12:51 PM, Zhiyong Yang wrote:
> > The rte_eth_tx_burst() function in the file Rte_ethdev.h is invoked to
> > transmit output packets on the output queue for DPDK applications as
> > follows.
> > 
> > static inline uint16_t
> > rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
> >                  struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
> > 
> > Note: The fourth parameter nb_pkts: The number of packets to transmit.
> > The rte_eth_tx_burst() function returns the number of packets it actually
> > sent. The return value equal to *nb_pkts* means that all packets have been
> > sent, and this is likely to signify that other output packets could be
> > immediately transmitted again. Applications that implement a "send as many
> > packets to transmit as possible" policy can check this specific case and
> > keep invoking the rte_eth_tx_burst() function until a value less than
> > *nb_pkts* is returned.
> > 
> > When you call TX only once in rte_eth_tx_burst, you may get different
> > behaviors from different PMDs. One problem that every DPDK user has to
> > face is that they need to take the policy into consideration at the app-
> > lication level when using any specific PMD to send the packets whether or
> > not it is necessary, which brings usage complexities and makes DPDK users
> > easily confused since they have to learn the details on TX function limit
> > of specific PMDs and have to handle the different return value: the number
> > of packets transmitted successfully for various PMDs. Some PMDs Tx func-
> > tions have a limit of sending at most 32 packets for every invoking, some
> > PMDs have another limit of at most 64 packets once, another ones have imp-
> > lemented to send as many packets to transmit as possible, etc. This will
> > easily cause wrong usage for DPDK users.
> > 
> > This patch proposes to implement the above policy in DPDK lib in order to
> > simplify the application implementation and avoid the incorrect invoking
> > as well. So, DPDK Users don't need to consider the implementation policy
> > and to write duplicated code at the application level again when sending
> > packets. In addition to it, the users don't need to know the difference of
> > specific PMD TX and can transmit the arbitrary number of packets as they
> > expect when invoking TX API rte_eth_tx_burst, then check the return value
> > to get the number of packets actually sent.
> > 
> > How to implement the policy in DPDK lib? Two solutions are proposed below.
> > 
> > Solution 1:
> > Implement the wrapper functions to remove some limits for each specific
> > PMDs as i40e_xmit_pkts_simple and ixgbe_xmit_pkts_simple do like that.
> > 
> > > IMHO, the solution is a bit better since it:
> > > 1. Does not affect other PMDs at all
> > > 2. Could be a bit faster for the PMDs which require it since has no indirect
> > >    function call on each iteration
> > > 3. No ABI change
> 
> I also would prefer solution number 1 for the reasons outlined by Andrew above.
> Also, IMO current limitation for number of packets to TX in some Intel PMD TX routines
> are sort of artificial:
> - they are not caused by any real HW limitations
> - avoiding them at PMD level shouldn't cause any performance or functional degradation.
> So I don't see any good reason why instead of fixing these limitations in
> our own PMDs we are trying to push them to the upper (rte_ethdev) layer.
> 
> Konstantin
> 
The main advantage I see is that it should make it a bit easier for
driver writers, since they have a tighter set of constraints to work
with for packet RX and Tx. The routines only have to handle requests for
packets in the range 0-N, rather than not having an upper bound on the
request. It also then saves code duplicating with having multiple
drivers having the same outer-loop code for handling arbitrarily large
requests.

No big deal to me either way though.

/Bruce
Yang, Zhiyong Jan. 21, 2017, 4:07 a.m. | #4
> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Friday, January 20, 2017 7:25 PM
> To: Andrew Rybchenko <arybchenko@solarflare.com>; Yang, Zhiyong
> <zhiyong.yang@intel.com>; dev@dpdk.org
> Cc: thomas.monjalon@6wind.com; Richardson, Bruce
> <bruce.richardson@intel.com>
> Subject: RE: [dpdk-dev] [RFC] lib/librte_ether: consistent PMD batching
> behavior
> 
> >
> > From: Andrew Rybchenko [mailto:arybchenko@solarflare.com]
> > Sent: Friday, January 20, 2017 10:26 AM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org
> > Cc: thomas.monjalon@6wind.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [RFC] lib/librte_ether: consistent PMD
> > batching behavior
> >
> > On 01/20/2017 12:51 PM, Zhiyong Yang wrote:
> > The rte_eth_tx_burst() function in the file Rte_ethdev.h is invoked to
> > transmit output packets on the output queue for DPDK applications as
> > follows.
> >
> > static inline uint16_t
> > rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
> >                  struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
> >
> > Note: The fourth parameter nb_pkts: The number of packets to transmit.
> > The rte_eth_tx_burst() function returns the number of packets it
> > actually sent. The return value equal to *nb_pkts* means that all
> > packets have been sent, and this is likely to signify that other
> > output packets could be immediately transmitted again. Applications
> > that implement a "send as many packets to transmit as possible" policy
> > can check this specific case and keep invoking the rte_eth_tx_burst()
> > function until a value less than
> > *nb_pkts* is returned.
> >
> > When you call TX only once in rte_eth_tx_burst, you may get different
> > behaviors from different PMDs. One problem that every DPDK user has to
> > face is that they need to take the policy into consideration at the
> > app- lication level when using any specific PMD to send the packets
> > whether or not it is necessary, which brings usage complexities and
> > makes DPDK users easily confused since they have to learn the details
> > on TX function limit of specific PMDs and have to handle the different
> > return value: the number of packets transmitted successfully for
> > various PMDs. Some PMDs Tx func- tions have a limit of sending at most
> > 32 packets for every invoking, some PMDs have another limit of at most
> > 64 packets once, another ones have imp- lemented to send as many
> > packets to transmit as possible, etc. This will easily cause wrong usage for
> DPDK users.
> >
> > This patch proposes to implement the above policy in DPDK lib in order
> > to simplify the application implementation and avoid the incorrect
> > invoking as well. So, DPDK Users don't need to consider the
> > implementation policy and to write duplicated code at the application
> > level again when sending packets. In addition to it, the users don't
> > need to know the difference of specific PMD TX and can transmit the
> > arbitrary number of packets as they expect when invoking TX API
> > rte_eth_tx_burst, then check the return value to get the number of
> packets actually sent.
> >
> > How to implement the policy in DPDK lib? Two solutions are proposed
> below.
> >
> > Solution 1:
> > Implement the wrapper functions to remove some limits for each
> > specific PMDs as i40e_xmit_pkts_simple and ixgbe_xmit_pkts_simple do
> like that.
> >
> > > IMHO, the solution is a bit better since it:
> > > 1. Does not affect other PMDs at all
> > > 2. Could be a bit faster for the PMDs which require it since has no
> > >indirect
> > >    function call on each iteration
> > > 3. No ABI change
> 
> I also would prefer solution number 1 for the reasons outlined by Andrew
> above.
> Also, IMO current limitation for number of packets to TX in some Intel PMD
> TX routines are sort of artificial:
> - they are not caused by any real HW limitations
> - avoiding them at PMD level shouldn't cause any performance or functional
> degradation.
> So I don't see any good reason why instead of fixing these limitations in our
> own PMDs we are trying to push them to the upper (rte_ethdev) layer.
> 
> Konstantin

 Solution 1 indeed has advantages as Andrew and Konstantin said. 

Zhiyong
Yang, Zhiyong Jan. 21, 2017, 4:13 a.m. | #5
From: Andrew Rybchenko [mailto:arybchenko@solarflare.com]
Sent: Friday, January 20, 2017 6:26 PM
To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org
Cc: thomas.monjalon@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>
Subject: Re: [dpdk-dev] [RFC] lib/librte_ether: consistent PMD batching behavior

On 01/20/2017 12:51 PM, Zhiyong Yang wrote:

The rte_eth_tx_burst() function in the file Rte_ethdev.h is invoked to

transmit output packets on the output queue for DPDK applications as

follows.



static inline uint16_t

rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,

                 struct rte_mbuf **tx_pkts, uint16_t nb_pkts);



Note: The fourth parameter nb_pkts: The number of packets to transmit.

The rte_eth_tx_burst() function returns the number of packets it actually

sent. The return value equal to *nb_pkts* means that all packets have been

sent, and this is likely to signify that other output packets could be

immediately transmitted again. Applications that implement a "send as many

packets to transmit as possible" policy can check this specific case and

keep invoking the rte_eth_tx_burst() function until a value less than

*nb_pkts* is returned.



When you call TX only once in rte_eth_tx_burst, you may get different

behaviors from different PMDs. One problem that every DPDK user has to

face is that they need to take the policy into consideration at the app-

lication level when using any specific PMD to send the packets whether or

not it is necessary, which brings usage complexities and makes DPDK users

easily confused since they have to learn the details on TX function limit

of specific PMDs and have to handle the different return value: the number

of packets transmitted successfully for various PMDs. Some PMDs Tx func-

tions have a limit of sending at most 32 packets for every invoking, some

PMDs have another limit of at most 64 packets once, another ones have imp-

lemented to send as many packets to transmit as possible, etc. This will

easily cause wrong usage for DPDK users.



This patch proposes to implement the above policy in DPDK lib in order to

simplify the application implementation and avoid the incorrect invoking

as well. So, DPDK Users don't need to consider the implementation policy

and to write duplicated code at the application level again when sending

packets. In addition to it, the users don't need to know the difference of

specific PMD TX and can transmit the arbitrary number of packets as they

expect when invoking TX API rte_eth_tx_burst, then check the return value

to get the number of packets actually sent.



How to implement the policy in DPDK lib? Two solutions are proposed below.



Solution 1:

Implement the wrapper functions to remove some limits for each specific

PMDs as i40e_xmit_pkts_simple and ixgbe_xmit_pkts_simple do like that.

IMHO, the solution is a bit better since it:
 1. Does not affect other PMDs at all
 2. Could be a bit faster for the PMDs which require it since has no indirect
     function call on each iteration
 3. No ABI change



Solution 2:

Implement the policy in the function rte_eth_tx_burst() at the ethdev lay-

er in a more consistent batching way. Make best effort to send *nb_pkts*

packets with bursts of no more than 32 by default since many DPDK TX PMDs

are using this max TX burst size(32). In addition, one data member which

defines the max TX burst size such as "uint16_t max_tx_burst_pkts;"will be

added to rte_eth_dev_data, which drivers can override if they work with

bursts of 64 or other NB(thanks for Bruce <bruce.richardson@intel.com><mailto:bruce.richardson@intel.com>'s

suggestion). This can reduce the performance impacting to the lowest limit.

I see no noticeable difference in performance, so don't mind if this is finally choosen.
Just be sure that you update all PMDs to set reasonable default values, or may be
even better, set UINT16_MAX in generic place - 0 is a bad default here.
(Lost few seconds wondering why nothing is sent and cannot stop)

Agree with you, 0 is not a good default value.   I recommend 32 by default here, of course, The driver writers
Can configure it as they expect before starting to sending packets.

Zhiyong
Adrien Mazarguil Jan. 23, 2017, 4:36 p.m. | #6
On Fri, Jan 20, 2017 at 11:48:22AM +0000, Bruce Richardson wrote:
> On Fri, Jan 20, 2017 at 11:24:40AM +0000, Ananyev, Konstantin wrote:
> > > 
> > > From: Andrew Rybchenko [mailto:arybchenko@solarflare.com]
> > > Sent: Friday, January 20, 2017 10:26 AM
> > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org
> > > Cc: thomas.monjalon@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>
> > > Subject: Re: [dpdk-dev] [RFC] lib/librte_ether: consistent PMD batching behavior
> > > 
> > > On 01/20/2017 12:51 PM, Zhiyong Yang wrote:
> > > The rte_eth_tx_burst() function in the file Rte_ethdev.h is invoked to
> > > transmit output packets on the output queue for DPDK applications as
> > > follows.
> > > 
> > > static inline uint16_t
> > > rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
> > >                  struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
> > > 
> > > Note: The fourth parameter nb_pkts: The number of packets to transmit.
> > > The rte_eth_tx_burst() function returns the number of packets it actually
> > > sent. The return value equal to *nb_pkts* means that all packets have been
> > > sent, and this is likely to signify that other output packets could be
> > > immediately transmitted again. Applications that implement a "send as many
> > > packets to transmit as possible" policy can check this specific case and
> > > keep invoking the rte_eth_tx_burst() function until a value less than
> > > *nb_pkts* is returned.
> > > 
> > > When you call TX only once in rte_eth_tx_burst, you may get different
> > > behaviors from different PMDs. One problem that every DPDK user has to
> > > face is that they need to take the policy into consideration at the app-
> > > lication level when using any specific PMD to send the packets whether or
> > > not it is necessary, which brings usage complexities and makes DPDK users
> > > easily confused since they have to learn the details on TX function limit
> > > of specific PMDs and have to handle the different return value: the number
> > > of packets transmitted successfully for various PMDs. Some PMDs Tx func-
> > > tions have a limit of sending at most 32 packets for every invoking, some
> > > PMDs have another limit of at most 64 packets once, another ones have imp-
> > > lemented to send as many packets to transmit as possible, etc. This will
> > > easily cause wrong usage for DPDK users.
> > > 
> > > This patch proposes to implement the above policy in DPDK lib in order to
> > > simplify the application implementation and avoid the incorrect invoking
> > > as well. So, DPDK Users don't need to consider the implementation policy
> > > and to write duplicated code at the application level again when sending
> > > packets. In addition to it, the users don't need to know the difference of
> > > specific PMD TX and can transmit the arbitrary number of packets as they
> > > expect when invoking TX API rte_eth_tx_burst, then check the return value
> > > to get the number of packets actually sent.
> > > 
> > > How to implement the policy in DPDK lib? Two solutions are proposed below.
> > > 
> > > Solution 1:
> > > Implement the wrapper functions to remove some limits for each specific
> > > PMDs as i40e_xmit_pkts_simple and ixgbe_xmit_pkts_simple do like that.
> > > 
> > > > IMHO, the solution is a bit better since it:
> > > > 1. Does not affect other PMDs at all
> > > > 2. Could be a bit faster for the PMDs which require it since has no indirect
> > > >    function call on each iteration
> > > > 3. No ABI change
> > 
> > I also would prefer solution number 1 for the reasons outlined by Andrew above.
> > Also, IMO current limitation for number of packets to TX in some Intel PMD TX routines
> > are sort of artificial:
> > - they are not caused by any real HW limitations
> > - avoiding them at PMD level shouldn't cause any performance or functional degradation.
> > So I don't see any good reason why instead of fixing these limitations in
> > our own PMDs we are trying to push them to the upper (rte_ethdev) layer.

For what it's worth, I agree with Konstantin. Wrappers should be as thin as
possible on top of PMD functions, they are not helpers. We could define a
set of higher level functions for this purpose though.

In the meantime, exposing and documenting PMD limitations seems safe enough.

We could assert that RX/TX burst requests larger than the size of the target
queue are unlikely to be fully met (i.e. PMDs usually do not check for
completions in the middle of a TX burst).

> > Konstantin
> > 
> The main advantage I see is that it should make it a bit easier for
> driver writers, since they have a tighter set of constraints to work
> with for packet RX and Tx. The routines only have to handle requests for
> packets in the range 0-N, rather than not having an upper bound on the
> request. It also then saves code duplicating with having multiple
> drivers having the same outer-loop code for handling arbitrarily large
> requests.
> 
> No big deal to me either way though.
> 
> /Bruce

Right but there is a cost in doing so, as unlikely() as the additional code
is. We should leave that choice to applications.
Yang, Zhiyong Feb. 7, 2017, 7:50 a.m. | #7
Hi, Adrien:

	Sorry for the late reply  due to Chinese new year.

> -----Original Message-----

> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]

> Sent: Tuesday, January 24, 2017 12:36 AM

> To: Richardson, Bruce <bruce.richardson@intel.com>

> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Andrew

> Rybchenko <arybchenko@solarflare.com>; Yang, Zhiyong

> <zhiyong.yang@intel.com>; dev@dpdk.org; thomas.monjalon@6wind.com

> Subject: Re: [dpdk-dev] [RFC] lib/librte_ether: consistent PMD batching

> behavior

> 

> On Fri, Jan 20, 2017 at 11:48:22AM +0000, Bruce Richardson wrote:

> > On Fri, Jan 20, 2017 at 11:24:40AM +0000, Ananyev, Konstantin wrote:

> > > >

> > > > From: Andrew Rybchenko [mailto:arybchenko@solarflare.com]

> > > > Sent: Friday, January 20, 2017 10:26 AM

> > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org

> > > > Cc: thomas.monjalon@6wind.com; Richardson, Bruce

> > > > <bruce.richardson@intel.com>; Ananyev, Konstantin

> > > > <konstantin.ananyev@intel.com>

> > > > Subject: Re: [dpdk-dev] [RFC] lib/librte_ether: consistent PMD

> > > > batching behavior

> > > >

> > > > On 01/20/2017 12:51 PM, Zhiyong Yang wrote:

> > > > The rte_eth_tx_burst() function in the file Rte_ethdev.h is

> > > > invoked to transmit output packets on the output queue for DPDK

> > > > applications as follows.

> > > >

> > > > static inline uint16_t

> > > > rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,

> > > >                  struct rte_mbuf **tx_pkts, uint16_t nb_pkts);

> > > >

> > > > Note: The fourth parameter nb_pkts: The number of packets to

> transmit.

> > > > The rte_eth_tx_burst() function returns the number of packets it

> > > > actually sent. The return value equal to *nb_pkts* means that all

> > > > packets have been sent, and this is likely to signify that other

> > > > output packets could be immediately transmitted again.

> > > > Applications that implement a "send as many packets to transmit as

> > > > possible" policy can check this specific case and keep invoking

> > > > the rte_eth_tx_burst() function until a value less than

> > > > *nb_pkts* is returned.

> > > >

> > > > When you call TX only once in rte_eth_tx_burst, you may get

> > > > different behaviors from different PMDs. One problem that every

> > > > DPDK user has to face is that they need to take the policy into

> > > > consideration at the app- lication level when using any specific

> > > > PMD to send the packets whether or not it is necessary, which

> > > > brings usage complexities and makes DPDK users easily confused

> > > > since they have to learn the details on TX function limit of

> > > > specific PMDs and have to handle the different return value: the

> > > > number of packets transmitted successfully for various PMDs. Some

> > > > PMDs Tx func- tions have a limit of sending at most 32 packets for

> > > > every invoking, some PMDs have another limit of at most 64 packets

> > > > once, another ones have imp- lemented to send as many packets to

> transmit as possible, etc. This will easily cause wrong usage for DPDK users.

> > > >

> > > > This patch proposes to implement the above policy in DPDK lib in

> > > > order to simplify the application implementation and avoid the

> > > > incorrect invoking as well. So, DPDK Users don't need to consider

> > > > the implementation policy and to write duplicated code at the

> > > > application level again when sending packets. In addition to it,

> > > > the users don't need to know the difference of specific PMD TX and

> > > > can transmit the arbitrary number of packets as they expect when

> > > > invoking TX API rte_eth_tx_burst, then check the return value to get

> the number of packets actually sent.

> > > >

> > > > How to implement the policy in DPDK lib? Two solutions are proposed

> below.

> > > >

> > > > Solution 1:

> > > > Implement the wrapper functions to remove some limits for each

> > > > specific PMDs as i40e_xmit_pkts_simple and ixgbe_xmit_pkts_simple

> do like that.

> > > >

> > > > > IMHO, the solution is a bit better since it:

> > > > > 1. Does not affect other PMDs at all

> > > > > 2. Could be a bit faster for the PMDs which require it since has

> > > > >no indirect

> > > > >    function call on each iteration

> > > > > 3. No ABI change

> > >

> > > I also would prefer solution number 1 for the reasons outlined by Andrew

> above.

> > > Also, IMO current limitation for number of packets to TX in some

> > > Intel PMD TX routines are sort of artificial:

> > > - they are not caused by any real HW limitations

> > > - avoiding them at PMD level shouldn't cause any performance or

> functional degradation.

> > > So I don't see any good reason why instead of fixing these

> > > limitations in our own PMDs we are trying to push them to the upper

> (rte_ethdev) layer.

> 

> For what it's worth, I agree with Konstantin. Wrappers should be as thin as

> possible on top of PMD functions, they are not helpers. We could define a

> set of higher level functions for this purpose though.

> 

> In the meantime, exposing and documenting PMD limitations seems safe

> enough.

> 

> We could assert that RX/TX burst requests larger than the size of the target

> queue are unlikely to be fully met (i.e. PMDs usually do not check for

> completions in the middle of a TX burst).


As a tool,  it is very important for its users to easily consume it and make it work
in a right way.  Sort of artificial limits will make things look like a little confused  and
make some users probably get into trouble when writing drivers. 
Why do we correct it and make it easier?  :)

Zhiyong

Patch

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 1c356c1..6fa83cf 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -1712,6 +1712,9 @@  struct rte_eth_dev_data {
 	uint32_t min_rx_buf_size;
 	/**< Common rx buffer size handled by all queues */
 
+	uint16_t max_rx_burst_pkts;
+	uint16_t max_tx_burst_pkts;
+
 	uint64_t rx_mbuf_alloc_failed; /**< RX ring mbuf allocation failures. */
 	struct ether_addr* mac_addrs;/**< Device Ethernet Link address. */
 	uint64_t mac_pool_sel[ETH_NUM_RECEIVE_MAC_ADDR];
@@ -2695,11 +2698,15 @@  int rte_eth_dev_set_vlan_pvid(uint8_t port_id, uint16_t pvid, int on);
  *   of pointers to *rte_mbuf* structures effectively supplied to the
  *   *rx_pkts* array.
  */
+
 static inline uint16_t
 rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
 		 struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
 {
 	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	int16_t nb_rx = 0;
+	uint16_t pkts = 0;
+	uint16_t rx_nb_pkts = nb_pkts;
 
 #ifdef RTE_LIBRTE_ETHDEV_DEBUG
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
@@ -2710,8 +2717,20 @@  rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
 		return 0;
 	}
 #endif
-	int16_t nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
+	if (likely(nb_pkts <= dev->data->max_rx_burst_pkts))
+		return (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
 			rx_pkts, nb_pkts);
+	while (rx_nb_pkts) {
+		uint16_t num_burst = RTE_MIN(nb_pkts,
+					      dev->data->max_rx_burst_pkts);
+
+		pkts = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
+						&rx_pkts[nb_rx], num_burst);
+		nb_rx += pkts;
+		rx_nb_pkts -= pkts;
+		if (pkts < num_burst)
+			break;
+	}
 
 #ifdef RTE_ETHDEV_RXTX_CALLBACKS
 	struct rte_eth_rxtx_callback *cb = dev->post_rx_burst_cbs[queue_id];
@@ -2833,11 +2852,13 @@  rte_eth_rx_descriptor_done(uint8_t port_id, uint16_t queue_id, uint16_t offset)
  *   the transmit ring. The return value can be less than the value of the
  *   *tx_pkts* parameter when the transmit ring is full or has been filled up.
  */
+
 static inline uint16_t
 rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
 		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 {
 	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	uint16_t nb_tx = 0;
 
 #ifdef RTE_LIBRTE_ETHDEV_DEBUG
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
@@ -2860,8 +2881,24 @@  rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
 		} while (cb != NULL);
 	}
 #endif
+	if (likely(nb_pkts <= dev->data->max_tx_burst_pkts))
+		return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id],
+						tx_pkts, nb_pkts);
+
+	while (nb_pkts) {
+		uint16_t num_burst = RTE_MIN(nb_pkts,
+					     dev->data->max_tx_burst_pkts);
+		uint16_t pkts;
+
+		pkts = (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id],
+						&tx_pkts[nb_tx], num_burst);
+		nb_tx += pkts;
+		nb_pkts -= pkts;
+		if (pkts < num_burst)
+			break;
+	}
 
-	return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts);
+	return nb_tx;
 }
 
 /**