[1/2] net/bonding: support Tx prepare for bonding

Message ID: 1619171202-28486-2-git-send-email-tangchengchang@huawei.com (mailing list archive)
State: Changes Requested, archived
Delegated to: Andrew Rybchenko
Series: add Tx prepare support for bonding device

Checks

ci/checkpatch: warning (coding style issues)

Commit Message

Chengchang Tang April 23, 2021, 9:46 a.m. UTC
  To use HW offload capabilities (e.g. checksum and TSO) in the Tx
direction, upper-layer users need to call rte_eth_tx_prepare() to do
some adjustments to the packets before sending them (e.g. processing
pseudo headers when Tx checksum offload is enabled). But the tx_prepare
callback of the bond driver is not implemented, so the related
offloads cannot be used unless the upper-layer users process the packets
properly in their own applications, which is bad for portability.

However, it is difficult to design the tx_prepare callback for the
bonding driver, because when a bonded device sends packets, it
distributes them to different slave devices based on the real-time
link status and bonding mode. That is, it is very difficult for the
bonding device to determine which slave device's prepare function should
be invoked. In addition, if the link status changes after the packets are
prepared, the packets may fail to be sent because the packet distribution
may change.

So, in this patch, the tx_prepare callback of the bonding driver is not
implemented. Instead, rte_eth_tx_prepare() will be called for all
fast-path packets in modes 0, 1, 2, 4, 5 and 6. In this way, all
Tx offloads can be processed correctly for all NIC devices in these modes.
If tx_prepare is not required in some cases, the slave PMD's tx_prepare
pointer should be NULL, so that rte_eth_tx_prepare() is just a NOOP and
the impact on performance is very limited. It is the responsibility of
the slave PMDs to decide when the real tx_prepare needs to be used. The
information from dev_configure/queue_setup is sufficient for them to
make these decisions.
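
For reference, here is a simplified sketch of that NOOP path, paraphrased
from the rte_eth_tx_prepare() inline in rte_ethdev.h (debug checks
omitted):

    static inline uint16_t
    rte_eth_tx_prepare(uint16_t port_id, uint16_t queue_id,
                       struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
    {
            struct rte_eth_dev *dev = &rte_eth_devices[port_id];

            /* PMD registered no prepare hook: the burst is ready as-is */
            if (dev->tx_pkt_prepare == NULL)
                    return nb_pkts;

            return (*dev->tx_pkt_prepare)(dev->data->tx_queues[queue_id],
                                          tx_pkts, nb_pkts);
    }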

Note:
rte_eth_tx_prepare() is not added to bond mode 3 (Broadcast). This is
because in broadcast mode a packet needs to be sent by all slave ports.
Different PMDs process the packets differently in tx_prepare, so the
sent packet may be incorrect.

Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
---
 drivers/net/bonding/rte_eth_bond.h     |  1 -
 drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
 2 files changed, 24 insertions(+), 5 deletions(-)

--
2.7.4
  

Comments

Andrew Rybchenko June 8, 2021, 9:49 a.m. UTC | #1
"for bonding" is redundant in the summary since it is already "net/bonding".

On 4/23/21 12:46 PM, Chengchang Tang wrote:
> [commit message and rte_eth_bond.h diff snipped]
> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
> index 2e9cea5..84af348 100644
> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>  	/* Send packet burst on each slave device */
>  	for (i = 0; i < num_of_slaves; i++) {
>  		if (slave_nb_pkts[i] > 0) {
> +			int nb_prep_pkts;
> +
> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
> +					bd_tx_q->queue_id, slave_bufs[i],
> +					slave_nb_pkts[i]);
> +

Shouldn't it be called iff queue Tx offloads are not zero?
That would avoid performance degradation when no Tx offloads
are enabled. Same in all cases below.
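
For illustration, a minimal sketch of that suggestion (assuming the
configured per-queue offloads are reachable via the bond Tx queue, e.g.
a bd_tx_q->tx_conf.offloads field):

    uint16_t nb_prep_pkts = slave_nb_pkts[i];

    /* Skip the prepare stage entirely when the queue has no Tx offloads */
    if (bd_tx_q->tx_conf.offloads != 0)
            nb_prep_pkts = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
                            slave_bufs[i], slave_nb_pkts[i]);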

>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> -					slave_bufs[i], slave_nb_pkts[i]);
> +					slave_bufs[i], nb_prep_pkts);

In fact there is a problem here, and a really big one.
Tx prepare may fail and return fewer packets. Tx prepare
of some packet may always fail. If the application tries to
send packets in a loop until success, it will be a
forever loop here. Since the application calls Tx burst,
it is 100% legal behaviour of the function to return 0
if the Tx ring is full. It is not an error indication.
However, in the case of Tx prepare it is an error
indication.

Should we change the Tx burst description and enforce callers
to check rte_errno? It sounds like a major change...
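
For illustration, the retry pattern in question (port_id, queue_id, pkts
and nb_pkts are placeholders) - legal when the ring is full, but a
livelock once prepare keeps rejecting the packet at pkts[sent]:

    uint16_t sent = 0;

    /* Classic "retry until everything is sent" loop: terminates once a
     * full Tx ring drains, never terminates for a packet that the
     * slave's prepare stage always rejects.
     */
    while (sent < nb_pkts)
            sent += rte_eth_tx_burst(port_id, queue_id,
                            &pkts[sent], nb_pkts - sent);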

[snip]
  
Chengchang Tang June 9, 2021, 6:42 a.m. UTC | #2
Hi, Andrew and Ferruh

On 2021/6/8 17:49, Andrew Rybchenko wrote:
> "for bonding" is redundant in the summary since it is already "net/bonding".
> 
> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>> [commit message and rte_eth_bond.h diff snipped]
>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>  	/* Send packet burst on each slave device */
>>  	for (i = 0; i < num_of_slaves; i++) {
>>  		if (slave_nb_pkts[i] > 0) {
>> +			int nb_prep_pkts;
>> +
>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>> +					bd_tx_q->queue_id, slave_bufs[i],
>> +					slave_nb_pkts[i]);
>> +
> 
> Shouldn't it be called iff queue Tx offloads are not zero?
> It will allow to decrease performance degradation if no
> Tx offloads are enabled. Same in all cases below.

Regarding this point, it has been discussed in the previous RFC:
https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/

According to the TX_OFFLOAD status of the current device, PMDs can determine
whether tx_prepare is currently needed. If it is not needed, they set
tx_pkt_prepare to NULL, so that the actual tx_prepare processing is skipped
directly in rte_eth_tx_prepare().

> 
>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>> -					slave_bufs[i], slave_nb_pkts[i]);
>> +					slave_bufs[i], nb_prep_pkts);
> 
> In fact it is a problem here and really big problems.
> Tx prepare may fail and return less packets. Tx prepare
> of some packet may always fail. If application tries to
> send packets in a loop until success, it will be a
> forever loop here. Since application calls Tx burst,
> it is 100% legal behaviour of the function to return 0
> if Tx ring is full. It is not an error indication.
> However, in the case of Tx prepare it is an error
> indication.
> 
> Should we change Tx burst description and enforce callers
> to check for rte_errno? It sounds like a major change...
> 

I agree that a failure caused by the Tx ring being full is legal behaviour.
But what about failures caused by other reasons? At present, it is possible
for some PMDs to fail during tx_burst for other reasons, and in such cases
repeated attempts to send will also fail.

I'm not sure whether all PMDs need to support the behaviour of sending
packets in a loop until it succeeds. If not, I think the current problem can
be brought to the user's attention by adding a description to the bonding
documentation. If it is necessary, the description of tx_burst should also
gain related instructions, so that PMD developers can better understand how
tx_burst should be designed, e.g. putting all hardware-related constraint
checks into tx_prepare. Another prerequisite for the above behaviour is that
the packets must have been prepared (i.e. checked by rte_eth_tx_prepare());
otherwise, sending may also fail. This means that we have to use
rte_eth_tx_prepare() in more scenarios.

What's Ferruh's opinion on this?

> [snip]
> 
> .
>
  
Andrew Rybchenko June 9, 2021, 9:35 a.m. UTC | #3
On 6/9/21 9:42 AM, Chengchang Tang wrote:
> Hi, Andrew and Ferruh
> 
> On 2021/6/8 17:49, Andrew Rybchenko wrote:
>> "for bonding" is redundant in the summary since it is already "net/bonding".
>>
>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>> [commit message and rte_eth_bond.h diff snipped]
>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>>  	/* Send packet burst on each slave device */
>>>  	for (i = 0; i < num_of_slaves; i++) {
>>>  		if (slave_nb_pkts[i] > 0) {
>>> +			int nb_prep_pkts;
>>> +
>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>>> +					bd_tx_q->queue_id, slave_bufs[i],
>>> +					slave_nb_pkts[i]);
>>> +
>>
>> Shouldn't it be called iff queue Tx offloads are not zero?
>> It will allow to decrease performance degradation if no
>> Tx offloads are enabled. Same in all cases below.
> 
> Regarding this point, it has been discussed in the previous RFC:
> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
> 
> According to the TX_OFFLOAD status of the current device, PMDs can determine
> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
> to NULL, so that the actual tx_prepare processing will be skipped directly in
> rte_eth_tx_prepare().

I still think that the following is right:
No Tx offloads at all => Tx prepare is not necessary

Am I wrong?

>>
>>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>> -					slave_bufs[i], slave_nb_pkts[i]);
>>> +					slave_bufs[i], nb_prep_pkts);
>>
>> In fact it is a problem here and really big problems.
>> Tx prepare may fail and return less packets. Tx prepare
>> of some packet may always fail. If application tries to
>> send packets in a loop until success, it will be a
>> forever loop here. Since application calls Tx burst,
>> it is 100% legal behaviour of the function to return 0
>> if Tx ring is full. It is not an error indication.
>> However, in the case of Tx prepare it is an error
>> indication.
>>
>> Should we change Tx burst description and enforce callers
>> to check for rte_errno? It sounds like a major change...
>>
> 
> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
> But what about the failure caused by other reasons? At present, it is possible
> for some PMDs to fail during tx_burst due to other reasons. In this case,
> repeated tries to send will also fail.

If so, the packet should simply be dropped by Tx burst and
Tx burst should move on. If a packet cannot be transmitted,
it must be dropped (counted) and Tx burst should move to the
next packet.

> I'm not sure if all PMDs need to support the behavior of sending packets in a
> loop until it succeeds. If not, I think the current problem can be reminded to
> the user by adding a description to the bonding. If it is necessary, I think the
> description of tx_burst should also add related instructions, so that the developers
> of PMDs can better understand how tx_burst should be designed, such as putting all
> hardware-related constraint checks into tx_prepare. And another prerequisite for
> the above behavior is that the packets must be prepared (i.e. checked by
> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
> to use rte_eth_tx_prepare() in more scenarios.

IMHO any PMD-specific behaviour is a nightmare to an application
developer and must be avoided. Ideally an application should not
care if it is running on top of tap, virtio, failsafe or
bonding. It should talk to ethdev in terms of the ethdev API,
that's it. I know that net/bonding is designed so that the
application should know about it, but IMHO the places where it
requires that knowledge must be minimized to make applications
more portable across various PMDs/HW.

I think that the only sensible solution for the above problem is
to skip a packet which prepare dislikes, count it as dropped,
and try to prepare/transmit subsequent packets.
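
For illustration, a sketch of that drop-and-continue policy (slave,
queue_id, bufs and nb_pkts are placeholders; tx_drop_cnt is a
hypothetical stats counter). rte_eth_tx_prepare() returns how many
packets from the head of the array were accepted, so bufs[nb_prep] is
the first rejected mbuf:

    uint16_t nb_prep = rte_eth_tx_prepare(slave, queue_id, bufs, nb_pkts);

    while (nb_prep < nb_pkts) {
            /* bufs[nb_prep] failed prepare: drop it, count it, move on */
            rte_pktmbuf_free(bufs[nb_prep]);
            tx_drop_cnt++;
            nb_pkts--;
            memmove(&bufs[nb_prep], &bufs[nb_prep + 1],
                            (nb_pkts - nb_prep) * sizeof(bufs[0]));
            nb_prep += rte_eth_tx_prepare(slave, queue_id,
                            &bufs[nb_prep], nb_pkts - nb_prep);
    }
    nb_sent = rte_eth_tx_burst(slave, queue_id, bufs, nb_prep);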

It is an interesting effect of doing Tx prepare just before
Tx burst inside the bonding PMD: if Tx burst fails to send
something because the ring is full, a number of packets will
be processed by Tx prepare again and again. I guess it is
unavoidable.

> What's Ferruh's opinion on this?
> 
>> [snip]
  
Ananyev, Konstantin June 9, 2021, 10:25 a.m. UTC | #4
> > On 4/23/21 12:46 PM, Chengchang Tang wrote:
> >> [commit message and rte_eth_bond.h diff snipped]
> >> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
> >>  	/* Send packet burst on each slave device */
> >>  	for (i = 0; i < num_of_slaves; i++) {
> >>  		if (slave_nb_pkts[i] > 0) {
> >> +			int nb_prep_pkts;
> >> +
> >> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
> >> +					bd_tx_q->queue_id, slave_bufs[i],
> >> +					slave_nb_pkts[i]);
> >> +
> >
> > Shouldn't it be called iff queue Tx offloads are not zero?
> > It will allow to decrease performance degradation if no
> > Tx offloads are enabled. Same in all cases below.
> 
> Regarding this point, it has been discussed in the previous RFC:
> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
> 
> According to the TX_OFFLOAD status of the current device, PMDs can determine
> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
> to NULL, so that the actual tx_prepare processing will be skipped directly in
> rte_eth_tx_prepare().
> 
> >
> >>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >> -					slave_bufs[i], slave_nb_pkts[i]);
> >> +					slave_bufs[i], nb_prep_pkts);
> >
> > In fact it is a problem here and really big problems.
> > Tx prepare may fail and return less packets. Tx prepare
> > of some packet may always fail. If application tries to
> > send packets in a loop until success, it will be a
> > forever loop here. Since application calls Tx burst,
> > it is 100% legal behaviour of the function to return 0
> > if Tx ring is full. It is not an error indication.
> > However, in the case of Tx prepare it is an error
> > indication.

Yes, that sounds like a problem and existing apps might be affected.

> >
> > Should we change Tx burst description and enforce callers
> > to check for rte_errno? It sounds like a major change...
> >

Agree, rte_errno for tx_burst() is probably the simplest and sanest way,
but yes, it is a change in behaviour and apps will need to be updated.
Another option for the bond PMD - just silently free mbufs for which prepare()
fails (and probably update some stats counter).
Again it is a change in behaviour, but now just for one PMD, with Tx offloads enabled.
Also, as I can see, some tx_burst() functions for that PMD already free packets silently:
bond_ethdev_tx_burst_alb(), bond_ethdev_tx_burst_broadcast().
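
For illustration, what callers would have to do under that proposed (not
current) rte_errno convention for tx_burst():

    /* Hypothetical caller pattern: only valid if tx_burst() were
     * redefined to set rte_errno on hard per-packet failures.
     */
    rte_errno = 0;
    uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);

    if (nb_tx < nb_pkts) {
            if (rte_errno == 0) {
                    /* Tx ring full: retry the remainder later */
            } else {
                    /* pkts[nb_tx] can never be sent: drop and count it */
                    rte_pktmbuf_free(pkts[nb_tx]);
            }
    }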

Actually another question - why does the patch add tx_prepare() only to some
Tx modes but not all?
Is that intended?

> 
> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
> But what about the failure caused by other reasons? At present, it is possible
> for some PMDs to fail during tx_burst due to other reasons. In this case,
> repeated tries to send will also fail.
> 
> I'm not sure if all PMDs need to support the behavior of sending packets in a
> loop until it succeeds. If not, I think the current problem can be reminded to
> the user by adding a description to the bonding. If it is necessary, I think the
> description of tx_burst should also add related instructions, so that the developers
> of PMDs can better understand how tx_burst should be designed, such as putting all
> hardware-related constraint checks into tx_prepare. And another prerequisite for
> the above behavior is that the packets must be prepared (i.e. checked by
> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
> to use rte_eth_tx_prepare() in more scenarios.
> 
> What's Ferruh's opinion on this?
> 
> > [snip]
> >
> > .
> >
  
Chengchang Tang June 10, 2021, 6:46 a.m. UTC | #5
On 2021/6/9 18:25, Ananyev, Konstantin wrote:
>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>>> [commit message and rte_eth_bond.h diff snipped]
>>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>>>  	/* Send packet burst on each slave device */
>>>>  	for (i = 0; i < num_of_slaves; i++) {
>>>>  		if (slave_nb_pkts[i] > 0) {
>>>> +			int nb_prep_pkts;
>>>> +
>>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>>>> +					bd_tx_q->queue_id, slave_bufs[i],
>>>> +					slave_nb_pkts[i]);
>>>> +
>>>
>>> Shouldn't it be called iff queue Tx offloads are not zero?
>>> It will allow to decrease performance degradation if no
>>> Tx offloads are enabled. Same in all cases below.
>>
>> Regarding this point, it has been discussed in the previous RFC:
>> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
>>
>> According to the TX_OFFLOAD status of the current device, PMDs can determine
>> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
>> to NULL, so that the actual tx_prepare processing will be skipped directly in
>> rte_eth_tx_prepare().
>>
>>>
>>>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>> -					slave_bufs[i], slave_nb_pkts[i]);
>>>> +					slave_bufs[i], nb_prep_pkts);
>>>
>>> In fact it is a problem here and really big problems.
>>> Tx prepare may fail and return less packets. Tx prepare
>>> of some packet may always fail. If application tries to
>>> send packets in a loop until success, it will be a
>>> forever loop here. Since application calls Tx burst,
>>> it is 100% legal behaviour of the function to return 0
>>> if Tx ring is full. It is not an error indication.
>>> However, in the case of Tx prepare it is an error
>>> indication.
> 
> Yes, that sounds like a problem and existing apps might be affected.
> 
>>>
>>> Should we change Tx burst description and enforce callers
>>> to check for rte_errno? It sounds like a major change...
>>>
> 
> Agree, rte_errno for tx_burst() is probably a simplest and sanest way,
> but yes, it is a change in behaviour and apps will need to be updated.  
> Another option for bond PMD - just silently free mbufs for which prepare()
> fails (and probably update some stats counter).
> Again it is a change in behaviour, but now just for one PMD, with tx offloads enabled.
> Also as, I can see some tx_burst() function for that PMD already free packets silently:
> bond_ethdev_tx_burst_alb(), bond_ethdev_tx_burst_broadcast().
> 
> Actually another question - why the patch adds tx_prepare() only to some
> TX modes but not all?
> Is that itended? 
> 

Yes. Currently, I have no idea how to perform tx_prepare() in broadcast mode with
limited impact on performance. In broadcast mode, the same packets will be sent on
several devices. In this process, we only update the ref_cnt of the mbufs, but do
not copy the packets. As we know, tx_prepare() may change the packet data, so it may
cause problems if we perform tx_prepare() several times on the same packet.
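
For illustration, a simplified sketch of the broadcast Tx path (error
handling for partially-sent bursts omitted), which shows why a per-slave
tx_prepare() would rewrite shared data:

    /* Every slave transmits the *same* mbufs: only the reference count
     * grows; the packet data is never copied per slave.
     */
    for (i = 0; i < nb_pkts; i++)
            rte_pktmbuf_refcnt_update(bufs[i], num_of_slaves - 1);

    for (i = 0; i < num_of_slaves; i++)
            rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id, bufs, nb_pkts);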

>>
>> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
>> But what about the failure caused by other reasons? At present, it is possible
>> for some PMDs to fail during tx_burst due to other reasons. In this case,
>> repeated tries to send will also fail.
>>
>> I'm not sure if all PMDs need to support the behavior of sending packets in a
>> loop until it succeeds. If not, I think the current problem can be reminded to
>> the user by adding a description to the bonding. If it is necessary, I think the
>> description of tx_burst should also add related instructions, so that the developers
>> of PMDs can better understand how tx_burst should be designed, such as putting all
>> hardware-related constraint checks into tx_prepare. And another prerequisite for
>> the above behavior is that the packets must be prepared (i.e. checked by
>> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
>> to use rte_eth_tx_prepare() in more scenarios.
>>
>> What's Ferruh's opinion on this?
>>
>>> [snip]
>>>
>>> .
>>>
>
  
Chengchang Tang June 10, 2021, 7:32 a.m. UTC | #6
On 2021/6/9 17:35, Andrew Rybchenko wrote:
> On 6/9/21 9:42 AM, Chengchang Tang wrote:
>> Hi, Andrew and Ferruh
>>
>> On 2021/6/8 17:49, Andrew Rybchenko wrote:
>>> "for bonding" is redundant in the summary since it is already "net/bonding".
>>>
>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>>> [commit message and rte_eth_bond.h diff snipped]
>>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>>>  	/* Send packet burst on each slave device */
>>>>  	for (i = 0; i < num_of_slaves; i++) {
>>>>  		if (slave_nb_pkts[i] > 0) {
>>>> +			int nb_prep_pkts;
>>>> +
>>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>>>> +					bd_tx_q->queue_id, slave_bufs[i],
>>>> +					slave_nb_pkts[i]);
>>>> +
>>>
>>> Shouldn't it be called iff queue Tx offloads are not zero?
>>> It will allow to decrease performance degradation if no
>>> Tx offloads are enabled. Same in all cases below.
>>
>> Regarding this point, it has been discussed in the previous RFC:
>> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
>>
>> According to the TX_OFFLOAD status of the current device, PMDs can determine
>> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
>> to NULL, so that the actual tx_prepare processing will be skipped directly in
>> rte_eth_tx_prepare().
> 
> I still think that the following is right:
> No Tx offloads at all => Tx prepare is not necessary
> 
> Am I wrong?
> 

Letting PMDs determine whether tx_prepare() needs to be done can reduce the
performance loss in more scenarios. For example, some offloads do not need a Tx
prepare, and PMDs could set tx_prepare to NULL in such scenarios. Even if
rte_eth_tx_prepare() is called, it will not invoke the tx_prepare callback but
simply return. In this case, there is only one piece of judgment logic. If we
instead check whether tx_offloads are non-zero, one more logical judgment is added.

Of course, some PMDs currently do not optimize tx_prepare, which may have a
performance impact. We hope to force them to optimize tx_prepare in this way, just
as they optimize tx_burst. This makes it easier for users to use tx_prepare(), who
no longer need to worry that using tx_prepare() will introduce unnecessary
performance degradation.

IMHO tx_prepare() should be usable in all scenarios, and its impact on performance
should be optimized by the PMDs. If we leave it to the application to consider when
it should and should not be used, in many cases it will not be used, which then
introduces problems.
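
For illustration, a sketch of the intended slave PMD logic at queue setup
time (my_pmd_tx_prepare is a hypothetical callback; the exact offload mask
is per-PMD):

    /* Choose the hook once, so rte_eth_tx_prepare() costs a single
     * NULL check on the fast path when nothing relevant is enabled.
     */
    if (txq_offloads & (DEV_TX_OFFLOAD_TCP_CKSUM | DEV_TX_OFFLOAD_TCP_TSO))
            dev->tx_pkt_prepare = my_pmd_tx_prepare;
    else
            dev->tx_pkt_prepare = NULL; /* rte_eth_tx_prepare() is a NOOP */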

>>>
>>>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>> -					slave_bufs[i], slave_nb_pkts[i]);
>>>> +					slave_bufs[i], nb_prep_pkts);
>>>
>>> In fact it is a problem here and really big problems.
>>> Tx prepare may fail and return less packets. Tx prepare
>>> of some packet may always fail. If application tries to
>>> send packets in a loop until success, it will be a
>>> forever loop here. Since application calls Tx burst,
>>> it is 100% legal behaviour of the function to return 0
>>> if Tx ring is full. It is not an error indication.
>>> However, in the case of Tx prepare it is an error
>>> indication.
>>>
>>> Should we change Tx burst description and enforce callers
>>> to check for rte_errno? It sounds like a major change...
>>>
>>
>> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
>> But what about the failure caused by other reasons? At present, it is possible
>> for some PMDs to fail during tx_burst due to other reasons. In this case,
>> repeated tries to send will also fail.
> 
> If so, packet should be simply dropped by Tx burst and Tx burst
> should move on. If a packet cannot be transmitted, it must be
> dropped (counted) and Tx burst should move to the next packet.
> 
>> I'm not sure if all PMDs need to support the behavior of sending packets in a
>> loop until it succeeds. If not, I think the current problem can be reminded to
>> the user by adding a description to the bonding. If it is necessary, I think the
>> description of tx_burst should also add related instructions, so that the developers
>> of PMDs can better understand how tx_burst should be designed, such as putting all
>> hardware-related constraint checks into tx_prepare. And another prerequisite for
>> the above behavior is that the packets must be prepared (i.e. checked by
>> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
>> to use rte_eth_tx_prepare() in more scenarios.
> 
> IMHO any PMD specific behaviour is a nightmare to application
> developer and must be avoided. Ideally application should not
> care if it is running on top of tap, virtio, failsafe or
> bonding. It should talk to ethdev in terms of ethdev API that's
> it. I know that net/bonding is designed that application should
> know about it, but IMHO the places where it requires the
> knowledge must be minimized to make applications more portable
> across various PMDs/HW.
> 
> I think that the only sensible solution for above problem is
> to skip a packet which prepare dislikes. count it as dropped
> and try to prepare/transmit subsequent packets.

Agree, I will fix this in the next version.

> 
> It is an interesting effect of the Tx prepare just before
> Tx burst inside bonding PMD. If Tx burst fails to send
> something because ring is full, a number of packets will
> be processed by Tx prepare again and again. I guess it is
> unavoidable.
> 
>> What's Ferruh's opinion on this?
>>
>>> [snip]
> 
> 
> .
>
  
Ananyev, Konstantin June 14, 2021, 11:36 a.m. UTC | #7
> On 2021/6/9 18:25, Ananyev, Konstantin wrote:
> >>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
> >>>> [commit message and rte_eth_bond.h diff snipped]
> >>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
> >>>>  	/* Send packet burst on each slave device */
> >>>>  	for (i = 0; i < num_of_slaves; i++) {
> >>>>  		if (slave_nb_pkts[i] > 0) {
> >>>> +			int nb_prep_pkts;
> >>>> +
> >>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
> >>>> +					bd_tx_q->queue_id, slave_bufs[i],
> >>>> +					slave_nb_pkts[i]);
> >>>> +
> >>>
> >>> Shouldn't it be called iff queue Tx offloads are not zero?
> >>> It will allow to decrease performance degradation if no
> >>> Tx offloads are enabled. Same in all cases below.
> >>
> >> Regarding this point, it has been discussed in the previous RFC:
> >> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
> >>
> >> According to the TX_OFFLOAD status of the current device, PMDs can determine
> >> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
> >> to NULL, so that the actual tx_prepare processing will be skipped directly in
> >> rte_eth_tx_prepare().
> >>
> >>>
> >>>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >>>> -					slave_bufs[i], slave_nb_pkts[i]);
> >>>> +					slave_bufs[i], nb_prep_pkts);
> >>>
> >>> In fact it is a problem here and really big problems.
> >>> Tx prepare may fail and return less packets. Tx prepare
> >>> of some packet may always fail. If application tries to
> >>> send packets in a loop until success, it will be a
> >>> forever loop here. Since application calls Tx burst,
> >>> it is 100% legal behaviour of the function to return 0
> >>> if Tx ring is full. It is not an error indication.
> >>> However, in the case of Tx prepare it is an error
> >>> indication.
> >
> > Yes, that sounds like a problem and existing apps might be affected.
> >
> >>>
> >>> Should we change Tx burst description and enforce callers
> >>> to check for rte_errno? It sounds like a major change...
> >>>
> >
> > Agree, rte_errno for tx_burst() is probably a simplest and sanest way,
> > but yes, it is a change in behaviour and apps will need to be updated.
> > Another option for bond PMD - just silently free mbufs for which prepare()
> > fails (and probably update some stats counter).
> > Again it is a change in behaviour, but now just for one PMD, with tx offloads enabled.
> > Also as, I can see some tx_burst() function for that PMD already free packets silently:
> > bond_ethdev_tx_burst_alb(), bond_ethdev_tx_burst_broadcast().
> >
> > Actually another question - why the patch adds tx_prepare() only to some
> > TX modes but not all?
> > Is that itended?
> >
> 
> Yes. Currently, I have no ideal to perform tx_prepare() in broadcast mode with limited
> impact on performance. In broadcast mode, same packets will be send in several devices.
> In this process, we only update the ref_cnt of mbufs, but no copy of packets.
> As we know,
> tx_prepare() may change the data, so it may cause some problem if we perform tx_prepare()
> several times on the same packet.

You mean tx_prepare() for the second dev can invalidate changes made by
tx_prepare() for the first dev?
I suppose in theory it is possible, even if it is probably not the case right
now in practice (at least I am not aware of such cases).
Actually that's an interesting topic - the same can happen even with a user
implementing multicast on their own (see examples/ipv4_multicast/).
I think these new limitations have to be documented clearly (at least).
We also probably need extra changes to the bond device's dev_configure()/dev_get_info():
to check the currently selected mode and, based on that, allow/reject Tx offloads.
The question arises (again): how do we figure out for which Tx offloads
dev->tx_prepare() modifies the packet, and for which it does not?
Any thoughts here?
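
For illustration, a sketch of such a check (bond_prep_modifying_offloads
is a hypothetical mask; deciding what goes into it is exactly the open
question above):

    /* Hypothetical bond dev_configure() check: refuse Tx offloads whose
     * prepare step rewrites packet data while the mode (broadcast)
     * shares the same mbufs across all slaves.
     */
    if (internals->mode == BONDING_MODE_BROADCAST &&
        (dev->data->dev_conf.txmode.offloads &
         bond_prep_modifying_offloads) != 0)
            return -EINVAL;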

> 
> >>
> >> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
> >> But what about the failure caused by other reasons? At present, it is possible
> >> for some PMDs to fail during tx_burst due to other reasons. In this case,
> >> repeated tries to send will also fail.
> >>
> >> I'm not sure if all PMDs need to support the behavior of sending packets in a
> >> loop until it succeeds. If not, I think the current problem can be reminded to
> >> the user by adding a description to the bonding. If it is necessary, I think the
> >> description of tx_burst should also add related instructions, so that the developers
> >> of PMDs can better understand how tx_burst should be designed, such as putting all
> >> hardware-related constraint checks into tx_prepare. And another prerequisite for
> >> the above behavior is that the packets must be prepared (i.e. checked by
> >> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
> >> to use rte_eth_tx_prepare() in more scenarios.
> >>
> >> What's Ferruh's opinion on this?
> >>
> >>> [snip]
> >>>
> >>> .
> >>>
> >
  
Andrew Rybchenko June 14, 2021, 2:16 p.m. UTC | #8
On 6/10/21 10:32 AM, Chengchang Tang wrote:
> On 2021/6/9 17:35, Andrew Rybchenko wrote:
>> On 6/9/21 9:42 AM, Chengchang Tang wrote:
>>> Hi, Andrew and Ferruh
>>>
>>> On 2021/6/8 17:49, Andrew Rybchenko wrote:
>>>> "for bonding" is redundant in the summary since it is already "net/bonding".
>>>>
>>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>>>> [commit message and rte_eth_bond.h diff snipped]
>>>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>>>>   	/* Send packet burst on each slave device */
>>>>>   	for (i = 0; i < num_of_slaves; i++) {
>>>>>   		if (slave_nb_pkts[i] > 0) {
>>>>> +			int nb_prep_pkts;
>>>>> +
>>>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>>>>> +					bd_tx_q->queue_id, slave_bufs[i],
>>>>> +					slave_nb_pkts[i]);
>>>>> +
>>>>
>>>> Shouldn't it be called iff queue Tx offloads are not zero?
>>>> It will allow to decrease performance degradation if no
>>>> Tx offloads are enabled. Same in all cases below.
>>>
>>> Regarding this point, it has been discussed in the previous RFC:
>>> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
>>>
>>> According to the TX_OFFLOAD status of the current device, PMDs can determine
>>> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
>>> to NULL, so that the actual tx_prepare processing will be skipped directly in
>>> rte_eth_tx_prepare().
>>
>> I still think that the following is right:
>> No Tx offloads at all => Tx prepare is not necessary
>>
>> Am I wrong?
>>
> 
> Let PMDs determine whether tx_prepare() need be done could reduce the performance
> loss in more scenarios. For example, some offload do not need a Tx prepare, and PMDs
> could set tx_prepare to NULL in this scenario. Even if rte_eth_tx_prepare() is called,
> it will not perform the tx_prepare callback, and then return. In this case, there is
> only one judgment logic. If we judge whether tx_offloads are not zero, one more logical
> judgment is added.

I'll wait for the net/bonding maintainers' decision here.

IMHO all above assumptions should be proven by performance measurements.

> Of course, some PMDs currently do not optimize tx_prepare, which may have a performance
> impact. We hope to force them to optimize tx_prepare in this way, just like they optimize
> tx_burst. This makes it easier for users to use tx_prepare(), and no longer need to
> consider that using tx_prepare() will introduce unnecessary performance degradation.
> 
> IMHO tx_prepare() should be extended to all scenarios for use, and the impact on
> performance should be optimized by PMDs. Let the application consider when it should be
> used and when it should not be used, in many cases it will not be used and then introduced
> some problem.
> 
>>>>
>>>>>   			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>> -					slave_bufs[i], slave_nb_pkts[i]);
>>>>> +					slave_bufs[i], nb_prep_pkts);
>>>>
>>>> In fact it is a problem here, and a really big one.
>>>> Tx prepare may fail and return fewer packets. Tx prepare
>>>> of some packet may always fail. If the application tries
>>>> to send packets in a loop until success, it will loop
>>>> forever here. Since the application calls Tx burst, it is
>>>> 100% legal behaviour for the function to return 0 if the
>>>> Tx ring is full. That is not an error indication. However,
>>>> in the case of Tx prepare it is an error indication.
>>>>
>>>> Should we change the Tx burst description and enforce
>>>> callers to check rte_errno? It sounds like a major change...
>>>>
>>>
>>> I agree that a failure caused by the Tx ring being full is legal behaviour.
>>> But what about failures caused by other reasons? At present, it is possible
>>> for some PMDs to fail during tx_burst for other reasons. In such a case,
>>> repeated attempts to send will also fail.
>>
>> If so, the packet should simply be dropped by Tx burst and Tx burst
>> should move on. If a packet cannot be transmitted, it must be
>> dropped (and counted) and Tx burst should move to the next packet.
>>
>>> I'm not sure whether all PMDs need to support the behaviour of sending
>>> packets in a loop until it succeeds. If not, I think the current problem can
>>> be pointed out to users by adding a description to the bonding documentation.
>>> If it is necessary, I think the description of tx_burst should also gain
>>> related instructions, so that PMD developers can better understand how
>>> tx_burst should be designed, such as putting all hardware-related constraint
>>> checks into tx_prepare. Another prerequisite for the above behaviour is that
>>> the packets must be prepared (i.e. checked by rte_eth_tx_prepare());
>>> otherwise, the send may also fail. This means that we have to use
>>> rte_eth_tx_prepare() in more scenarios.
>>
>> IMHO any PMD-specific behaviour is a nightmare for application
>> developers and must be avoided. Ideally an application should not
>> care if it is running on top of tap, virtio, failsafe or
>> bonding. It should talk to ethdev in terms of the ethdev API, that's
>> it. I know that net/bonding is designed so that the application
>> must know about it, but IMHO the places where that knowledge is
>> required must be minimized to make applications more portable
>> across various PMDs/HW.
>>
>> I think that the only sensible solution for the above problem is
>> to skip a packet which prepare dislikes, count it as dropped,
>> and try to prepare/transmit the subsequent packets.
> 
> Agree, I will fix this in the next version.
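> 
> A rough sketch of what the fix could look like (untested; port_id,
> queue_id, bufs, nb_pkts and the tx_dropped counter are assumed to be
> in scope):
> "
> 	uint16_t idx = 0, nb_tx;
> 
> 	while (idx < nb_pkts) {
> 		uint16_t nb_prep = rte_eth_tx_prepare(port_id, queue_id,
> 				&bufs[idx], nb_pkts - idx);
> 
> 		idx += nb_prep;
> 		if (idx == nb_pkts)
> 			break;
> 		/* bufs[idx] was rejected (rte_errno is set): drop and
> 		 * count it, then keep preparing the packets after it.
> 		 */
> 		rte_pktmbuf_free(bufs[idx]);
> 		tx_dropped++;
> 		memmove(&bufs[idx], &bufs[idx + 1],
> 				(nb_pkts - idx - 1) * sizeof(bufs[0]));
> 		nb_pkts--;
> 	}
> 	nb_tx = rte_eth_tx_burst(port_id, queue_id, bufs, nb_pkts);
> "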
> 
>>
>> It is an interesting effect of doing the Tx prepare just before
>> the Tx burst inside the bonding PMD. If Tx burst fails to send
>> something because the ring is full, a number of packets will
>> be processed by Tx prepare again and again. I guess it is
>> unavoidable.
>>
>>> What's Ferruh's opinion on this?
>>>
>>>> [snip]
  
humin (Q) May 24, 2022, 12:11 p.m. UTC | #9
Hi, Andrew,

On 2021/6/8 17:49, Andrew Rybchenko wrote:
> "for bonding" is redundant in the summary since it is already "net/bonding".
> 
> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>> To use the HW offloads capability (e.g. checksum and TSO) in the Tx
>> direction, the upper-layer users need to call rte_eth_dev_prepare to do
>> some adjustment to the packets before sending them (e.g. processing
>> pseudo headers when Tx checksum offoad enabled). But, the tx_prepare
>> callback of the bond driver is not implemented. Therefore, related
>> offloads can not be used unless the upper layer users process the packet
>> properly in their own application. But it is bad for the
>> transplantability.
>>
>> However, it is difficult to design the tx_prepare callback for bonding
>> driver. Because when a bonded device sends packets, the bonded device
>> allocates the packets to different slave devices based on the real-time
>> link status and bonding mode. That is, it is very difficult for the
>> bonding device to determine which slave device's prepare function should
>> be invoked. In addition, if the link status changes after the packets are
>> prepared, the packets may fail to be sent because packets allocation may
>> change.
>>
>> So, in this patch, the tx_prepare callback of bonding driver is not
>> implemented. Instead, the rte_eth_dev_tx_prepare() will be called for
>> all the fast path packet in mode 0, 1, 2, 4, 5, 6. In this way, all
>> tx_offloads can be processed correctly for all NIC devices in these modes.
>> If tx_prepare is not required in some cases, then slave PMDs tx_prepare
>> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
>> In these cases, the impact on performance will be very limited. It is
>> the responsibility of the slave PMDs to decide when the real tx_prepare
>> needs to be used. The information from dev_config/queue_setup is
>> sufficient for them to make these decisions.
>>
>> Note:
>> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is
>> because in broadcast mode, a packet needs to be sent by all slave ports.
>> Different PMDs process the packets differently in tx_prepare. As a result,
>> the sent packet may be incorrect.
>>
>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>> ---
>>   drivers/net/bonding/rte_eth_bond.h     |  1 -
>>   drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
>>   2 files changed, 24 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
>> index 874aa91..1e6cc6d 100644
>> --- a/drivers/net/bonding/rte_eth_bond.h
>> +++ b/drivers/net/bonding/rte_eth_bond.h
>> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
>>   int
>>   rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
>>
>> -
>>   #ifdef __cplusplus
>>   }
>>   #endif
>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>> index 2e9cea5..84af348 100644
>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>   	/* Send packet burst on each slave device */
>>   	for (i = 0; i < num_of_slaves; i++) {
>>   		if (slave_nb_pkts[i] > 0) {
>> +			int nb_prep_pkts;
>> +
>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>> +					bd_tx_q->queue_id, slave_bufs[i],
>> +					slave_nb_pkts[i]);
>> +
> 
> Shouldn't it be called iff queue Tx offloads are not zero?
> That would reduce the performance degradation when no Tx
> offloads are enabled. Same in all cases below.
> 
>>   			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>> -					slave_bufs[i], slave_nb_pkts[i]);
>> +					slave_bufs[i], nb_prep_pkts);
> 
> In fact it is a problem here, and a really big one.
> Tx prepare may fail and return fewer packets. Tx prepare
> of some packet may always fail. If the application tries
> to send packets in a loop until success, it will loop
> forever here. Since the application calls Tx burst, it is
> 100% legal behaviour for the function to return 0 if the
> Tx ring is full. That is not an error indication. However,
> in the case of Tx prepare it is an error indication.
Regardless of this patch, I think the problem already exists in
'rte_eth_tx_burst'.
For example, suppose there is one 'bad' rte_mbuf in tx_pkts[].
Taking the net driver 'fm10k' as an example: in 'fm10k_xmit_pkts',
when one rte_mbuf is 'bad' (mb->nb_segs == 0), the loop breaks
and returns the number of packets (< nb_pkts) successfully transmitted.

The code looks like this:
"
uint16_t
fm10k_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
	uint16_t nb_pkts)
{
	struct rte_mbuf *mb;
	uint16_t count;
	....
	for (count = 0; count < nb_pkts; ++count) {
		mb = tx_pkts[count];
		/* sanity check to make sure the mbuf is valid */
		if ((mb->nb_segs == 0) ||
		    ((mb->nb_segs > 1) && (mb->next == NULL)))
			break;
		....
	}
	return count;
}
"
So, if the APP sends packets in a loop until success, it will also
loop forever here.

One solution is to handle it in the APP itself, like what testpmd
does: it retries after a delay, like this:
"
	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
	/*
	 * Retry if necessary
	 */
	if (unlikely(nb_tx < nb_rx) && fs->retry_enabled) {
		retry = 0;
		while (nb_tx < nb_rx && retry++ < burst_tx_retry_num) {
			rte_delay_us(burst_tx_delay_time);
			nb_tx += rte_eth_tx_burst(fs->tx_port, fs->tx_queue,
					&pkts_burst[nb_tx], nb_rx - nb_tx);
		}
	}

"

What I mean is that this patch does not introduce new 'bugs' to
'rte_eth_tx_burst'. And the known issue in the retry situation can be
handled in the APP.
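
A bounded variant of that retry loop, which frees whatever is still
unsent once the retries are exhausted, could look like this (a sketch;
MAX_TX_RETRY and TX_RETRY_DELAY_US are illustrative constants):
"
	uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id,
			pkts_burst, nb_rx);
	uint32_t retry = 0;

	while (nb_tx < nb_rx && retry++ < MAX_TX_RETRY) {
		rte_delay_us(TX_RETRY_DELAY_US);
		nb_tx += rte_eth_tx_burst(port_id, queue_id,
				&pkts_burst[nb_tx], nb_rx - nb_tx);
	}
	/* Give up on the remainder instead of spinning forever. */
	while (nb_tx < nb_rx)
		rte_pktmbuf_free(pkts_burst[nb_tx++]);
"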

> 
> Should we change the Tx burst description and enforce
> callers to check rte_errno? It sounds like a major change...
> 
> [snip]
  

Patch

diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
index 874aa91..1e6cc6d 100644
--- a/drivers/net/bonding/rte_eth_bond.h
+++ b/drivers/net/bonding/rte_eth_bond.h
@@ -343,7 +343,6 @@  rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
 int
 rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);

-
 #ifdef __cplusplus
 }
 #endif
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index 2e9cea5..84af348 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -606,8 +606,14 @@  bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 	/* Send packet burst on each slave device */
 	for (i = 0; i < num_of_slaves; i++) {
 		if (slave_nb_pkts[i] > 0) {
+			int nb_prep_pkts;
+
+			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
+					bd_tx_q->queue_id, slave_bufs[i],
+					slave_nb_pkts[i]);
+
 			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-					slave_bufs[i], slave_nb_pkts[i]);
+					slave_bufs[i], nb_prep_pkts);

 			/* if tx burst fails move packets to end of bufs */
 			if (unlikely(num_tx_slave < slave_nb_pkts[i])) {
@@ -632,6 +638,7 @@  bond_ethdev_tx_burst_active_backup(void *queue,
 {
 	struct bond_dev_private *internals;
 	struct bond_tx_queue *bd_tx_q;
+	int nb_prep_pkts;

 	bd_tx_q = (struct bond_tx_queue *)queue;
 	internals = bd_tx_q->dev_private;
@@ -639,8 +646,11 @@  bond_ethdev_tx_burst_active_backup(void *queue,
 	if (internals->active_slave_count < 1)
 		return 0;

+	nb_prep_pkts = rte_eth_tx_prepare(internals->current_primary_port,
+				bd_tx_q->queue_id, bufs, nb_pkts);
+
 	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
-			bufs, nb_pkts);
+			bufs, nb_prep_pkts);
 }

 static inline uint16_t
@@ -939,6 +949,8 @@  bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	}

 	for (i = 0; i < num_of_slaves; i++) {
+		int nb_prep_pkts;
+
 		rte_eth_macaddr_get(slaves[i], &active_slave_addr);
 		for (j = num_tx_total; j < nb_pkts; j++) {
 			if (j + 3 < nb_pkts)
@@ -955,9 +967,12 @@  bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 		}

-		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+		nb_prep_pkts = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
 				bufs + num_tx_total, nb_pkts - num_tx_total);

+		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+				bufs + num_tx_total, nb_prep_pkts);
+
 		if (num_tx_total == nb_pkts)
 			break;
 	}
@@ -1159,12 +1174,17 @@  tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,

 	/* Send packet burst on each slave device */
 	for (i = 0; i < slave_count; i++) {
+		int nb_prep_pkts;
+
 		if (slave_nb_bufs[i] == 0)
 			continue;
+		nb_prep_pkts = rte_eth_tx_prepare(slave_port_ids[i],
+				bd_tx_q->queue_id, slave_bufs[i],
+				slave_nb_bufs[i]);

 		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
 				bd_tx_q->queue_id, slave_bufs[i],
-				slave_nb_bufs[i]);
+				nb_prep_pkts);

 		total_tx_count += slave_tx_count;