[v2,1/3] net/bonding: support Tx prepare

Message ID 20220725040842.35027-2-fengchengwen@huawei.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Series: add Tx prepare support for bonding driver

Checks

Context        Check    Description
ci/checkpatch  warning  coding style issues

Commit Message

fengchengwen July 25, 2022, 4:08 a.m. UTC
  Normally, to use the HW offload capabilities (e.g. checksum and TSO) in
the Tx direction, the application needs to call rte_eth_tx_prepare to
make some adjustments to the packets before sending them (e.g. processing
pseudo headers when Tx checksum offload is enabled). But the tx_prepare
callback of the bonding driver is not implemented. Therefore, the
sent packets may have errors (e.g. checksum errors).
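
For reference, the canonical pattern on an ordinary (non-bonded) port looks
roughly like the sketch below; the function name and error handling are
illustrative, not part of this patch:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    /* Prepare a burst for the configured Tx offloads, then transmit only
     * the packets that passed preparation. */
    static uint16_t
    xmit_prepared(uint16_t port_id, uint16_t queue_id,
                  struct rte_mbuf **pkts, uint16_t nb_pkts)
    {
            uint16_t nb_prep = rte_eth_tx_prepare(port_id, queue_id,
                                                  pkts, nb_pkts);

            if (nb_prep != nb_pkts) {
                    /* pkts[nb_prep] failed preparation; rte_errno gives
                     * the reason. The application must free or retry the
                     * unprepared tail itself. */
            }
            return rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);
    }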

However, it is difficult to design the tx_prepare callback for the
bonding driver, because when a bonded device sends packets, it
distributes the packets to different slave devices based on the real-time
link status and bonding mode. That is, it is very difficult for the
bonded device to determine which slave device's prepare function should
be invoked.

So, in this patch, the tx_prepare callback of the bonding driver is not
implemented. Instead, rte_eth_tx_prepare() will be called for
all the fast path packets in modes 0, 1, 2, 4, 5 and 6 (mode 3 is not
included, see [1]). In this way, all Tx offloads can be processed
correctly for all NIC devices in these modes.

As previously discussed (see V1), if tx_prepare fails, the bonding
driver will free the corresponding packets internally, and only the
packets that pass tx_prepare are transmitted.

To minimize the performance impact, this patch adds a new
'tx_prepare_enabled' field, and corresponding set and get APIs:
rte_eth_bond_tx_prepare_set() and rte_eth_bond_tx_prepare_get().
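
Intended usage would be along these lines (a sketch; the exact prototypes
are defined by this patch, and the argument/return conventions shown here
are assumptions):

    #include <rte_eth_bond.h>

    /* Assumed shape: enable Tx prepare on the bonded port, then verify. */
    if (rte_eth_bond_tx_prepare_set(bonded_port_id, 1) != 0)
            /* handle error */;
    if (rte_eth_bond_tx_prepare_get(bonded_port_id) == 1)
            /* slave bursts now run rte_eth_tx_prepare() internally */;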

[1]: In bond mode 3 (broadcast), a packet needs to be sent by all slave
ports. Different slave PMDs process the packets differently in
tx_prepare. If tx_prepare is called before each slave port's send, the
sent packet may be incorrect.

Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 .../link_bonding_poll_mode_drv_lib.rst        |  5 ++
 doc/guides/rel_notes/release_22_11.rst        |  6 ++
 drivers/net/bonding/eth_bond_private.h        |  1 +
 drivers/net/bonding/rte_eth_bond.h            | 24 +++++++
 drivers/net/bonding/rte_eth_bond_api.c        | 32 +++++++++
 drivers/net/bonding/rte_eth_bond_pmd.c        | 65 ++++++++++++++++---
 drivers/net/bonding/version.map               |  5 ++
 7 files changed, 130 insertions(+), 8 deletions(-)
  

Comments

Ferruh Yigit Sept. 13, 2022, 10:22 a.m. UTC | #1
On 7/25/2022 5:08 AM, Chengwen Feng wrote:

> 
> Normally, to use the HW offloads capability (e.g. checksum and TSO) in
> the Tx direction, the application needs to call rte_eth_tx_prepare to
> do some adjustment with the packets before sending them (e.g. processing
> pseudo headers when Tx checksum offload enabled). But, the tx_prepare
> callback of the bonding driver is not implemented. Therefore, the
> sent packets may have errors (e.g. checksum errors).
> 
> However, it is difficult to design the tx_prepare callback for bonding
> driver. Because when a bonded device sends packets, the bonded device
> allocates the packets to different slave devices based on the real-time
> link status and bonding mode. That is, it is very difficult for the
> bonded device to determine which slave device's prepare function should
> be invoked.
> 
> So, in this patch, the tx_prepare callback of bonding driver is not
> implemented. Instead, rte_eth_tx_prepare() will be called for
> all the fast path packets in mode 0, 1, 2, 4, 5, 6 (mode 3 is not
> included, see[1]). In this way, all tx_offloads can be processed
> correctly for all NIC devices in these modes.
> 
> As previously discussed (see V1), if tx_prepare fails, the bonding
> driver will free the corresponding packets internally, and only the
> packets that pass tx_prepare are transmitted.
> 

Please provide link to discussion you refer to.

> To minimize performance impact, this patch adds one new
> 'tx_prepare_enabled' field, and corresponding control and get API:
> rte_eth_bond_tx_prepare_set() and rte_eth_bond_tx_prepare_get().
> 
> [1]: In bond mode 3 (broadcast), a packet needs to be sent by all slave
> ports. Different slave PMDs process the packets differently in
> tx_prepare. If tx_prepare is called before each slave port's send, the sent
> packet may be incorrect.
> 
> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>

<...>

> +static inline uint16_t
> +bond_ethdev_tx_wrap(struct bond_tx_queue *bd_tx_q, uint16_t slave_port_id,
> +                   struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> +{
> +       struct bond_dev_private *internals = bd_tx_q->dev_private;
> +       uint16_t queue_id = bd_tx_q->queue_id;
> +       struct rte_mbuf *fail_pkts[nb_pkts];
> +       uint8_t fail_mark[nb_pkts];
> +       uint16_t nb_pre, index;
> +       uint16_t fail_cnt = 0;
> +       int i;
> +
> +       if (!internals->tx_prepare_enabled)
> +               goto tx_burst;
> +
> +       nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id, tx_pkts, nb_pkts);
> +       if (nb_pre == nb_pkts)
> +               goto tx_burst;
> +
> +       fail_pkts[fail_cnt++] = tx_pkts[nb_pre];
> +       memset(fail_mark, 0, sizeof(fail_mark));
> +       fail_mark[nb_pre] = 1;
> +       for (i = nb_pre + 1; i < nb_pkts; /* update in inner loop */) {
> +               nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id,
> +                                           tx_pkts + i, nb_pkts - i);


I assume the intention is to make this as transparent as possible to the 
user; that is why you are using a wrapper that combines the 
`rte_eth_tx_prepare()` & `rte_eth_tx_burst()` APIs. But for other PMDs, 
`rte_eth_tx_burst()` is called explicitly by the application.

The patch is also adding two new bonding-specific APIs to enable/disable Tx 
prepare.
Instead, if you leave the decision to call `rte_eth_tx_prepare()` to the 
user, there will be no need for the enable/disable Tx prepare APIs or the 
wrapper.

The `tx_pkt_prepare()` implementation in bonding can do the mode check, 
call Tx prepare for all slaves and apply failure recovery, as done in 
this wrapper function. What do you think, will it work?
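
A rough sketch of that suggestion, for illustration only (it reuses the
`bond_tx_queue`/`bond_dev_private` fields visible in the driver; this is
not code from the patch):

    /* Hypothetical bonding tx_pkt_prepare: run every active slave's
     * prepare stage in place, so the packets are valid whichever slave
     * the later rte_eth_tx_burst() happens to pick. */
    static uint16_t
    bond_tx_pkt_prepare(void *queue, struct rte_mbuf **pkts,
                        uint16_t nb_pkts)
    {
            struct bond_tx_queue *bd_tx_q = queue;
            struct bond_dev_private *internals = bd_tx_q->dev_private;
            uint16_t nb_ok = nb_pkts;
            uint16_t i;

            for (i = 0; i < internals->active_slave_count; i++)
                    nb_ok = rte_eth_tx_prepare(internals->active_slaves[i],
                                               bd_tx_q->queue_id,
                                               pkts, nb_ok);
            return nb_ok;
    }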
  
Chas Williams Sept. 13, 2022, 3:08 p.m. UTC | #2
On 9/13/22 06:22, Ferruh Yigit wrote:
> On 7/25/2022 5:08 AM, Chengwen Feng wrote:
>
>
> I assume intention is to make this as transparent as possible to the
> user, that is why you are using a wrapper that combines
> `rte_eth_tx_prepare()` & `rte_eth_tx_burst()` APIs. But for other PMDs
> `rte_eth_tx_burst()` is called explicitly by the application.
>
> Patch is also adding two new bonding specific APIs to enable/disable Tx
> prepare.
> Instead if you leave calling `rte_eth_tx_prepare()` decision to user,
> there will be no need for the enable/disable Tx prepare APIs and the
> wrapper.
>
> The `tx_pkt_prepare()` implementation in bonding can do the mode check,
> call Tx prepare for all slaves and apply failure recovery, as done in
> this wrapper function, what do you think, will it work?
>

If you leave calling tx_prepare to the user, you need to perform a
hash calculation to determine an output device. Since the output
devices might be of different types, you really need to know which
device's tx_prepare needs to be called. You will potentially calculate
the packet hash twice. While you could be clever and store the hash
in the mbuf somewhere, that's likely more complicated.
  
fengchengwen Sept. 14, 2022, 12:46 a.m. UTC | #3
Hi, Ferruh

On 2022/9/13 18:22, Ferruh Yigit wrote:
> On 7/25/2022 5:08 AM, Chengwen Feng wrote:
> 
>>
>> Normally, to use the HW offloads capability (e.g. checksum and TSO) in
>> the Tx direction, the application needs to call rte_eth_dev_prepare to
>> do some adjustment with the packets before sending them (e.g. processing
>> pseudo headers when Tx checksum offload enabled). But, the tx_prepare
>> callback of the bonding driver is not implemented. Therefore, the
>> sent packets may have errors (e.g. checksum errors).
>>
>> However, it is difficult to design the tx_prepare callback for bonding
>> driver. Because when a bonded device sends packets, the bonded device
>> allocates the packets to different slave devices based on the real-time
>> link status and bonding mode. That is, it is very difficult for the
>> bonded device to determine which slave device's prepare function should
>> be invoked.
>>
>> So, in this patch, the tx_prepare callback of bonding driver is not
>> implemented. Instead, the rte_eth_dev_tx_prepare() will be called for
>> all the fast path packets in mode 0, 1, 2, 4, 5, 6 (mode 3 is not
>> included, see[1]). In this way, all tx_offloads can be processed
>> correctly for all NIC devices in these modes.
>>
>> As previously discussed (see V1), if the tx_prepare fails, the bonding
>> driver will free the cossesponding packets internally, and only the
>> packets of the tx_prepare OK are xmit.
>>
> 
> Please provide link to discussion you refer to.

https://inbox.dpdk.org/dev/1618571071-5927-2-git-send-email-tangchengchang@huawei.com/

Should I push a new version for it?

> 
>> To minimize performance impact, this patch adds one new
>> 'tx_prepare_enabled' field, and corresponding control and get API:
>> rte_eth_bond_tx_prepare_set() and rte_eth_bond_tx_prepare_get().
>>
>> [1]: In bond mode 3 (broadcast), a packet needs to be sent by all slave
>> ports. Different slave PMDs process the packets differently in
>> tx_prepare. If call tx_prepare before each slave port sending, the sent
>> packet may be incorrect.
>>
>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> 
> <...>
> 
>> +static inline uint16_t
>> +bond_ethdev_tx_wrap(struct bond_tx_queue *bd_tx_q, uint16_t slave_port_id,
>> +                   struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
>> +{
>> +       struct bond_dev_private *internals = bd_tx_q->dev_private;
>> +       uint16_t queue_id = bd_tx_q->queue_id;
>> +       struct rte_mbuf *fail_pkts[nb_pkts];
>> +       uint8_t fail_mark[nb_pkts];
>> +       uint16_t nb_pre, index;
>> +       uint16_t fail_cnt = 0;
>> +       int i;
>> +
>> +       if (!internals->tx_prepare_enabled)
>> +               goto tx_burst;
>> +
>> +       nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id, tx_pkts, nb_pkts);
>> +       if (nb_pre == nb_pkts)
>> +               goto tx_burst;
>> +
>> +       fail_pkts[fail_cnt++] = tx_pkts[nb_pre];
>> +       memset(fail_mark, 0, sizeof(fail_mark));
>> +       fail_mark[nb_pre] = 1;
>> +       for (i = nb_pre + 1; i < nb_pkts; /* update in inner loop */) {
>> +               nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id,
>> +                                           tx_pkts + i, nb_pkts - i);
> 
> 
> I assume intention is to make this as transparent as possible to the user, that is why you are using a wrapper that combines `rte_eth_tx_prepare()` & `rte_eth_tx_burst()` APIs. But for other PMDs `rte_eth_tx_burst()` is called explicitly by the application.
> 
> Patch is also adding two new bonding specific APIs to enable/disable Tx prepare.
> Instead if you leave calling `rte_eth_tx_prepare()` decision to user, there will be no need for the enable/disable Tx prepare APIs and the wrapper.
> 
> The `tx_pkt_prepare()` implementation in bonding can do the mode check, call Tx prepare for all slaves and apply failure recovery, as done in this wrapper function, what do you think, will it work?

I see Chas Williams also replied to this thread, thanks.

The main problem is that it is hard to design a tx_prepare for a bonding device:
1. as Chas Williams said, there may be two hash calculations to get the target
   slave devices.
2. more importantly, if the slave devices change (e.g. a slave device's link
   goes down or it is removed), and the change happens between bond-tx-prepare
   and bond-tx-burst, the output slave will change, and this may lead to
   checksum failures. (Note: the slave devices of a bond device may be from
   different vendors and may have different requirements, e.g. slave-A supports
   calculating the IPv4 pseudo-header automatically (no driver pre-calculation
   needed), but slave-B needs the driver to pre-calculate it.)

The current design covers the above two scenarios by using in-place tx-prepare.
In addition, bond devices are not transparent to applications, so I think this
is a practical way to provide tx-prepare support.

> 
> .
  
Chas Williams Sept. 14, 2022, 4:59 p.m. UTC | #4
On 9/13/22 20:46, fengchengwen wrote:
> 
> The main problem is hard to design a tx_prepare for bonding device:
> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>     devices.
> 2. also more important, if the slave devices have changes(e.g. slave device
>     link down or remove), and if the changes happens between bond-tx-prepare and
>     bond-tx-burst, the output slave will changes, and this may lead to checksum
>     failed. (Note: a bond device with slave devices may from different vendors,
>     and slave devices may have different requirements, e.g. slave-A support calc
>     IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>     pre-calc).
> 
> Current design cover the above two scenarios by using in-place tx-prepare. and
> in addition, bond devices are not transparent to applications, I think it's a
> practical method to provide tx-prepare support in this way.
> 


I don't think you need to export an enable/disable routine for the use of
rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
implemented. You are just trading one branch in DPDK librte_eth_dev for a
branch in drivers/net/bonding.
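
For reference, `rte_eth_tx_prepare()` already degenerates to a no-op when a
PMD provides no callback; the inline in rte_ethdev.h is roughly as follows
(paraphrased, with the debug checks elided):

    static inline uint16_t
    rte_eth_tx_prepare(uint16_t port_id, uint16_t queue_id,
                       struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
    {
            struct rte_eth_dev *dev = &rte_eth_devices[port_id];

            /* No prepare callback: report the whole burst as ready. */
            if (dev->tx_pkt_prepare == NULL)
                    return nb_pkts;

            return dev->tx_pkt_prepare(dev->data->tx_queues[queue_id],
                                       tx_pkts, nb_pkts);
    }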

I think you missed fixing tx_machine in the 802.3ad support. We have been
using the following patch locally, which I never got around to submitting.


 From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
From: "Charles (Chas) Williams" <chwillia@ciena.com>
Date: Tue, 3 May 2022 16:52:37 -0400
Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst

Some PMDs might require a call to rte_eth_tx_prepare before sending the
packets for transmission. Typically, the prepare step handles the VLAN
headers, but it may need to do other things.

Signed-off-by: Chas Williams <chwillia@ciena.com>
---
  drivers/net/bonding/rte_eth_bond_8023ad.c | 16 +++++++-
  drivers/net/bonding/rte_eth_bond_pmd.c    | 50 +++++++++++++++++++----
  2 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
index b3cddd8a20..0680be8c06 100644
--- a/drivers/net/bonding/rte_eth_bond_8023ad.c
+++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
@@ -636,9 +636,15 @@ tx_machine(struct bond_dev_private *internals, uint16_t slave_id)
  			return;
  		}
  	} else {
-		uint16_t pkts_sent = rte_eth_tx_burst(slave_id,
+		uint16_t pkts_sent;
+		uint16_t nb_prep;
+
+		nb_prep = rte_eth_tx_prepare(slave_id,
  				internals->mode4.dedicated_queues.tx_qid,
  				&lacp_pkt, 1);
+		pkts_sent = rte_eth_tx_burst(slave_id,
+				internals->mode4.dedicated_queues.tx_qid,
+				&lacp_pkt, nb_prep);
  		if (pkts_sent != 1) {
  			rte_pktmbuf_free(lacp_pkt);
  			set_warning_flags(port, WRN_TX_QUEUE_FULL);
@@ -1371,9 +1377,15 @@ bond_mode_8023ad_handle_slow_pkt(struct bond_dev_private *internals,
  			}
  		} else {
  			/* Send packet directly to the slow queue */
-			uint16_t tx_count = rte_eth_tx_burst(slave_id,
+			uint16_t tx_count;
+			uint16_t nb_prep;
+
+			nb_prep = rte_eth_tx_prepare(slave_id,
  					internals->mode4.dedicated_queues.tx_qid,
  					&pkt, 1);
+			tx_count = rte_eth_tx_burst(slave_id,
+					internals->mode4.dedicated_queues.tx_qid,
+					&pkt, nb_prep);
  			if (tx_count != 1) {
  				/* reset timer */
  				port->rx_marker_timer = 0;
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index b305b6a35b..c27073e66f 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -602,8 +602,12 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
  	/* Send packet burst on each slave device */
  	for (i = 0; i < num_of_slaves; i++) {
  		if (slave_nb_pkts[i] > 0) {
-			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+			uint16_t nb_prep;
+
+			nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
  					slave_bufs[i], slave_nb_pkts[i]);
+			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+					slave_bufs[i], nb_prep);
  
  			/* if tx burst fails move packets to end of bufs */
  			if (unlikely(num_tx_slave < slave_nb_pkts[i])) {
@@ -628,6 +632,7 @@ bond_ethdev_tx_burst_active_backup(void *queue,
  {
  	struct bond_dev_private *internals;
  	struct bond_tx_queue *bd_tx_q;
+	uint16_t nb_prep;
  
  	bd_tx_q = (struct bond_tx_queue *)queue;
  	internals = bd_tx_q->dev_private;
@@ -635,8 +640,10 @@ bond_ethdev_tx_burst_active_backup(void *queue,
  	if (internals->active_slave_count < 1)
  		return 0;
  
-	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
-			bufs, nb_pkts);
+	nb_prep = rte_eth_tx_prepare(internals->current_primary_port,
+				     bd_tx_q->queue_id, bufs, nb_pkts);
+	return rte_eth_tx_burst(internals->current_primary_port,
+				bd_tx_q->queue_id, bufs, nb_prep);
  }
  
  static inline uint16_t
@@ -935,6 +942,8 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
  	}
  
  	for (i = 0; i < num_of_slaves; i++) {
+		uint16_t nb_prep;
+
  		rte_eth_macaddr_get(slaves[i], &active_slave_addr);
  		for (j = num_tx_total; j < nb_pkts; j++) {
  			if (j + 3 < nb_pkts)
@@ -951,8 +960,10 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
  #endif
  		}
  
-		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+		nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
  				bufs + num_tx_total, nb_pkts - num_tx_total);
+		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+				bufs + num_tx_total, nb_prep);
  
  		if (num_tx_total == nb_pkts)
  			break;
@@ -1064,8 +1075,12 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
  	/* Send ARP packets on proper slaves */
  	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
  		if (slave_bufs_pkts[i] > 0) {
-			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
+			uint16_t nb_prep;
+
+			nb_prep = rte_eth_tx_prepare(i, bd_tx_q->queue_id,
  					slave_bufs[i], slave_bufs_pkts[i]);
+			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
+					slave_bufs[i], nb_prep);
  			for (j = 0; j < slave_bufs_pkts[i] - num_send; j++) {
  				bufs[nb_pkts - 1 - num_not_send - j] =
  						slave_bufs[i][nb_pkts - 1 - j];
@@ -1088,8 +1103,12 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
  	/* Send update packets on proper slaves */
  	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
  		if (update_bufs_pkts[i] > 0) {
-			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id, update_bufs[i],
+			uint16_t nb_prep;
+
+			nb_prep = rte_eth_tx_prepare(i, bd_tx_q->queue_id, update_bufs[i],
  					update_bufs_pkts[i]);
+			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id, update_bufs[i],
+					nb_prep);
  			for (j = num_send; j < update_bufs_pkts[i]; j++) {
  				rte_pktmbuf_free(update_bufs[i][j]);
  			}
@@ -1155,12 +1174,17 @@ tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
  
  	/* Send packet burst on each slave device */
  	for (i = 0; i < slave_count; i++) {
+		uint16_t nb_prep;
+
  		if (slave_nb_bufs[i] == 0)
  			continue;
  
-		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+		nb_prep = rte_eth_tx_prepare(slave_port_ids[i],
  				bd_tx_q->queue_id, slave_bufs[i],
  				slave_nb_bufs[i]);
+		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+				bd_tx_q->queue_id, slave_bufs[i],
+				nb_prep);
  
  		total_tx_count += slave_tx_count;
  
@@ -1243,8 +1267,12 @@ tx_burst_8023ad(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
  
  		if (rte_ring_dequeue(port->tx_ring,
  				     (void **)&ctrl_pkt) != -ENOENT) {
-			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+			uint16_t nb_prep;
+
+			nb_prep = rte_eth_tx_prepare(slave_port_ids[i],
  					bd_tx_q->queue_id, &ctrl_pkt, 1);
+			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+					bd_tx_q->queue_id, &ctrl_pkt, nb_prep);
  			/*
  			 * re-enqueue LAG control plane packets to buffering
  			 * ring if transmission fails so the packet isn't lost.
@@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
  
  	/* Transmit burst on each active slave */
  	for (i = 0; i < num_of_slaves; i++) {
-		slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+		uint16_t nb_prep;
+
+		nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
  					bufs, nb_pkts);
+		slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+					bufs, nb_prep);
  
  		if (unlikely(slave_tx_total[i] < nb_pkts))
  			tx_failed_flag = 1;
  
fengchengwen Sept. 17, 2022, 2:35 a.m. UTC | #5
Hi Chas,

On 2022/9/15 0:59, Chas Williams wrote:
> On 9/13/22 20:46, fengchengwen wrote:
>>
>> The main problem is hard to design a tx_prepare for bonding device:
>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>     devices.
>> 2. also more important, if the slave devices have changes(e.g. slave device
>>     link down or remove), and if the changes happens between bond-tx-prepare and
>>     bond-tx-burst, the output slave will changes, and this may lead to checksum
>>     failed. (Note: a bond device with slave devices may from different vendors,
>>     and slave devices may have different requirements, e.g. slave-A support calc
>>     IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>     pre-calc).
>>
>> Current design cover the above two scenarios by using in-place tx-prepare. and
>> in addition, bond devices are not transparent to applications, I think it's a
>> practical method to provide tx-prepare support in this way.
>>
> 
> 
> I don't think you need to export an enable/disable routine for the use of
> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
> implemented. You are just trading one branch in DPDK librte_eth_dev for a
> branch in drivers/net/bonding.

Our first patch was just like yours (it just added tx-prepare by default), but
the community was concerned about the performance impact.

As a trade-off, I think we can add the enable/disable API.

> 
> I think you missed fixing tx_machine in 802.3ad support. We have been using
> the following patch locally which I never got around to submitting.

You are right, I will send a V3 to fix it.

> 
> 
> From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
> From: "Charles (Chas) Williams" <chwillia@ciena.com>
> Date: Tue, 3 May 2022 16:52:37 -0400
> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
> 
> Some PMDs might require a call to rte_eth_tx_prepare before sending the
> packets for transmission. Typically, the prepare step handles the VLAN
> headers, but it may need to do other things.
> 
> Signed-off-by: Chas Williams <chwillia@ciena.com>

...

>               * ring if transmission fails so the packet isn't lost.
> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>  
>      /* Transmit burst on each active slave */
>      for (i = 0; i < num_of_slaves; i++) {
> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> +        uint16_t nb_prep;
> +
> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>                      bufs, nb_pkts);
> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> +                    bufs, nb_prep);

tx-prepare may edit the packet data, and broadcast mode sends a packet to all slaves,
so the packet data is sent and edited at the same time. Is this likely to cause problems?

>  
>          if (unlikely(slave_tx_total[i] < nb_pkts))
>              tx_failed_flag = 1;
  
Chas Williams Sept. 17, 2022, 1:38 p.m. UTC | #6
On 9/16/22 22:35, fengchengwen wrote:
> Hi Chas,
> 
> On 2022/9/15 0:59, Chas Williams wrote:
>> On 9/13/22 20:46, fengchengwen wrote:
>>>
>>> The main problem is hard to design a tx_prepare for bonding device:
>>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>>      devices.
>>> 2. also more important, if the slave devices have changes(e.g. slave device
>>>      link down or remove), and if the changes happens between bond-tx-prepare and
>>>      bond-tx-burst, the output slave will changes, and this may lead to checksum
>>>      failed. (Note: a bond device with slave devices may from different vendors,
>>>      and slave devices may have different requirements, e.g. slave-A support calc
>>>      IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>>      pre-calc).
>>>
>>> Current design cover the above two scenarios by using in-place tx-prepare. and
>>> in addition, bond devices are not transparent to applications, I think it's a
>>> practical method to provide tx-prepare support in this way.
>>>
>>
>>
>> I don't think you need to export an enable/disable routine for the use of
>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>> branch in drivers/net/bonding.
> 
> Our first patch was just like yours (just add tx-prepare default), but community
> is concerned about impacting performance.
> 
> As a trade-off, I think we can add the enable/disable API.

IMHO, that's a bad idea. If the rte_eth_tx_prepare API affects
performance adversely, that is not a bonding problem. All applications
should be calling rte_eth_tx_prepare. There's no defined API
to determine if rte_eth_tx_prepare should be called. Therefore,
applications should always call rte_eth_tx_prepare. Regardless,
as I previously mentioned, you are just trading the location of
the branch, especially in the bonding case.

If rte_eth_tx_prepare is causing a performance drop, then that API
should be improved or rewritten. There are PMDs that require you to use
that API. Locally, we had maintained a patch to eliminate the use of
rte_eth_tx_prepare. However, that has been getting harder and harder
to maintain. The performance lost by just calling rte_eth_tx_prepare
was marginal.

> 
>>
>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>> the following patch locally which I never got around to submitting.
> 
> You are right, I will send V3 fix it.
> 
>>
>>
>>  From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
>> Date: Tue, 3 May 2022 16:52:37 -0400
>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>
>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>> packets for transmission. Typically, the prepare step handles the VLAN
>> headers, but it may need to do other things.
>>
>> Signed-off-by: Chas Williams <chwillia@ciena.com>
> 
> ...
> 
>>                * ring if transmission fails so the packet isn't lost.
>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>   
>>       /* Transmit burst on each active slave */
>>       for (i = 0; i < num_of_slaves; i++) {
>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>> +        uint16_t nb_prep;
>> +
>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>                       bufs, nb_pkts);
>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>> +                    bufs, nb_prep);
> 
> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
> the packet data is sent and edited at the same time. Is this likely to cause problems ?

This routine is already broken. You can't just increment the refcount
and send the packet into a PMD's transmit routine. Nothing guarantees
that a transmit routine will not modify the packet. Many PMDs perform an
rte_vlan_insert. You should at least perform a clone of the packet so
that the mbuf headers aren't mangled by each PMD. Just to be safe, you
should perform a partial deep copy of the packet headers in case some
PMD does an rte_vlan_insert and the other PMDs in the bonding group do
not need an rte_vlan_insert.

So doing a blind rte_eth_tx_prepare isn't making anything much
worse.
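
A minimal sketch of that defensive copy ('pool' is an assumed mempool, and
a full-packet rte_pktmbuf_copy() stands in for a header-only deep copy for
brevity):

    #include <stdint.h>
    #include <rte_mbuf.h>

    /* Give each slave its own writable packet so one PMD's header edits
     * (e.g. rte_vlan_insert()) cannot corrupt the copy another slave is
     * still transmitting. */
    static struct rte_mbuf *
    dup_for_slave(const struct rte_mbuf *pkt, struct rte_mempool *pool)
    {
            /* Deep-copies from offset 0 up to the end of the packet. */
            return rte_pktmbuf_copy(pkt, pool, 0, UINT32_MAX);
    }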

> 
>>   
>>           if (unlikely(slave_tx_total[i] < nb_pkts))
>>               tx_failed_flag = 1;
  
Konstantin Ananyev Sept. 19, 2022, 2:07 p.m. UTC | #7
> 
> On 9/16/22 22:35, fengchengwen wrote:
> > Hi Chas,
> >
> > On 2022/9/15 0:59, Chas Williams wrote:
> >> On 9/13/22 20:46, fengchengwen wrote:
> >>>
> >>> The main problem is hard to design a tx_prepare for bonding device:
> >>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
> >>>      devices.
> >>> 2. also more important, if the slave devices have changes(e.g. slave device
> >>>      link down or remove), and if the changes happens between bond-tx-prepare and
> >>>      bond-tx-burst, the output slave will changes, and this may lead to checksum
> >>>      failed. (Note: a bond device with slave devices may from different vendors,
> >>>      and slave devices may have different requirements, e.g. slave-A support calc
> >>>      IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
> >>>      pre-calc).
> >>>
> >>> Current design cover the above two scenarios by using in-place tx-prepare. and
> >>> in addition, bond devices are not transparent to applications, I think it's a
> >>> practical method to provide tx-prepare support in this way.
> >>>
> >>
> >>
> >> I don't think you need to export an enable/disable routine for the use of
> >> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
> >> implemented. You are just trading one branch in DPDK librte_eth_dev for a
> >> branch in drivers/net/bonding.
> >
> > Our first patch was just like yours (just add tx-prepare default), but community
> > is concerned about impacting performance.
> >
> > As a trade-off, I think we can add the enable/disable API.
> 
> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
> performance adversly, that is not a bonding problem. All applications
> should be calling rte_eth_dev_tx_prepare. There's no defined API
> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
> applications should always call rte_eth_dev_tx_prepare. Regardless,
> as I previously mentioned, you are just trading the location of
> the branch, especially in the bonding case.
> 
> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
> should be improved or rewritten. There are PMD that require you to use
> that API. Locally, we had maintained a patch to eliminate the use of
> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
> was marginal.
> 
> >
> >>
> >> I think you missed fixing tx_machine in 802.3ad support. We have been using
> >> the following patch locally which I never got around to submitting.
> >
> > You are right, I will send V3 fix it.
> >
> >>
> >>
> >>  From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
> >> From: "Charles (Chas) Williams" <chwillia@ciena.com>
> >> Date: Tue, 3 May 2022 16:52:37 -0400
> >> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
> >>
> >> Some PMDs might require a call to rte_eth_tx_prepare before sending the
> >> packets for transmission. Typically, the prepare step handles the VLAN
> >> headers, but it may need to do other things.
> >>
> >> Signed-off-by: Chas Williams <chwillia@ciena.com>
> >
> > ...
> >
> >>                * ring if transmission fails so the packet isn't lost.
> >> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
> >>
> >>       /* Transmit burst on each active slave */
> >>       for (i = 0; i < num_of_slaves; i++) {
> >> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >> +        uint16_t nb_prep;
> >> +
> >> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
> >>                       bufs, nb_pkts);
> >> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >> +                    bufs, nb_prep);
> >
> > The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
> > the packet data is sent and edited at the same time. Is this likely to cause problems ?
> 
> This routine is already broken. You can't just increment the refcount
> and send the packet into a PMD's transmit routine. Nothing guarantees
> that a transmit routine will not modify the packet. Many PMDs perform an
> rte_vlan_insert. 

Hmm, interesting.... 
My understanding was quite the opposite - tx_burst() can't modify packet data and metadata
(except when refcnt==1 and tx_burst() is going to free the mbuf and put it back into the mempool),
while tx_prepare() can - actually, as I remember, that was one of the reasons why a separate
routine was introduced.

> You should at least perform a clone of the packet so
> that the mbuf headers aren't mangled by each PMD. Just to be safe you
> should perform a partial deep copy of the packet headers in case some
> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
> not need an rte_vlan_insert.
> 
> So doing a blind rte_eth_tx_prepare isn't making anything much
> worse.
> 
> >
> >>
> >>           if (unlikely(slave_tx_total[i] < nb_pkts))
> >>               tx_failed_flag = 1;
  
Chas Williams Sept. 19, 2022, 11:02 p.m. UTC | #8
On 9/19/22 10:07, Konstantin Ananyev wrote:
> 
>>
>> On 9/16/22 22:35, fengchengwen wrote:
>>> Hi Chas,
>>>
>>> On 2022/9/15 0:59, Chas Williams wrote:
>>>> On 9/13/22 20:46, fengchengwen wrote:
>>>>>
>>>>> The main problem is hard to design a tx_prepare for bonding device:
>>>>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>>>>       devices.
>>>>> 2. also more important, if the slave devices have changes(e.g. slave device
>>>>>       link down or remove), and if the changes happens between bond-tx-prepare and
>>>>>       bond-tx-burst, the output slave will changes, and this may lead to checksum
>>>>>       failed. (Note: a bond device with slave devices may from different vendors,
>>>>>       and slave devices may have different requirements, e.g. slave-A support calc
>>>>>       IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>>>>       pre-calc).
>>>>>
>>>>> Current design cover the above two scenarios by using in-place tx-prepare. and
>>>>> in addition, bond devices are not transparent to applications, I think it's a
>>>>> practical method to provide tx-prepare support in this way.
>>>>>
>>>>
>>>>
>>>> I don't think you need to export an enable/disable routine for the use of
>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>> branch in drivers/net/bonding.
>>>
>>> Our first patch was just like yours (just add tx-prepare default), but community
>>> is concerned about impacting performance.
>>>
>>> As a trade-off, I think we can add the enable/disable API.
>>
>> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
>> performance adversly, that is not a bonding problem. All applications
>> should be calling rte_eth_dev_tx_prepare. There's no defined API
>> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
>> applications should always call rte_eth_dev_tx_prepare. Regardless,
>> as I previously mentioned, you are just trading the location of
>> the branch, especially in the bonding case.
>>
>> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
>> should be improved or rewritten. There are PMD that require you to use
>> that API. Locally, we had maintained a patch to eliminate the use of
>> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
>> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
>> was marginal.
>>
>>>
>>>>
>>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>>>> the following patch locally which I never got around to submitting.
>>>
>>> You are right, I will send V3 fix it.
>>>
>>>>
>>>>
>>>>   From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>
>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>> headers, but it may need to do other things.
>>>>
>>>> Signed-off-by: Chas Williams <chwillia@ciena.com>
>>>
>>> ...
>>>
>>>>                 * ring if transmission fails so the packet isn't lost.
>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>
>>>>        /* Transmit burst on each active slave */
>>>>        for (i = 0; i < num_of_slaves; i++) {
>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>> +        uint16_t nb_prep;
>>>> +
>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>                        bufs, nb_pkts);
>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>> +                    bufs, nb_prep);
>>>
>>> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
>>> the packet data is sent and edited at the same time. Is this likely to cause problems ?
>>
>> This routine is already broken. You can't just increment the refcount
>> and send the packet into a PMD's transmit routine. Nothing guarantees
>> that a transmit routine will not modify the packet. Many PMDs perform an
>> rte_vlan_insert.
> 
> Hmm interesting....
> My understanding was quite opposite - tx_burst() can't modify packet data and metadata
> (except when refcnt==1 and tx_burst() going to free the mbuf and put it back to the mempool).
> While tx_prepare() can - actually as I remember that was one of the reasons why a separate routine
> was introduced.

Is that documented anywhere? It's been my experience that the device PMD
can do practically anything and you need to protect yourself. Currently,
the af_packet, dpaa2, and vhost drivers call rte_vlan_insert. Before 2019,
the virtio driver also used to call rte_vlan_insert during its transmit
path. Of course, rte_vlan_insert modifies the packet data and the mbuf
header. Regardless, it looks like rte_eth_tx_prepare should always be
called. Handling that correctly in broadcast mode probably means always
making a deep copy of the packet, or checking to see if all the members are
the same PMD type. If so, you can just call prepare once. You could track
the mismatched nature during addition/removal of the members. Or just
assume people aren't going to mismatch bonding members.
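
The "same PMD type" check could look something like this (a sketch; the
function name is illustrative):

    #include <string.h>
    #include <rte_ethdev.h>

    /* Return 1 if every slave reports the same driver name, in which case
     * a single rte_eth_tx_prepare() pass can serve a broadcast burst. */
    static int
    slaves_share_driver(const uint16_t *slaves, uint16_t n)
    {
            const char *first = NULL;
            struct rte_eth_dev_info info;
            uint16_t i;

            for (i = 0; i < n; i++) {
                    if (rte_eth_dev_info_get(slaves[i], &info) != 0)
                            return 0;
                    if (first == NULL)
                            first = info.driver_name;
                    else if (strcmp(first, info.driver_name) != 0)
                            return 0;
            }
            return 1;
    }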

  
>> You should at least perform a clone of the packet so
>> that the mbuf headers aren't mangled by each PMD. Just to be safe you
>> should perform a partial deep copy of the packet headers in case some
>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>> not need an rte_vlan_insert.
>>
>> So doing a blind rte_eth_dev_tx_preprare isn't making anything much
>> worse.
>>
>>>
>>>>
>>>>            if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>                tx_failed_flag = 1;
  
fengchengwen Sept. 22, 2022, 2:12 a.m. UTC | #9
On 2022/9/20 7:02, Chas Williams wrote:
> 
> 
> On 9/19/22 10:07, Konstantin Ananyev wrote:
>>
>>>
>>> On 9/16/22 22:35, fengchengwen wrote:
>>>> Hi Chas,
>>>>
>>>> On 2022/9/15 0:59, Chas Williams wrote:
>>>>> On 9/13/22 20:46, fengchengwen wrote:
>>>>>>
>>>>>> The main problem is hard to design a tx_prepare for bonding device:
>>>>>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>>>>>       devices.
>>>>>> 2. also more important, if the slave devices have changes(e.g. slave device
>>>>>>       link down or remove), and if the changes happens between bond-tx-prepare and
>>>>>>       bond-tx-burst, the output slave will changes, and this may lead to checksum
>>>>>>       failed. (Note: a bond device with slave devices may from different vendors,
>>>>>>       and slave devices may have different requirements, e.g. slave-A support calc
>>>>>>       IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>>>>>       pre-calc).
>>>>>>
>>>>>> Current design cover the above two scenarios by using in-place tx-prepare. and
>>>>>> in addition, bond devices are not transparent to applications, I think it's a
>>>>>> practical method to provide tx-prepare support in this way.
>>>>>>
>>>>>
>>>>>
>>>>> I don't think you need to export an enable/disable routine for the use of
>>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>>> branch in drivers/net/bonding.
>>>>
>>>> Our first patch was just like yours (just add tx-prepare default), but community
>>>> is concerned about impacting performance.
>>>>
>>>> As a trade-off, I think we can add the enable/disable API.
>>>
>>> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
>>> performance adversly, that is not a bonding problem. All applications
>>> should be calling rte_eth_dev_tx_prepare. There's no defined API
>>> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
>>> applications should always call rte_eth_dev_tx_prepare. Regardless,
>>> as I previously mentioned, you are just trading the location of
>>> the branch, especially in the bonding case.
>>>
>>> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
>>> should be improved or rewritten. There are PMD that require you to use
>>> that API. Locally, we had maintained a patch to eliminate the use of
>>> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
>>> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
>>> was marginal.
>>>
>>>>
>>>>>
>>>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>>>>> the following patch locally which I never got around to submitting.
>>>>
>>>> You are right, I will send V3 fix it.
>>>>
>>>>>
>>>>>
>>>>>   From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
>>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>>
>>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>>> headers, but it may need to do other things.
>>>>>
>>>>> Signed-off-by: Chas Williams <chwillia@ciena.com>
>>>>
>>>> ...
>>>>
>>>>>                 * ring if transmission fails so the packet isn't lost.
>>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>>
>>>>>        /* Transmit burst on each active slave */
>>>>>        for (i = 0; i < num_of_slaves; i++) {
>>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>> +        uint16_t nb_prep;
>>>>> +
>>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>>                        bufs, nb_pkts);
>>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>> +                    bufs, nb_prep);
>>>>
>>>> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
>>>> the packet data is sent and edited at the same time. Is this likely to cause problems ?
>>>
>>> This routine is already broken. You can't just increment the refcount
>>> and send the packet into a PMD's transmit routine. Nothing guarantees
>>> that a transmit routine will not modify the packet. Many PMDs perform an
>>> rte_vlan_insert.
>>
>> Hmm interesting....
>> My uderstanding was quite opposite - tx_burst() can't modify packet data and metadata
>> (except when refcnt==1 and tx_burst() going to free the mbuf and put it back to the mempool).
>> While tx_prepare() can - actually as I remember that was one of the reasons why a separate routine
>> was introduced.
> 
> Is that documented anywhere? It's been my experience that the device PMD
> can do practically anything and you need to protect yourself.  Currently,
> the af_packet, dpaa2, and vhost driver call rte_vlan_insert. Before 2019,
> the virtio driver also used to call rte_vlan_insert during its transmit
> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
> header. Regardless, it looks like rte_eth_dev_tx_prepare should always be
> called. Handling that correctly in broadcast mode probably means always
> make a deep copy of the packet, or check to see if all the members are
> the same PMD type. If so, you can just call prepare once. You could track
> the mismatched nature during additional/removal of the members. Or just
> assume people aren't going to mismatch bonding members.

The rte_eth_tx_prepare documentation notes:
    * Since this function can modify packet data, provided mbufs must be safely
    * writable (e.g. modified data cannot be in shared segment).
but rte_eth_tx_burst has no such requirement.

Besides the above examples of rte_vlan_insert, there are also some PMDs that
modify the mbuf's header and data, e.g. hns3/ark/bnxt will invoke
rte_pktmbuf_append if the pkt-len is too small.

I would prefer that rte_eth_tx_burst add such a restriction: the PMD should
not modify the mbuf unless refcnt==1, so that applications could rely on this
explicit definition to do business.


As for this bonding scenario, we have three alternatives:
1) as in the patch Chas provided, always do tx-prepare before tx-burst. It is
simple, but has a drawback: it may modify the mbuf without the application
being able to detect it (unless this is specially documented).
2) my patch: the application can invoke prepare_enable/disable to control
whether to do the prepare.
3) implement the bonding PMD's tx-prepare, which does tx-prepare for each
slave, but a problem remains: if the slave devices change (e.g. a new device
is added), some packet errors may occur because we have not done the prepare
for the newly added device.

note1: options 1 and 2 above both violate rte_eth_tx_burst's requirement, so
we should document this explicitly.
note2: we can do some optimization for 3, e.g. if the same driver name is
       detected on multiple slave devices, tx-prepare only needs to be
       performed once. But the problem described above still exists because
       slave devices are dynamic at runtime.

Hoping for more discussion. @Ferruh @Chas @Humin @Konstantin

> 
>  
>>> You should at least perform a clone of the packet so
>>> that the mbuf headers aren't mangled by each PMD. Just to be safe you
>>> should perform a partial deep copy of the packet headers in case some
>>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>>> not need an rte_vlan_insert.
>>>
>>> So doing a blind rte_eth_dev_tx_preprare isn't making anything much
>>> worse.
>>>
>>>>
>>>>>
>>>>>            if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>>                tx_failed_flag = 1;
> .
  
Chas Williams Sept. 25, 2022, 10:32 a.m. UTC | #10
On 9/21/22 22:12, fengchengwen wrote:
> 
> 
> On 2022/9/20 7:02, Chas Williams wrote:
>>
>>
>> On 9/19/22 10:07, Konstantin Ananyev wrote:
>>>
>>>>
>>>> On 9/16/22 22:35, fengchengwen wrote:
>>>>> Hi Chas,
>>>>>
>>>>> On 2022/9/15 0:59, Chas Williams wrote:
>>>>>> On 9/13/22 20:46, fengchengwen wrote:
>>>>>>>
>>>>>>> The main problem is hard to design a tx_prepare for bonding device:
>>>>>>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>>>>>>        devices.
>>>>>>> 2. also more important, if the slave devices have changes(e.g. slave device
>>>>>>>        link down or remove), and if the changes happens between bond-tx-prepare and
>>>>>>>        bond-tx-burst, the output slave will changes, and this may lead to checksum
>>>>>>>        failed. (Note: a bond device with slave devices may from different vendors,
>>>>>>>        and slave devices may have different requirements, e.g. slave-A support calc
>>>>>>>        IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>>>>>>        pre-calc).
>>>>>>>
>>>>>>> Current design cover the above two scenarios by using in-place tx-prepare. and
>>>>>>> in addition, bond devices are not transparent to applications, I think it's a
>>>>>>> practical method to provide tx-prepare support in this way.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I don't think you need to export an enable/disable routine for the use of
>>>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>>>> branch in drivers/net/bonding.
>>>>>
>>>>> Our first patch was just like yours (just add tx-prepare default), but community
>>>>> is concerned about impacting performance.
>>>>>
>>>>> As a trade-off, I think we can add the enable/disable API.
>>>>
>>>> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
>>>> performance adversly, that is not a bonding problem. All applications
>>>> should be calling rte_eth_dev_tx_prepare. There's no defined API
>>>> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
>>>> applications should always call rte_eth_dev_tx_prepare. Regardless,
>>>> as I previously mentioned, you are just trading the location of
>>>> the branch, especially in the bonding case.
>>>>
>>>> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
>>>> should be improved or rewritten. There are PMD that require you to use
>>>> that API. Locally, we had maintained a patch to eliminate the use of
>>>> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
>>>> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
>>>> was marginal.
>>>>
>>>>>
>>>>>>
>>>>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>>>>>> the following patch locally which I never got around to submitting.
>>>>>
>>>>> You are right, I will send V3 fix it.
>>>>>
>>>>>>
>>>>>>
>>>>>>    From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>>>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
>>>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>>>
>>>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>>>> headers, but it may need to do other things.
>>>>>>
>>>>>> Signed-off-by: Chas Williams <chwillia@ciena.com>
>>>>>
>>>>> ...
>>>>>
>>>>>>                  * ring if transmission fails so the packet isn't lost.
>>>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>>>
>>>>>>         /* Transmit burst on each active slave */
>>>>>>         for (i = 0; i < num_of_slaves; i++) {
>>>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +        uint16_t nb_prep;
>>>>>> +
>>>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>>>                         bufs, nb_pkts);
>>>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +                    bufs, nb_prep);
>>>>>
>>>>> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
>>>>> the packet data is sent and edited at the same time. Is this likely to cause problems ?
>>>>
>>>> This routine is already broken. You can't just increment the refcount
>>>> and send the packet into a PMD's transmit routine. Nothing guarantees
>>>> that a transmit routine will not modify the packet. Many PMDs perform an
>>>> rte_vlan_insert.
>>>
>>> Hmm interesting....
>>> My uderstanding was quite opposite - tx_burst() can't modify packet data and metadata
>>> (except when refcnt==1 and tx_burst() going to free the mbuf and put it back to the mempool).
>>> While tx_prepare() can - actually as I remember that was one of the reasons why a separate routine
>>> was introduced.
>>
>> Is that documented anywhere? It's been my experience that the device PMD
>> can do practically anything and you need to protect yourself.  Currently,
>> the af_packet, dpaa2, and vhost driver call rte_vlan_insert. Before 2019,
>> the virtio driver also used to call rte_vlan_insert during its transmit
>> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
>> header. Regardless, it looks like rte_eth_dev_tx_prepare should always be
>> called. Handling that correctly in broadcast mode probably means always
>> make a deep copy of the packet, or check to see if all the members are
>> the same PMD type. If so, you can just call prepare once. You could track
>> the mismatched nature during additional/removal of the members. Or just
>> assume people aren't going to mismatch bonding members.
> 
> the rte_eth_tx_prepare has notes:
>      * Since this function can modify packet data, provided mbufs must be safely
>      * writable (e.g. modified data cannot be in shared segment).
> but rte_eth_tx_burst have not such requirement.
> 
> except above examples of rte_vlan_insert, there are also some PMDs modify mbuf's header
> and data, e.g. hns3/ark/bnxt will invoke rte_pktmbuf_append in case of the pkt-len too small.
> 
> I prefer the rte_eth_tx_burst add such restricts: the PMD should not modify the mbuf except refcnt==1.
> so that application could rely on there explicit definition to do business.
> 
> 
> As for this bonding scenario, we have three alternatives:
> 1) as Chas provided patch, always do tx-prepare before tx-burst. it was simple, but have: it
> may modify the mbuf but application could not detect (unless especial documents)
> 2) my patch, application could invoke the prepare_enable/disable to control whether to do prepare.
> 3) implement bonding PMD's tx-prepare, it do tx-preare for each slave, but existing some problem:
> if the slave device changes (e.g. add new device), some packet errors may occur because we have not
> do prepare for the new add device.
> 
> note1: the above 1/2 both violate rte_eth_tx_burst's requirement, so we should especial document.
> note2: we can do some optimization for 3, e.g. if the same driver name is detected on multiple slave
>         devices, here only need to perform tx-prepare once. but the problem above descripe still exist
>         because of dynamic slave devices at runtime.
> 
> hope for more discuess. @Ferruh @Chas @Humin @Konstantin

I don't think adding an additional API due to concerns about performance is
the solution to the performance problem. If the tx_prepare API is slow,
that's what needs to be fixed. I imagine that more drivers will be using
the tx_prepare API over time, not fewer. It would be a good idea to get
used to calling it.

As for broadcast mode, let's just call tx_prepare once for any given
packet. For now, assume that no one would attempt to bond different
PMDs together. In my experience, that would be unusual. I have never
seen anyone do that in a production context. If a bug report comes in
about this failing for someone, we can fix it then.


>>>> You should at least perform a clone of the packet so
>>>> that the mbuf headers aren't mangled by each PMD. Just to be safe you
>>>> should perform a partial deep copy of the packet headers in case some
>>>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>>>> not need an rte_vlan_insert.
>>>>
>>>> So doing a blind rte_eth_dev_tx_preprare isn't making anything much
>>>> worse.
>>>>
>>>>>
>>>>>>
>>>>>>             if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>>>                 tx_failed_flag = 1;
>> .
  
Konstantin Ananyev Sept. 26, 2022, 10:18 a.m. UTC | #11
Hi everyone,

Sorry for the late reply.

> >>>>> The main problem is that it is hard to design a tx_prepare for the bonding device:
> >>>>> 1. as Chas Williams said, there may be two hash calculations to get the target slave
> >>>>>       devices.
> >>>>> 2. more importantly, if the slave devices change (e.g. a slave device
> >>>>>       link goes down or is removed), and the change happens between bond-tx-prepare and
> >>>>>       bond-tx-burst, the output slave changes, and this may lead to checksum
> >>>>>       failures. (Note: the slaves of a bond device may come from different vendors,
> >>>>>       and slave devices may have different requirements, e.g. slave-A calculates the
> >>>>>       IPv4 pseudo-header automatically (no driver pre-calculation needed), but slave-B needs
> >>>>>       the driver to pre-calculate it.)
> >>>>>
> >>>>> The current design covers the above two scenarios by using in-place tx-prepare. In
> >>>>> addition, bond devices are not transparent to applications, so I think this is a
> >>>>> practical way to provide tx-prepare support.
> >>>>>
> >>>>
> >>>>
> >>>> I don't think you need to export an enable/disable routine for the use of
> >>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
> >>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
> >>>> branch in drivers/net/bonding.
> >>>
> >>> Our first patch was just like yours (just add tx-prepare by default), but the community
> >>> was concerned about the performance impact.
> >>>
> >>> As a trade-off, I think we can add the enable/disable API.
> >>
> >> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
> >> performance adversly, that is not a bonding problem. All applications
> >> should be calling rte_eth_dev_tx_prepare. There's no defined API
> >> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
> >> applications should always call rte_eth_dev_tx_prepare. Regardless,
> >> as I previously mentioned, you are just trading the location of
> >> the branch, especially in the bonding case.
> >>
> >> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
> >> should be improved or rewritten. There are PMD that require you to use
> >> that API. Locally, we had maintained a patch to eliminate the use of
> >> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
> >> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
> >> was marginal.
> >>
> >>>
> >>>>
> >>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
> >>>> the following patch locally which I never got around to submitting.
> >>>
> >>> You are right, I will send V3 fix it.
> >>>
> >>>>
> >>>>
> >>>>   From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
> >>>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
> >>>> Date: Tue, 3 May 2022 16:52:37 -0400
> >>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
> >>>>
> >>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
> >>>> packets for transmission. Typically, the prepare step handles the VLAN
> >>>> headers, but it may need to do other things.
> >>>>
> >>>> Signed-off-by: Chas Williams <chwillia@ciena.com>
> >>>
> >>> ...
> >>>
> >>>>                 * ring if transmission fails so the packet isn't lost.
> >>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
> >>>>
> >>>>        /* Transmit burst on each active slave */
> >>>>        for (i = 0; i < num_of_slaves; i++) {
> >>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >>>> +        uint16_t nb_prep;
> >>>> +
> >>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
> >>>>                        bufs, nb_pkts);
> >>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >>>> +                    bufs, nb_prep);
> >>>
> >>> The tx-prepare may edit packet data, and broadcast mode will send a packet to all slaves,
> >>> so the packet data is sent and edited at the same time. Is this likely to cause problems?
> >>
> >> This routine is already broken. You can't just increment the refcount
> >> and send the packet into a PMD's transmit routine. Nothing guarantees
> >> that a transmit routine will not modify the packet. Many PMDs perform an
> >> rte_vlan_insert.
> >
> > Hmm interesting....
> > My understanding was quite the opposite - tx_burst() can't modify packet data and metadata
> > (except when refcnt==1 and tx_burst() is going to free the mbuf and put it back to the mempool).
> > While tx_prepare() can - actually, as I remember, that was one of the reasons why a separate routine
> > was introduced.
> 
> Is that documented anywhere?

I looked through, but couldn't find too much except what was already mentioned by Fengcheng:
rte_eth_tx_prepare() notes:
    * Since this function can modify packet data, provided mbufs must be safely
    * writable (e.g. modified data cannot be in shared segment).
Probably that's not explicit enough, as it doesn't clearly forbid modifying packets in tx_burst.

> It's been my experience that the device PMD
> can do practically anything and you need to protect yourself.  Currently,
> the af_packet, dpaa2, and vhost driver call rte_vlan_insert. Before 2019,
> the virtio driver also used to call rte_vlan_insert during its transmit
> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
> header.
Interesting, usually apps that try to use zero-copy multicast TX keep the packet-header portion
in a separate segment, so it might even keep working. But it definitely doesn't look right to me:
if mbuf->refcnt > 1, I think the mbuf should be treated as read-only.

> Regardless, it looks like rte_eth_dev_tx_prepare should always be
> called.

Again, as I remember, the initial agreement was: if any TX offload is enabled,
tx_prepare() needs to be called (or the user has to implement similar stuff on their own).
If no TX offload flags were specified for the packet, tx_prepare() is not necessary.
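
Under that convention, a caller could gate the prepare call on the burst's
offload flags. A sketch; TX_OFFLOAD_MASK is illustrative, not an official
DPDK constant, and should list the flags the application actually sets:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define TX_OFFLOAD_MASK (RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_L4_MASK | \
			 RTE_MBUF_F_TX_TCP_SEG)

static uint16_t
send_burst_lazy_prepare(uint16_t port_id, uint16_t queue_id,
			struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint64_t ol_flags = 0;
	uint16_t i, nb_prep = nb_pkts;

	/* Collect the offload requests of the whole burst. */
	for (i = 0; i < nb_pkts; i++)
		ol_flags |= pkts[i]->ol_flags;

	/* Only pay for tx_prepare when at least one packet needs it. */
	if (ol_flags & TX_OFFLOAD_MASK)
		nb_prep = rte_eth_tx_prepare(port_id, queue_id,
					     pkts, nb_pkts);

	return rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);
}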

> Handling that correctly in broadcast mode probably means always
> making a deep copy of the packet, or checking whether all the members are
> the same PMD type. If so, you can just call prepare once. You could track
> the mismatched nature during addition/removal of the members. Or just
> assume people aren't going to mismatch bonding members.
> 
> 
> >> You should at least perform a clone of the packet so
> >> that the mbuf headers aren't mangled by each PMD. 

Usually you don't need to clone the whole packet. In many cases it is enough to attach
the l2/l3/l4 header portion of the packet as the first segment.
At least that's how the ip_multicast sample works.
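
A sketch of that header-segment trick, loosely following the sample's
approach. hdr_pool, clone_pool and hdr_len are assumptions, and the
headers are assumed to fit in pkt's first segment:

#include <rte_mbuf.h>
#include <rte_memcpy.h>

/* Partial deep copy: a private, writable first segment holds the
 * l2/l3/l4 headers; the payload stays shared via an indirect clone. */
static struct rte_mbuf *
clone_headers(struct rte_mbuf *pkt, struct rte_mempool *hdr_pool,
	      struct rte_mempool *clone_pool, uint16_t hdr_len)
{
	struct rte_mbuf *hdr, *body;

	/* Indirect clone: shares the packet data, owns its own metadata. */
	body = rte_pktmbuf_clone(pkt, clone_pool);
	if (body == NULL)
		return NULL;

	hdr = rte_pktmbuf_alloc(hdr_pool);
	if (hdr == NULL) {
		rte_pktmbuf_free(body);
		return NULL;
	}

	/* Writable header copy, then drop those bytes from the shared part. */
	rte_memcpy(rte_pktmbuf_mtod(hdr, void *),
		   rte_pktmbuf_mtod(pkt, void *), hdr_len);
	hdr->data_len = hdr_len;
	if (rte_pktmbuf_adj(body, hdr_len) == NULL) {
		rte_pktmbuf_free(hdr);
		rte_pktmbuf_free(body);
		return NULL;
	}

	hdr->next = body;
	hdr->nb_segs = body->nb_segs + 1;
	hdr->pkt_len = hdr_len + body->pkt_len;
	hdr->ol_flags = pkt->ol_flags; /* offload requests travel with hdr */

	return hdr;
}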

> >> Just to be safe you
> >> should perform a partial deep copy of the packet headers in case some
> >> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
> >> not need an rte_vlan_insert.
> >>
> >> So doing a blind rte_eth_dev_tx_prepare isn't making anything much
> >> worse.
> >>
> >>>
> >>>>
> >>>>            if (unlikely(slave_tx_total[i] < nb_pkts))
> >>>>                tx_failed_flag = 1;
  
Chas Williams Sept. 26, 2022, 4:36 p.m. UTC | #12
On 9/26/22 06:18, Konstantin Ananyev wrote:
> 
> Hi everyone,
> 
> Sorry for late reply.
> 
>>>>>>> The main problem is that it is hard to design a tx_prepare for the bonding device:
>>>>>>> 1. as Chas Williams said, there may be two hash calculations to get the target slave
>>>>>>>        devices.
>>>>>>> 2. more importantly, if the slave devices change (e.g. a slave device
>>>>>>>        link goes down or is removed), and the change happens between bond-tx-prepare and
>>>>>>>        bond-tx-burst, the output slave changes, and this may lead to checksum
>>>>>>>        failures. (Note: the slaves of a bond device may come from different vendors,
>>>>>>>        and slave devices may have different requirements, e.g. slave-A calculates the
>>>>>>>        IPv4 pseudo-header automatically (no driver pre-calculation needed), but slave-B needs
>>>>>>>        the driver to pre-calculate it.)
>>>>>>>
>>>>>>> The current design covers the above two scenarios by using in-place tx-prepare. In
>>>>>>> addition, bond devices are not transparent to applications, so I think this is a
>>>>>>> practical way to provide tx-prepare support.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I don't think you need to export an enable/disable routine for the use of
>>>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>>>> branch in drivers/net/bonding.
>>>>>
>>>>> Our first patch was just like yours (just add tx-prepare by default), but the community
>>>>> was concerned about the performance impact.
>>>>>
>>>>> As a trade-off, I think we can add the enable/disable API.
>>>>
>>>> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
>>>> performance adversly, that is not a bonding problem. All applications
>>>> should be calling rte_eth_dev_tx_prepare. There's no defined API
>>>> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
>>>> applications should always call rte_eth_dev_tx_prepare. Regardless,
>>>> as I previously mentioned, you are just trading the location of
>>>> the branch, especially in the bonding case.
>>>>
>>>> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
>>>> should be improved or rewritten. There are PMD that require you to use
>>>> that API. Locally, we had maintained a patch to eliminate the use of
>>>> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
>>>> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
>>>> was marginal.
>>>>
>>>>>
>>>>>>
>>>>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>>>>>> the following patch locally which I never got around to submitting.
>>>>>
>>>>> You are right, I will send V3 fix it.
>>>>>
>>>>>>
>>>>>>
>>>>>>    From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>>>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
>>>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>>>
>>>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>>>> headers, but it may need to do other things.
>>>>>>
>>>>>> Signed-off-by: Chas Williams <chwillia@ciena.com>
>>>>>
>>>>> ...
>>>>>
>>>>>>                  * ring if transmission fails so the packet isn't lost.
>>>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>>>
>>>>>>         /* Transmit burst on each active slave */
>>>>>>         for (i = 0; i < num_of_slaves; i++) {
>>>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +        uint16_t nb_prep;
>>>>>> +
>>>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>>>                         bufs, nb_pkts);
>>>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +                    bufs, nb_prep);
>>>>>
>>>>> The tx-prepare may edit packet data, and broadcast mode will send a packet to all slaves,
>>>>> so the packet data is sent and edited at the same time. Is this likely to cause problems?
>>>>
>>>> This routine is already broken. You can't just increment the refcount
>>>> and send the packet into a PMD's transmit routine. Nothing guarantees
>>>> that a transmit routine will not modify the packet. Many PMDs perform an
>>>> rte_vlan_insert.
>>>
>>> Hmm interesting....
>>> My understanding was quite the opposite - tx_burst() can't modify packet data and metadata
>>> (except when refcnt==1 and tx_burst() is going to free the mbuf and put it back to the mempool).
>>> While tx_prepare() can - actually, as I remember, that was one of the reasons why a separate routine
>>> was introduced.
>>
>> Is that documented anywhere?
> 
> I looked through, but couldn't find too much except what was already mentioned by Fengcheng:
> rte_eth_tx_prepare() notes:
>      * Since this function can modify packet data, provided mbufs must be safely
>      * writable (e.g. modified data cannot be in shared segment).
> Probably that's not explicit enough, as it doesn't clearly forbid modifying packets in tx_burst.

This certainly seems like one of those gray areas in the DPDK APIs. It
should be made clear what is expected as far as behavior.

> 
>> It's been my experience that the device PMD
>> can do practically anything and you need to protect yourself.  Currently,
>> the af_packet, dpaa2, and vhost driver call rte_vlan_insert. Before 2019,
>> the virtio driver also used to call rte_vlan_insert during its transmit
>> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
>> header.
> Interesting, usually apps that try to use zero-copy multicast TX keep the packet-header portion
> in a separate segment, so it might even keep working. But it definitely doesn't look right to me:
> if mbuf->refcnt > 1, I think the mbuf should be treated as read-only.


rte_vlan_insert might be a problem with broadcast mode. If the refcnt is
greater than 1, rte_vlan_insert is going to fail. So, the current broadcast mode
implementation probably doesn't work if any PMD uses rte_vlan_insert.

So again, a solution is to call tx_pkt_prepare once, then increment the
reference count, and send to all the members. That works if your
PMD correctly implements tx_pkt_prepare. If it doesn't and calls rte_vlan_insert
in the transmit routine, that PMD will need to be fixed to work with
bonding.
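
A sketch of that "prepare once, then fan out by refcount" idea. It assumes
all members accept identically prepared packets and that their tx_burst
paths treat shared mbufs as read-only; handling of a packet rejected by
tx_prepare (pkts[nb_prep] and beyond) is left to the caller:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static uint16_t
broadcast_burst(const uint16_t *members, uint16_t nb_members,
		uint16_t queue_id, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t i, p, nb_prep, min_tx;

	if (nb_members == 0)
		return 0;

	/* One prepare pass on behalf of every member. */
	nb_prep = rte_eth_tx_prepare(members[0], queue_id, pkts, nb_pkts);

	/* Each member's transmit will eventually drop one reference. */
	for (p = 0; p < nb_prep; p++)
		rte_pktmbuf_refcnt_update(pkts[p], nb_members - 1);

	min_tx = nb_prep;
	for (i = 0; i < nb_members; i++) {
		uint16_t nb_tx = rte_eth_tx_burst(members[i], queue_id,
						  pkts, nb_prep);
		/* Drop the references a member did not consume. */
		for (p = nb_tx; p < nb_prep; p++)
			rte_pktmbuf_free(pkts[p]);
		if (nb_tx < min_tx)
			min_tx = nb_tx;
	}

	return min_tx;
}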

  
>> Regardless, it looks like rte_eth_dev_tx_prepare should always be
>> called.
> 
> Again, as I remember, the initial agreement was: if any TX offload is enabled,
> tx_prepare() needs to be called (or the user has to implement similar stuff on their own).
> If no TX offload flags were specified for the packet, tx_prepare() is not necessary.

For the bonding driver, we potentially have a mix of PMDs for the members.
It's difficult to know in advance if your packets will have TX offload flags
or not. If you have a tx_pkt_prepare stub, there's a good chance that your
packets will have some TX offload flags. So, calling tx_pkt_prepare is likely
the "best" intermediate solution.

> 
>> Handling that correctly in broadcast mode probably means always
>> making a deep copy of the packet, or checking whether all the members are
>> the same PMD type. If so, you can just call prepare once. You could track
>> the mismatched nature during addition/removal of the members. Or just
>> assume people aren't going to mismatch bonding members.
>>
>>
>>>> You should at least perform a clone of the packet so
>>>> that the mbuf headers aren't mangled by each PMD.
> 
> Usually you don't need to clone the whole packet. In many cases it is enough to attach
> the l2/l3/l4 header portion of the packet as the first segment.
> At least that's how the ip_multicast sample works.

Yes, that's what I meant by deep copying the packet headers. You just copy
enough to modify what you need and keep the bulk of the packet otherwise.

> 
>>>> Just to be safe you
>>>> should perform a partial deep copy of the packet headers in case some
>>>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>>>> not need an rte_vlan_insert.
>>>>
>>>> So doing a blind rte_eth_dev_tx_prepare isn't making anything much
>>>> worse.
>>>>
>>>>>
>>>>>>
>>>>>>             if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>>>                 tx_failed_flag = 1;
  

Patch

diff --git a/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst b/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
index 9510368103..a3d91b2091 100644
--- a/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
+++ b/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
@@ -359,6 +359,11 @@  The link status of a bonded device is dictated by that of its slaves, if all
 slave device link status are down or if all slaves are removed from the link
 bonding device then the link status of the bonding device will go down.
 
+Unlike other PMDs, Tx prepare for the bonding driver is controlled by
+``rte_eth_bond_tx_prepare_set`` (all bonding modes except mode 3 (broadcast)
+are supported). ``rte_eth_bond_tx_prepare_get`` is provided to query whether
+Tx prepare is enabled.
+
 It is also possible to configure / query the configuration of the control
 parameters of a bonded device using the provided APIs
 ``rte_eth_bond_mode_set/ get``, ``rte_eth_bond_primary_set/get``,
diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
index 8c021cf050..6e28a6c0af 100644
--- a/doc/guides/rel_notes/release_22_11.rst
+++ b/doc/guides/rel_notes/release_22_11.rst
@@ -55,6 +55,12 @@  New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added Tx prepare for bonding driver.**
+
+  * Added ``rte_eth_bond_tx_prepare_set`` to enable or disable Tx prepare for a bonded port.
+    All bonding modes except mode 3 (broadcast) are supported.
+  * Added ``rte_eth_bond_tx_prepare_get`` to query whether Tx prepare is enabled for a bonded port.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/bonding/eth_bond_private.h b/drivers/net/bonding/eth_bond_private.h
index 8222e3cd38..9996f6673c 100644
--- a/drivers/net/bonding/eth_bond_private.h
+++ b/drivers/net/bonding/eth_bond_private.h
@@ -117,6 +117,7 @@  struct bond_dev_private {
 	uint16_t user_defined_primary_port;
 	/**< Flag for whether primary port is user defined or not */
 
+	uint8_t tx_prepare_enabled;
 	uint8_t balance_xmit_policy;
 	/**< Transmit policy - l2 / l23 / l34 for operation in balance mode */
 	burst_xmit_hash_t burst_xmit_hash;
diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
index 874aa91a5f..deae2dd9ad 100644
--- a/drivers/net/bonding/rte_eth_bond.h
+++ b/drivers/net/bonding/rte_eth_bond.h
@@ -343,6 +343,30 @@  rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
 int
 rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
 
+/**
+ * Enable or disable Tx prepare (rte_eth_tx_prepare) for a bonded port
+ *
+ * @param bonded_port_id      Bonded device id
+ * @param en                  Enable flag
+ *
+ * @return
+ *   0 on success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_eth_bond_tx_prepare_set(uint16_t bonded_port_id, bool en);
+
+/**
+ * Get whether Tx prepare (rte_eth_tx_prepare) is enabled for a bonded port
+ *
+ * @param bonded_port_id      Bonded device id
+ *
+ * @return
+ *   0-disabled, 1-enabled, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_eth_bond_tx_prepare_get(uint16_t bonded_port_id);
 
 #ifdef __cplusplus
 }
diff --git a/drivers/net/bonding/rte_eth_bond_api.c b/drivers/net/bonding/rte_eth_bond_api.c
index 4ac191c468..47841289f4 100644
--- a/drivers/net/bonding/rte_eth_bond_api.c
+++ b/drivers/net/bonding/rte_eth_bond_api.c
@@ -1070,3 +1070,35 @@  rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id)
 
 	return internals->link_up_delay_ms;
 }
+
+int
+rte_eth_bond_tx_prepare_set(uint16_t bonded_port_id, bool en)
+{
+	struct bond_dev_private *internals;
+
+	if (valid_bonded_port_id(bonded_port_id) != 0)
+		return -1;
+
+	internals = rte_eth_devices[bonded_port_id].data->dev_private;
+	if (internals->mode == BONDING_MODE_BROADCAST) {
+		RTE_BOND_LOG(ERR, "Broadcast mode does not support configuring Tx prepare");
+		return -ENOTSUP;
+	}
+
+	internals->tx_prepare_enabled = en ? 1 : 0;
+
+	return 0;
+}
+
+int
+rte_eth_bond_tx_prepare_get(uint16_t bonded_port_id)
+{
+	struct bond_dev_private *internals;
+
+	if (valid_bonded_port_id(bonded_port_id) != 0)
+		return -1;
+
+	internals = rte_eth_devices[bonded_port_id].data->dev_private;
+
+	return internals->tx_prepare_enabled;
+}
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index 73e6972035..c32c7e6c6c 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -559,6 +559,56 @@  bond_ethdev_rx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	return nb_recv_pkts;
 }
 
+static inline uint16_t
+bond_ethdev_tx_wrap(struct bond_tx_queue *bd_tx_q, uint16_t slave_port_id,
+		    struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	struct bond_dev_private *internals = bd_tx_q->dev_private;
+	uint16_t queue_id = bd_tx_q->queue_id;
+	struct rte_mbuf *fail_pkts[nb_pkts];
+	uint8_t fail_mark[nb_pkts];
+	uint16_t nb_pre, index;
+	uint16_t fail_cnt = 0;
+	int i;
+
+	if (!internals->tx_prepare_enabled)
+		goto tx_burst;
+
+	nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id, tx_pkts, nb_pkts);
+	if (nb_pre == nb_pkts)
+		goto tx_burst;
+
+	fail_pkts[fail_cnt++] = tx_pkts[nb_pre];
+	memset(fail_mark, 0, sizeof(fail_mark));
+	fail_mark[nb_pre] = 1;
+	for (i = nb_pre + 1; i < nb_pkts; /* i advanced in loop body */) {
+		nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id,
+					    tx_pkts + i, nb_pkts - i);
+		if (nb_pre == nb_pkts - i)
+			break;
+		fail_pkts[fail_cnt++] = tx_pkts[i + nb_pre];
+		fail_mark[i + nb_pre] = 1;
+		i += nb_pre + 1;
+	}
+
+	/* move tx-prepare OK mbufs to the end */
+	for (i = index = nb_pkts - 1; i >= 0; i--) {
+		if (!fail_mark[i])
+			tx_pkts[index--] = tx_pkts[i];
+	}
+	/* move tx-prepare failed mbufs to the beginning, and free them */
+	for (i = 0; i < fail_cnt; i++) {
+		tx_pkts[i] = fail_pkts[i];
+		rte_pktmbuf_free(fail_pkts[i]);
+	}
+
+	if (fail_cnt == nb_pkts)
+		return nb_pkts;
+tx_burst:
+	return fail_cnt + rte_eth_tx_burst(slave_port_id, queue_id,
+				tx_pkts + fail_cnt, nb_pkts - fail_cnt);
+}
+
 static uint16_t
 bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 		uint16_t nb_pkts)
@@ -602,7 +652,7 @@  bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 	/* Send packet burst on each slave device */
 	for (i = 0; i < num_of_slaves; i++) {
 		if (slave_nb_pkts[i] > 0) {
-			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+			num_tx_slave = bond_ethdev_tx_wrap(bd_tx_q, slaves[i],
 					slave_bufs[i], slave_nb_pkts[i]);
 
 			/* if tx burst fails move packets to end of bufs */
@@ -635,8 +685,8 @@  bond_ethdev_tx_burst_active_backup(void *queue,
 	if (internals->active_slave_count < 1)
 		return 0;
 
-	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
-			bufs, nb_pkts);
+	return bond_ethdev_tx_wrap(bd_tx_q, internals->current_primary_port,
+				bufs, nb_pkts);
 }
 
 static inline uint16_t
@@ -951,8 +1001,8 @@  bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 		}
 
-		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-				bufs + num_tx_total, nb_pkts - num_tx_total);
+		num_tx_total += bond_ethdev_tx_wrap(bd_tx_q, slaves[i],
+					bufs + num_tx_total, nb_pkts - num_tx_total);
 
 		if (num_tx_total == nb_pkts)
 			break;
@@ -1158,9 +1208,8 @@  tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
 		if (slave_nb_bufs[i] == 0)
 			continue;
 
-		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
-				bd_tx_q->queue_id, slave_bufs[i],
-				slave_nb_bufs[i]);
+		slave_tx_count = bond_ethdev_tx_wrap(bd_tx_q,
+			slave_port_ids[i], slave_bufs[i], slave_nb_bufs[i]);
 
 		total_tx_count += slave_tx_count;
 
diff --git a/drivers/net/bonding/version.map b/drivers/net/bonding/version.map
index 9333923b4e..2c121f2559 100644
--- a/drivers/net/bonding/version.map
+++ b/drivers/net/bonding/version.map
@@ -31,3 +31,8 @@  DPDK_23 {
 
 	local: *;
 };
+
+EXPERIMENTAL {
+	rte_eth_bond_tx_prepare_get;
+	rte_eth_bond_tx_prepare_set;
+};
\ No newline at end of file
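
For application-level context, a sketch of how the new knob would be used
once this patch is applied (bond_port is illustrative and assumed to be a
configured bonded device):

#include <stdbool.h>
#include <rte_ethdev.h>
#include <rte_eth_bond.h>

/* Enable Tx prepare on a bonded port and verify it took effect.
 * rte_eth_bond_tx_prepare_set() fails with -ENOTSUP in broadcast mode. */
static int
enable_bond_tx_prepare(uint16_t bond_port)
{
	int ret;

	ret = rte_eth_bond_tx_prepare_set(bond_port, true);
	if (ret != 0)
		return ret;

	/* rte_eth_bond_tx_prepare_get(): 1 enabled, 0 disabled, <0 error. */
	return rte_eth_bond_tx_prepare_get(bond_port) == 1 ? 0 : -1;
}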